CSV on the Web: Working with units of measure
This post looks at how to include units (such as metres, kilograms or degrees Celsius) in CSVW metadata files.
Introduction
CSV on the Web (CSVW) is a W3C standard for describing the data stored in CSV files using an additional metadata.json file. It is designed to be flexible and suitable for many different types of data.
Many CSV data files include columns which are measured using a unit of measure, such as metres, kilograms or degrees Celsius. It’s important to be able to include the information about the unit of measure in the CSVW metadata.json file, so that others can understand and reuse the dataset.
However, as stated in the CSVW Primer, the CSVW standard has ‘no native support for expressing the units of measure for a particular column’. It’s not clear why this is: perhaps units of measure go beyond the scope of what the designers of the standard had in mind, or perhaps it was too difficult to commit to a single method of expressing units at the time.
Fortunately, the CSVW Primer does give a number of examples of how to include units, in Section 6.1 ‘How do you support units of measure?’. This blog post looks through these examples and discusses the options for including units in CSVW datasets.
Option 1: Informal descriptions
Let’s assume we have the following CSV file which is named ‘bike_journeys.csv’:
This dataset describes the distance of three different bike journeys. The ‘journey_id’ column is an identifier for each journey and the ‘distance’ column is the length of each journey measured in kilometres.
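As a minimal sketch of what such a file could contain, the Python below parses a small CSV with the two columns described above. The column names come from the text, but the three rows are made-up illustrative values, not the original data.

```python
import csv
import io

# Hypothetical contents for bike_journeys.csv. The column names match the
# description above; the three rows are illustrative values only.
BIKE_JOURNEYS_CSV = """journey_id,distance
1,12.5
2,8.0
3,21.75
"""


def read_journeys(text):
    """Parse the CSV text into a list of dicts, converting distance to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"journey_id": row["journey_id"], "distance": float(row["distance"])}
        for row in reader
    ]


journeys = read_journeys(BIKE_JOURNEYS_CSV)
for journey in journeys:
    print(journey["journey_id"], journey["distance"])
```

Nothing in the CSV itself says that the distance values are in kilometres, which is the gap the metadata file has to fill.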
Based on Example 99 of the CSVW Primer, the first option for including the units of measure is to use an ‘informal description’ within the column. The metadata.json file would then be:
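A minimal sketch of such a metadata file is given below, built as a Python dictionary and serialised to JSON. It follows the general shape of a CSVW table description; the wording of the description and the choice of datatypes are assumptions made for illustration.

```python
import json

# Sketch of a CSVW metadata document for bike_journeys.csv.
# The description text and the datatypes are assumptions.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "bike_journeys.csv",
    "tableSchema": {
        "columns": [
            {"name": "journey_id", "titles": "journey_id", "datatype": "string"},
            {
                "name": "distance",
                "titles": "distance",
                "datatype": "number",
                # Informal, human-readable statement of the units
                "dc:description": "Distance of the journey in kilometres",
            },
        ]
    },
}

# This string is what would be saved as the metadata.json file
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```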
Here the units of measure (kilometres) are stated in a description of the ‘distance’ column using the dc:description property. This is good, as the information about the units is now contained in the metadata file rather than, say, in a hard-to-read readme.txt file, Word document or webpage. However, there are two limitations to this approach:
- The information about the units of measure (kilometres) is provided in a descriptive, informal manner. This means that a human reading the description may understand it, but a machine reading the description would not. If I were writing a computer program to analyse the dataset and wanted to use the units of the distance column, it is likely that I would need to write out (i.e. hard-code) the ‘kilometres’ units as an input to the program.
- The information about the units of measure is provided as a text string, i.e. “kilometres”. Again, a human reading this will understand what is meant but a computer may not. The problem arises because there are other ways to express the same units, such as “kilometre” or “km”, and there is the potential for different spellings, such as “kilometers”. A more precise approach is to use a Uniform Resource Identifier (URI) to represent the concept of kilometres.
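The second limitation can be illustrated with a short Python sketch. The lookup table is hypothetical and written by hand; it only shows that several spellings have to be reconciled before a program can treat them as the same unit, whereas a single URI needs no such step.

```python
# Free-text unit labels that a human would read as the same unit
labels = ["kilometres", "kilometre", "km", "kilometers"]

# A naive program that recognises only one spelling misses the others
recognised = [label for label in labels if label == "kilometres"]
print(recognised)

# Hypothetical hand-written table mapping each known spelling to one URI
KILOMETRE_URI = "http://qudt.org/vocab/unit/KiloM"
LABEL_TO_URI = {label: KILOMETRE_URI for label in labels}


def unit_uri(label):
    """Return the URI for a free-text unit label, or None if it is unknown."""
    return LABEL_TO_URI.get(label.strip().lower())


print(unit_uri("km"))
print(unit_uri("miles"))
```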
Option 2: Using Uniform Resource Identifiers
Building on Example 100 of the CSVW Primer, we can revise the metadata.json file using ‘an existing units-of-measure property and vocabulary’ and express the units of measure using Uniform Resource Identifiers (URIs):
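A sketch of the revised metadata is given below, again as a Python dictionary serialised to JSON. The property name and the kilometres URI are the ones discussed in this post; the description text and datatypes are assumptions carried over for illustration.

```python
import json

# Property name used to attach a unit to a column
UNIT_MEASURE = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "bike_journeys.csv",
    "tableSchema": {
        "columns": [
            {"name": "journey_id", "titles": "journey_id", "datatype": "string"},
            {
                "name": "distance",
                "titles": "distance",
                "datatype": "number",
                # Informal description kept from Option 1 (wording assumed)
                "dc:description": "Distance of the journey in kilometres",
                # New property: the unit expressed as a URI
                UNIT_MEASURE: {"@id": "http://qudt.org/vocab/unit/KiloM"},
            },
        ]
    },
}

metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```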
This is the same metadata file as for the informal description approach (Option 1) with a new property included in the ‘distance’ Column object.
The new property name
The new property has a property name of http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure. This is a URI, as it represents a unique identifier for a particular concept. Web addresses starting with http:// can generally be thought of as globally unique as they are controlled by the person who owns the web domain. Often these kinds of URIs will resolve to a web page. You can check this by pasting the URI into a web browser. Doing this for the above URI will display a page of text: this is an OWL document describing the ‘sdmx-attribute’ vocabulary (the document is written in the Turtle syntax for RDF). Scrolling down towards the end of the text document we can see the following text:
This describes various features and characteristics of the URI including a description of what it represents, i.e. “The unit in which the data values are measured”. For this example we don’t need to understand all of the details in the sdmx-attribute vocabulary, it is enough to know what this URI represents and that we can use it for describing units in CSVW files.
How did we know to use this particular URI as the property name for units in the Column description? The answer is that there are probably many different URIs that could be used in this case, as many different URI vocabularies exist and have been published. However, since this is the example given in the CSVW Primer, it is reasonable to assume that others are likely to use it too, so it makes sense for us to use it in the first instance as well. Ultimately these decisions are made by the community of researchers working with this type of data: if a convention is established by the discipline then researchers will use it when publishing their data.
The new property value
The new property has a value of {"@id": "http://qudt.org/vocab/unit/KiloM"}. This is a JSON object, and the identifier for the object, given by the @id property, is the URI http://qudt.org/vocab/unit/KiloM. This URI represents the concept of the unit of measure kilometres. Again, if we put this URI into a web browser we find another OWL vocabulary, this time the QUDT units vocabulary. Within this document we can see the definition of the kilometres URI:
As before, we don’t need to understand all the details here. The important point is that the http://qudt.org/vocab/unit/KiloM URI is a unique way of referring to the concept of kilometres. By using it we greatly reduce the chance of misunderstanding or misinterpretation. We can also write computer programs that work with such URIs automatically, which might greatly help to automate data analysis tasks.
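As a rough sketch of what such a program might look like, the Python below reads the unit URI from a Column object and uses it to convert distances to metres. The conversion table is hand-written for illustration and is an assumption, as are the helper names column_unit and to_metres and the sample values.

```python
UNIT_MEASURE = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"

# Hand-written, illustrative table of metres per unit, keyed by unit URI
METRES_PER_UNIT = {
    "http://qudt.org/vocab/unit/KiloM": 1000.0,
    "http://qudt.org/vocab/unit/M": 1.0,
}


def column_unit(column):
    """Return the unit URI declared on a CSVW Column object, or None."""
    value = column.get(UNIT_MEASURE)
    if isinstance(value, dict):
        return value.get("@id")
    return None


def to_metres(values, column):
    """Convert values to metres using the unit URI declared on the column."""
    unit = column_unit(column)
    if unit not in METRES_PER_UNIT:
        raise ValueError(f"No conversion known for unit {unit!r}")
    factor = METRES_PER_UNIT[unit]
    return [value * factor for value in values]


distance_column = {
    "name": "distance",
    UNIT_MEASURE: {"@id": "http://qudt.org/vocab/unit/KiloM"},
}

# Illustrative distances in kilometres
distances_m = to_metres([12.5, 8.0, 21.75], distance_column)
print(distances_m)
```

The program never needs to interpret the word “kilometres”; it only compares URIs, so there is no spelling or abbreviation to get wrong.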
Why do we use the QUDT vocabulary here? Again, there is no specific reason, and other units of measure vocabularies exist. However, as QUDT is used as the example in the CSVW Primer, I think it makes sense to use it in this case. If you want to see the other units available in the QUDT vocabulary, these are available to view here: http://www.qudt.org/doc/DOC_VOCAB-UNITS.html
Further examples using Uniform Resource Identifiers
Degrees Celsius
A Column object in a metadata.json file which describes a column of room air temperatures recorded in degrees Celsius might look like this:
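A sketch of such a Column object is given below as a Python dictionary serialised to JSON. The column name and description are hypothetical, and the unit URI (http://qudt.org/vocab/unit/DEG_C) is my reading of the QUDT units vocabulary, so it is worth checking against the QUDT page before reuse.

```python
import json

UNIT_MEASURE = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"

temperature_column = {
    "name": "air_temperature",  # hypothetical column name
    "titles": "air_temperature",
    "datatype": "number",
    "dc:description": "Room air temperature in degrees Celsius",
    # Assumed QUDT URI for degrees Celsius; verify against the vocabulary
    UNIT_MEASURE: {"@id": "http://qudt.org/vocab/unit/DEG_C"},
}

temperature_json = json.dumps(temperature_column, indent=2)
print(temperature_json)
```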
How did I know what the URI for degrees Celsius was? I went to the QUDT units webpage and scanned the page until I found the URI I needed.
Percentage Relative Humidity
A Column object in a metadata.json file which describes a column of percentage relative humidity measurements recorded in a room might look like this:
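A sketch of such a Column object is given below, this time built with a small helper function. The helper measured_column, the column name and the description are hypothetical, and the unit URI (http://qudt.org/vocab/unit/PERCENT_RH) is my reading of the QUDT units vocabulary, so it should be checked against the QUDT page before reuse.

```python
import json

UNIT_MEASURE = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"


def measured_column(name, description, unit_uri):
    """Build a CSVW Column object with both a description and a unit URI."""
    return {
        "name": name,
        "titles": name,
        "datatype": "number",
        "dc:description": description,
        UNIT_MEASURE: {"@id": unit_uri},
    }


humidity_column = measured_column(
    "relative_humidity",  # hypothetical column name
    "Relative humidity of the room air in percent",
    # Assumed QUDT URI for percent relative humidity; verify before use
    "http://qudt.org/vocab/unit/PERCENT_RH",
)

humidity_json = json.dumps(humidity_column, indent=2)
print(humidity_json)
```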
Interestingly, although percentage relative humidity has dimensionless units (as it is a percentage) there is still a dedicated unit for the concept in the QUDT units vocabulary.
Summary
This blog post describes two methods for including units of measure in Column objects of CSVW metadata.json files: i) Informal Descriptions; and ii) Using Uniform Resource Identifiers. Both methods are valid, and the URI approach builds on the Informal Descriptions approach.
I would recommend including both in CSVW files. This way, if another user is unfamiliar with URIs, they will still understand the units being used from the informal descriptions. Including URIs as well means that other users, familiar with URIs and computer programming, will find it easier to analyse the data. Using URIs will also bring the CSVW dataset closer to the concept of FAIR data, which recommends the use of unique identifiers for both data and metadata.
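One way to follow this recommendation consistently is to check the metadata automatically. The Python below is a minimal sketch of such a check: the function unit_problems is hypothetical, and treating every numeric column as a measured quantity is a simplifying assumption (a column of counts, for example, may not need a unit).

```python
UNIT_MEASURE = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"

# Datatype names treated as numeric in this sketch (a simplification)
NUMERIC_DATATYPES = {"number", "decimal", "integer", "double", "float"}


def unit_problems(metadata):
    """List (column name, problem) pairs for numeric columns lacking units."""
    problems = []
    for column in metadata.get("tableSchema", {}).get("columns", []):
        datatype = column.get("datatype")
        # A CSVW datatype may also be given as an object with a "base" key
        if isinstance(datatype, dict):
            datatype = datatype.get("base")
        if datatype not in NUMERIC_DATATYPES:
            continue
        if "dc:description" not in column:
            problems.append((column["name"], "no informal description"))
        if UNIT_MEASURE not in column:
            problems.append((column["name"], "no unit URI"))
    return problems


metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "bike_journeys.csv",
    "tableSchema": {
        "columns": [
            {"name": "journey_id", "datatype": "string"},
            {
                "name": "distance",
                "datatype": "number",
                "dc:description": "Distance of the journey in kilometres",
            },
        ]
    },
}

print(unit_problems(metadata))
```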
A final observation is that the CSVW Primer also discusses a third method of including units of measure, in Section 6.1.1 ‘Supporting units of measure by transforming to structured values’. If you are interested in transforming your CSVW data into RDF data, perhaps structured according to an OWL ontology, then this section outlines a method to do this.