CSV on the Web: An Introduction
Are you working with CSV files but are not sure how to best describe the data within them? Would you like to share your CSV data according to the FAIR data principles? CSV on the Web (CSVW) might be the answer.
Introduction
The CSV on the Web (CSVW) standard is a method for publishing and sharing data held within CSV files. CSV stands for Comma-Separated Values, and CSV files are used in many situations to hold tabular data. They can be exported from Microsoft Excel spreadsheets and are often the format in which Open Datasets are stored.
CSVW expands on the concept of CSV files by also including a second file, the metadata file, which holds information about the table itself and about the variables represented by each column in the table. This additional metadata makes the dataset much easier for others to understand and reuse.
I think that CSVW provides a great opportunity to publish more data as FAIR data. It has the advantage that it uses the CSV format, which everyone who works with data will understand, but it can also be extended to make use of Unique Identifiers, which are required by the FAIR guidelines. It’s worth noting that the CSVW webpage states that the standard is now actively recommended by the UK Government Digital Service.
Why do we need CSVW?
Imagine you find some CSV data on the web. You download the data and open the file. In the dataset is a column which is labelled ‘Temperature’. Immediately there are a number of questions about this column of data:
- What kind of temperature is this - air temperature, surface temperature etc.?
- What is this the temperature of - the air in a room, the surface of a wall etc.?
- What units are being used - Celsius or Fahrenheit?
- How was the temperature recorded? What kind of sensor was used?
- Who carried out the study and how can we find out more information?
To publish Open Data and make datasets understandable for others, we need to answer questions like these. In the worst case, the information above is not provided, which means that the data is almost certainly unusable. In a slightly better case, these questions are answered but the information is placed in a readme.txt file, in an accompanying Word document or on the webpage where the data resides. This is an improvement, but the information is still being shared in an ad-hoc manner, is only readable by humans and not by computers, and is open to misinterpretation.
CSVW solves this problem by providing a standard way of sharing the information that answers questions like those above. This information is termed the metadata of the CSV file: additional information which describes the characteristics of the data in the CSV file.
How does CSVW work?
The basic principle of CSVW is that an additional metadata file is created to accompany the CSV file. So let’s assume that we have a CSV file we wish to share. The CSV file is named ‘my_dataset.csv’. We would then publish the following files to share the dataset:
- ‘my_dataset.csv’ - the file containing the data
- ‘my_dataset.csv-metadata.json’ - the file containing the metadata
The ‘my_dataset.csv-metadata.json’ file is an additional file that we would manually create as part of the process of sharing the dataset. It is written in the JSON format. JSON stands for JavaScript Object Notation and is a file format that is widely used in programming and for sharing information over the internet. JSON files are plain text and can be created using text editors such as Notepad or, even better, editors with JSON support such as Visual Studio Code. The advantages of using JSON are that both humans and computers can read the information stored in a JSON file and that the information can be nested (for example lists within lists), which is very useful for providing descriptions of other things.
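As a rough illustration of the file pair described above, the following Python sketch builds a very small metadata document for ‘my_dataset.csv’ and derives the metadata filename from the CSV filename. The column names ‘timestamp’ and ‘temperature’ and the helper function `metadata_filename` are assumptions made up for this example; a real metadata file would describe the actual columns of your own CSV file.

```python
import json

# Hypothetical minimal metadata for 'my_dataset.csv'.
# The two columns below are invented for illustration only.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "my_dataset.csv",
    "tableSchema": {
        "columns": [
            {"name": "timestamp", "titles": "Timestamp", "datatype": "dateTime"},
            {"name": "temperature", "titles": "Temperature", "datatype": "number"},
        ]
    },
}


def metadata_filename(csv_filename):
    """Return the metadata filename that accompanies a CSV file."""
    return csv_filename + "-metadata.json"


# JSON text that could be saved next to the CSV file.
metadata_text = json.dumps(metadata, indent=2)
print(metadata_filename("my_dataset.csv"))
print(metadata_text)
```

Saving `metadata_text` to a file with the printed name, in the same location as the CSV file, would give the two-file arrangement listed above.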
How do I write a CSVW metadata file?
My suggestion for writing a CSVW metadata file would be:
- Open a JSON text editor such as Visual Studio Code.
- Copy an example as given in the CSV on the Web: A Primer webpage.
- Amend the example to suit the CSV file that you are working with.
The CSVW Primer webpage gives a good introduction to the concepts and many examples of how the metadata files might be structured. The exact rules behind creating the metadata.json files are given in a separate webpage named Metadata Vocabulary for Tabular Data.
An important concept which isn’t immediately apparent when looking at the CSVW standards is that the metadata.json file can actually describe different objects. It could be used to describe the structure of a group of tables (a TableGroup object). But it can also be used to describe just a single table (a Table object). Or it can describe an object such as a DialectDescription which can’t be used on its own but which might be referred to by a Table object. When looking at the examples in the CSVW Primer webpage it is worth taking a moment to work out which type of object they are describing.
For example, the metadata file shown in Example 5 describes a TableGroup object because it contains a ‘tables’ property. The other properties allowed on a TableGroup object are listed in Section 5.3 of the Metadata Vocabulary standard.
In contrast, Example 7 of the CSVW Primer describes a Table object as it includes the ‘url’ and ‘tableSchema’ properties. The details of the Table object are given in Section 5.4 of the Metadata Vocabulary standard.
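The distinction between the two object types can be sketched in a few lines of Python. This is only a rough heuristic based on the properties mentioned above (‘tables’ for a TableGroup, ‘url’ for a Table); the function name `describe_top_level` and the two sample dictionaries are invented for this example and are not taken from the Primer.

```python
def describe_top_level(metadata):
    """Guess which kind of object a parsed metadata document describes.

    Heuristic only: a 'tables' property suggests a TableGroup,
    a 'url' property suggests a single Table.
    """
    if "tables" in metadata:
        return "TableGroup"
    if "url" in metadata:
        return "Table"
    return "other"


# Invented sample documents, already parsed from JSON into dictionaries.
group_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [{"url": "first.csv"}, {"url": "second.csv"}],
}
table_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "my_dataset.csv",
    "tableSchema": {"columns": []},
}

print(describe_top_level(group_metadata))
print(describe_top_level(table_metadata))
```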
Three levels of detail for CSVW metadata files
Having spent some time looking at CSVW files, it strikes me that there are three different levels of detail which can be considered when creating metadata.json files. The levels build upon each other, with Level 1 being the starting point.
Level 1: Descriptive Metadata
Here the metadata.json file contains good descriptions of the overall CSV dataset and of the individual columns of the data. This might include text descriptions of how the data was collected, who collected the data, what the data in each column represents and what the units of each column are. In essence this takes the information which is often provided in readme.txt files, accompanying Word documents or dataset webpages and places it inside the metadata.json file. This is a very useful thing to do, as the dataset metadata is now described in a standard manner and can be easily read by computers. It means, for example, that I can create a list of the data column titles and descriptions in a few lines of Python code, rather than having to retype or copy this information out of a Word document or webpage.
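The "few lines of Python" might look something like the sketch below. It assumes a Table-level metadata document in which each column has a plain-string ‘titles’ value and a ‘dc:description’ property; the metadata shown is invented for illustration and real files may structure these values differently (for instance, ‘titles’ can also be a list).

```python
import json

# Invented Level 1 metadata; in practice this text would be read from
# the 'my_dataset.csv-metadata.json' file.
metadata_text = """
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "my_dataset.csv",
  "tableSchema": {
    "columns": [
      {"name": "timestamp", "titles": "Timestamp",
       "dc:description": "Date and time of the reading"},
      {"name": "temperature", "titles": "Temperature",
       "dc:description": "Air temperature of the room in degrees Celsius"}
    ]
  }
}
"""

metadata = json.loads(metadata_text)

# Build a list of (title, description) pairs, one per column.
summary = [
    (column.get("titles"), column.get("dc:description"))
    for column in metadata["tableSchema"]["columns"]
]

for title, description in summary:
    print(f"{title}: {description}")
```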
Level 2: Metadata with Unique Identifiers
The findability criteria of the FAIR data guidelines recommend using Unique Identifiers when describing both metadata and the data itself. In Level 2 this is achieved by using Unique Identifiers where possible rather than descriptions in the metadata file. For example, the units used by the data within a column might be described using a Unique Identifier from the QUDT Vocabulary rather than a descriptive string (for degrees Celsius, using the URI http://qudt.org/vocab/unit/DEG_C rather than a string such as ‘degrees C’ or ‘C’).
It is also worth noting that the CSVW standard states that any additional properties on, for example, a Table object come under the category of what it calls common properties. The standard states that these common properties can only be provided using URIs. In effect this means that when describing a TableGroup, Table, Column etc., if you wish to include additional properties, their names must be given as URIs. So for example a property named ‘description’ is not allowed, but one named ‘dc:description’ (from the Dublin Core vocabulary) is allowed.
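A rough way to picture this rule is a check on the names of any additional properties. The sketch below is a simplification made for this post: the function `is_valid_common_property_name` is hypothetical and only tests the shape of the name (a prefixed name or a full URL), not whether the prefix is one that the standard recognises. The use of ‘schema:unitCode’ to carry the QUDT unit URI is likewise just one illustrative choice, not something prescribed above.

```python
def is_valid_common_property_name(name):
    """Rough shape check: accept full URLs or prefixed names like 'dc:description'."""
    if name.startswith(("http://", "https://")):
        return True
    prefix, separator, local_part = name.partition(":")
    return bool(prefix) and bool(separator) and bool(local_part)


# Invented Level 2 column description. 'schema:unitCode' is an
# illustrative assumption for where the unit URI might go.
temperature_column = {
    "name": "temperature",
    "titles": "Temperature",
    "datatype": "number",
    "dc:description": "Air temperature of the room",
    "schema:unitCode": "http://qudt.org/vocab/unit/DEG_C",
}

# Properties defined by the standard itself, as used in this example.
BUILT_IN = {"name", "titles", "datatype"}

additional = [key for key in temperature_column if key not in BUILT_IN]
for key in additional:
    print(key, is_valid_common_property_name(key))

print("description", is_valid_common_property_name("description"))
```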
Level 3: Metadata which converts to an OWL vocabulary
The FAIR interoperability criteria also require the use of ‘common vocabularies’ when describing data and metadata. Perhaps the ultimate way to achieve this, as stated in the FAIR guidelines, is through using RDF and OWL vocabularies. For example, if you have a series of measurements made by a sensor, this could be described using the SOSA vocabulary and the data stored as RDF. However, there are two challenges with this approach. Firstly, not many people understand the RDF data format. Secondly, if you have many sensor readings (say 10,000+) then the RDF dataset becomes very large and unmanageable. CSVW solves both these issues by storing the data in a format which many people understand and in a more compact form which uses less memory.
However, it turns out that there are procedures for converting a CSVW dataset into an RDF dataset, through including properties such as ‘aboutUrl’ and ‘propertyUrl’ in the metadata file. This means that there is the potential to create a CSVW file which, if converted to RDF, would be valid according to an OWL ontology such as SOSA. In this situation I think the CSVW dataset would then be meeting many if not all of the FAIR interoperability criteria.
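To give a feel for how such a conversion works, here is a much simplified Python sketch. It is not the official conversion procedure: it assumes a tiny invented CSV file, an invented example.org template for the row subject (playing the role of ‘aboutUrl’, as the property is spelled in the Metadata Vocabulary), and a single predicate per column (playing the role of ‘propertyUrl’), here the SOSA property hasSimpleResult. The template expansion is a plain string replacement rather than a full URI template implementation.

```python
import csv
import io

# Invented CSV content standing in for 'my_dataset.csv'.
csv_text = "id,temperature\n1,21.5\n2,22.0\n"

# Assumed values that would normally come from the metadata file.
about_url = "http://example.org/observation/{id}"
property_urls = {"temperature": "http://www.w3.org/ns/sosa/hasSimpleResult"}


def expand(template, row):
    """Replace each {column} placeholder with that row's cell value."""
    for column, value in row.items():
        template = template.replace("{" + column + "}", value)
    return template


def rows_to_triples(text):
    """Turn each row into (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = expand(about_url, row)
        for column, predicate in property_urls.items():
            triples.append((subject, predicate, row[column]))
    return triples


triples = rows_to_triples(csv_text)
for triple in triples:
    print(triple)
```

Each row produces one subject URI, and each described column contributes one statement about that subject, which is the basic idea behind mapping a table onto RDF.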
Interestingly, if such a dataset were created, i.e. a CSVW dataset which could be converted to RDF which aligned with an OWL ontology, there would be no actual need to do this conversion. The fact that it could be converted might be sufficient, and doing the conversion itself might make no immediate sense. It would probably be easier to analyse the data from the CSV file directly (e.g. using Python and Pandas) rather than analysing the data in RDF. In addition, there is the advantage that a dataset structured in this way would also be understandable and usable by others who do not know about RDF and OWL, as they can simply ignore these sections in the metadata file.
Summary
In summary, CSVW has great potential for Open Data and FAIR data publishing. It has the advantages of being understandable for users with different levels of experience and is flexible enough to support simple metadata descriptions through to more complex use of Unique Identifiers and OWL ontologies. The CSVW standards seem robust and well developed, and able to describe a wide variety of data.
It is perhaps surprising that CSVW isn’t better known and that it doesn’t seem to be used very much at the moment. I can only assume that this will change as the UK Government and other organisations start to publish more of their datasets using CSVW.
Further information
- This YouTube video from the Open Data Institute is a great introduction to CSVW. Also available as an Apple Podcast here.
- The official CSVW website.
- The CSVW Primer W3C document.
- The Metadata Vocabulary for Tabular Data W3C document.