CSV on the Web: Creating descriptive metadata files
This post looks at creating CSV on the Web (CSVW) metadata files with simple descriptions.
Introduction
CSVW involves having the data of a dataset in a CSV file and the metadata about the data in a JSON file. This metadata might include information such as a description of the dataset, who collected the data and how it was collected. It can also include information on the columns in the CSV file (i.e. the variables of the data) such as a description of the variable, the units used and what datatype is being used (string, number etc.).
In this post we look at including such information in the metadata.json file using simple descriptions. This is the level of detail ‘Level 1’ as discussed in my previous blog post CSV on the Web: An Introduction. For many existing datasets this information will already exist but it may be stored in a non-standard format which is difficult for computers to read, such as a readme.txt file, a Word document or on a webpage.
What is a descriptive metadata file?
Here I’m defining a descriptive metadata file as one which the information is largely provided through descriptions using text strings. It will make use of the existing properties available to CSVW metadata objects where possible and, if any extra properties are required, it will use well-known vocabularies drawing on the examples in the CSVW Primer in the first instance. The aim is to get the descriptive information for a CSV file into the metadata.json file in the simplest manner possible.
A basic metadata file
Let’s consider an example which is given the the CSVW Primer. Example 6 shows a CSV file about countries, their names in different languages and their latitude and longitude. Let’s assume this CSV file is saved somewhere with the name ‘countries.csv’.
The basic metadata.json file for this CSV file is given in Example 7.
There are a number of things to note about this metadata file:
- This is a metadata file describing a Table object (i.e. a description of a single table). As such it is following the rules from Section 5.4 of the Metadata Vocabulary.
- The file also contains a Schema object which is the value of the
tableSchema
property. Schema objects are described in Section 5.5 of the Metadata Vocabulary. - The file also contains seven Column objects where each one is a description of an individual column in the CSV file. These are provided as a list in the
columns
property of the Schema object. The rules for Column objects are given in Section 5.6 of the Metadata Vocabulary.
Looking at the properties of the Table object:
- The
@context
property tells us that this is a CSVW metadata file and is always used in the first object in the metadata.json file. This is a Top Level Property as described in Section 5.2 of the Metadata Vocabulary. - The
url
property points to the location of the CSV file which this metadata document is describing. Assuming that the CSV file and the metadata.json file are stored in the same location, this can then be the name of the CSV file. - The
tableSchema
property holds the Schema object for the table. Schema objects describe the structure of a table. They are separate from the Table object as then the same schema might be reused to describe different tables of data.
Looking at the properties of the Schema object:
- The
columns
property provides a list of Column objects. Note that the order of the Column objects in the list is important. This order should exactly match the order of the columns in the CSV file.
Looking at the properties of the Column object:
- The
titles
property provides one or more titles for the column. If the CSV file includes a header row, then the header used for the column should match one of the titles given in this property.
How to start to edit a metadata.json file
The CSVW metadata files are in JSON format. These can be edited with a basic text editor such as Notepad. But it is much easier to use a dedicated JSON editor such as Visual Studio Code. You can find JSON editor software by googling ‘JSON editor’ and there are options for desktop applications, Chrome extensions and websites. The advantages of using a JSON editor is that is will highlight if the syntax of the JSON file is not correct, it will autocomplete closing parenthesis for objects and lists, and it will indent the file to make it easier to read.
Here I use Visual Studio Code to edit the metadata.json file of Example 7 to include @type
properties for the Table, Schema and Column objects. This isn’t strictly speaking necessary, but I thinking this helps to make clearer the objects which are described by the metadata file.
Adding descriptions about the table of data (the dataset)
Let’s say we want to include a description of the CSV file itself in the metadata. This might include what the data is describing, who collected the data and how it was collected. The CSVW Primer suggests that this can be done by adding additional properties to the Table object. These additional properties are defined as Common Properties, as described in Section 5.8 of the Metadata vocabulary. There is no restriction on what the additional properties are, but the property names have to be prefixed names, in the syntax prefix:name
.
Here we are using the methods given in the CSVW primer where possible. It suggests two possible options for property names for table descriptions as shown in Example 8 and Example 9. These are property names with the prefix dc
(the Dublin Core vocabulary) or the schema
prefix (the schema.org vocabulary).
Let’s take a moment to consider what these prefix:name
property names represent. In the case of the dc
property names these are referring to the Dublin Core vocabulary. This is a well known vocabulary of terms which anyone can use in their data and metadata. The idea behind such vocabularies is that they provide a stable and unique way of describing things. If some uses these terms to describe their data, it is possible for others to understand what is meant. The dc
terms are exactly defined so there is less chance of misinterpretation.
The dc
terms are described on the Dublin Core website. Example 8 of the CSVW primers uses the terms dc:title
, dc:description
and dc:creator
. The definitions of these terms, as provided on the Dublin Core website, are as follows:
dc:title
- see https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/elements/1.1/titledc:description
- see https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/elements/1.1/descriptiondc:creator
- see https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/elements/1.1/creator
Let’s extend Example 6 above by using these terms to describe the table of data:
Adding descriptions about columns
This can be done in the same method as for Table objects. Example 10 in the CSVW primer shows how the dc:description
property is used to provide descriptions for each individual column. As for Table objects, this is classed as a Common Property which in this case is part of the Column objects. Let’s include these descriptions in our example metadata file:
Adding datatypes to columns
It is very useful to know what type of data a column should contain. This allows us to check that the correct datatype is being used when adding new data. It also allows us to read the data into analysis software using the correct datatype. Datatypes can be added to Column objects using the datatype
property, which is an Inherited Property as described in Section 5.7 of the Metadata Vocabulary. Inherited properties can be applied to Column, Schema, Table and TableGroup objects, and the basic idea is that an inherited property applied to, say, a table will then also apply to all columns in that table (unless overridden by an additional inherited property of the same type on an individual column within the table).
There are two options for the values of the datatype
property:
- A predetermined string value such as "string", "number", "date" etc.
- A JSON object such as {“base”: “integer”, “minimum”: 1, “maximum”: 5}. This is taken from Example 19 of the CSVW primer and tells us that this datatype is an integer and its value must be between 1 and 5 inclusive.
Further information on datatypes and the meaning of the terms ‘base’, ‘minimum’, ‘maximum’ etc. is given in Section 5.11 of the Metadata Vocabulary.
The CSVW primer gives many examples of using datatypes (for example see Section 3). Here I’ve gathered these all together to extend our example metadata file:
This metadata.json file now achieves the goal of this blog post to create a descriptive metadata file. It includes information about the table of data and information about each individual column.
Summary
This blog post has shown how to create a metadata.json file for a CSVW dataset. It has shown how to describe the table of data itself and the individual columns within the table. The Dublin Core vocabulary is used to provide descriptions based on the dc
prefix. The types of data expected in each column in the CSV file is also described using the datatype
property.
This, I believe, is a great point to aim for, for anyone who is working with CSV data and who would like to share the data with the wider world.
Further information
- The metadata.json files created in this post are available to view on GitHub here.
- My blog posts on CSVW
- The CSVW Primer
- Metadata Vocabulary for Tabular Data