10 tips for publishing Open Data
In this post I give my top 10 tips for publishing Open Data datasets to make them understandable and reusable for other researchers.
How should I publish my data? What file formats should I use? What information should I include in the readme file? How do I share complex information that won't fit into my tables?
These are all questions faced by anyone thinking of publishing an Open Data dataset. There is no single right answer or solution, as all datasets differ in their size, measurements, variables and purpose. However, there are perhaps some overriding principles in creating and publishing Open Data, including:
- We want other people to be able to easily understand and reuse the data in their own research.
- We need to include sufficient information in the dataset to enable others to reuse the data with confidence, including information such as how the data was collected and what the measurements represent.
- We should make the data itself available in easy-to-access file formats and structures, so that others can easily incorporate it into their own analyses. In particular, we should make it easy for programming scripts to access the data.
With these high-level principles in mind, here are my top 10 tips on publishing Open Data:
1. Don't use Excel
Excel is a great tool for prototyping simple models and doing calculations on small amounts of data. It is my go-to tool when I have a new analysis idea I want to try out quickly, or for small data analysis tasks.
However, as a method of sharing data with others it is not really suitable. Firstly, it is a commercial tool and uses a proprietary file format; this limits access to the data, as not everyone will have access to the Excel software. Secondly, data stored in an Excel file is difficult to access; if the analysis is being done with a programming script, extracting the data takes additional effort and time.
Instead, I would recommend using a CSV file to store the data. The CSV file format is text-based, so any computer will be able to open it, and it is easy for programming scripts to work with CSV files.
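As a minimal sketch of why this matters for script-based analysis (the file name and column name below are hypothetical), a CSV file can be loaded in a couple of lines with the pandas library:

```python
import pandas as pd

# Read a published CSV data file into a DataFrame
# (file name and column name are hypothetical).
df = pd.read_csv("temperatures.csv")

# The data is immediately available for analysis, e.g. summary statistics.
print(df["temperature"].describe())
```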
2. Don't use Word
Word is also not suitable for sharing data. Like Excel, it uses a proprietary file format that not everyone will have access to. And more importantly, any data stored in a Word document will be very difficult to extract using programming scripts.
For example, I have seen cases where tables of data were shared as tables inside a Word document. Whilst this may be easy for me to read, it makes the analysis much harder as I need to perform several steps to extract the data from the Word document into a format I can use in my analysis programming.
Instead, a text-based file format would be much better, such as a CSV file for tabular data.
3. Avoid putting data in the documentation or the readme file
Let's say we have a questionnaire, and one question in that questionnaire was asked of 1,000 people, with ten different response options. Each person had to choose one of the ten response options as their answer.
The responses of the 1,000 people are shared, perhaps in a CSV file. This is then easy to access and work with. However, the ten response options to the question are published as part of a large Word document describing the questionnaire, or in a readme file accompanying the dataset. If I wish to create a bar graph showing all response options and the frequency of responses, this is difficult, as I have to extract the ten response options from the Word document or readme file. For a single question this might not take long, but it quickly becomes tedious for multiple cases.
Instead it is much better to place the original question and the ten response options in a second data file. We then have two data files: one about the questionnaire and response options; and one about the responses of people who answered the questionnaire.
It may be difficult to store information about questions and response options in a CSV file, as different questions will have different numbers of response options. In such cases a nested, tree-like file format might be suitable. JSON is a good, text-based file format that can work well here.
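As a sketch of what this could look like (the question text, identifiers and options below are all hypothetical), the questionnaire description could be written out as a JSON file alongside the CSV of responses:

```python
import json

# Hypothetical description of one questionnaire question and its response options.
questionnaire = {
    "questions": [
        {
            "id": "Q1",
            "text": "How do you usually travel to work?",
            "response_options": [
                {"code": 1, "label": "Car"},
                {"code": 2, "label": "Bus"},
                {"code": 3, "label": "Train"},
                {"code": 4, "label": "Bicycle"},
                {"code": 5, "label": "Walk"},
                # ...further options, up to the ten used in the questionnaire
            ],
        }
    ]
}

# Write the questionnaire description as a second data file,
# separate from the CSV file of individual responses.
with open("questionnaire.json", "w") as f:
    json.dump(questionnaire, f, indent=2)
```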
4. Fully describe variables
We need to fully describe what is being reported on in a dataset for others to be able to reuse the data with confidence. If a CSV file contains data with the header 'Temperature' then this leads to uncertainty about the measurement variable. What kind of temperature? What are the units? What is having its temperature measured? How was the temperature measured?
The Semantic Sensor Network Ontology is one example of a framework which attempts to capture this information about variables. For each observation, the following information can also be included:
- The results of the observation, including the measured value and its units.
- The time when the observation was taken.
- The 'Feature of Interest' - i.e. the object that this measurement is based on, for example a person, a building etc.
- The 'Observable Property' - i.e. the property or quantity being measured, such as air temperature, distance etc.
- The sensor which was used to make the measurement.
- The procedure which was used to make the measurement.
I'm not suggesting that you use this particular ontology or include all of this information. The wider point is that there is a lot of information we can provide about measurement variables and how observations are taken. Providing this additional information, in machine-readable file formats, will greatly enhance the ability of others to make use of the dataset.
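As a hedged sketch, loosely inspired by the kinds of fields listed above (all names and values here are hypothetical, not a formal use of the ontology), a variable description could be published as a small machine-readable file:

```python
import json

# Hypothetical machine-readable description of a measurement variable,
# loosely following the kinds of fields listed above.
variable_description = {
    "name": "air_temperature",
    "observable_property": "Air temperature",
    "feature_of_interest": "Living room of Building1",
    "unit": "degree Celsius",
    "sensor": "Wall-mounted thermistor (hypothetical model XYZ-100)",
    "procedure": "Spot reading logged every 5 minutes",
}

# Publish the variable descriptions alongside the measurement data files.
with open("variables.json", "w") as f:
    json.dump({"variables": [variable_description]}, f, indent=2)
```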
5. Fully define missing data
It's very easy to show missing data as a series of blanks or empty cells. But this doesn't help to understand how the missing data occurred, which can be important when analysing the data.
There are a number of ways that missing data can happen:
- A sensor was supposed to measure a value at a timestamp but didn't because of a sensor error.
- A logging error occurred in the data collection system, so although the sensor measured a value this wasn't recorded.
- At a particular timestamp a sensor wasn't programmed to record a measurement, so there is no value to report.
- A respondent didn't answer a question at all.
- A respondent didn't answer a question correctly (perhaps giving multiple responses when a single response was required), so that answer should be ignored.
- Implausible or error values have been converted to missing data points before the publication of the dataset.
- And many other possibilities...
It is good practice to make clear the reason for a missing data point. One way of doing this is to use codes (e.g. -999, -998, -997) rather than blanks to denote missing data. Different codes can then be used to define the different reasons why the missing data occurred.
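As a sketch of how such codes can be handled in analysis (the codes, their meanings and the file name below are hypothetical), pandas can convert the documented codes back to missing values when reading the file:

```python
import pandas as pd

# Hypothetical missing-data codes documented with the dataset:
#   -999 = sensor error, -998 = logging error, -997 = no measurement scheduled
missing_codes = [-999, -998, -997]

# Treat the documented codes as missing values when loading the data.
df = pd.read_csv("Building1.csv", na_values=missing_codes)

# The reasons can still be recovered from the raw file if needed in the analysis.
raw = pd.read_csv("Building1.csv")
print((raw == -999).sum())  # count of sensor-error gaps per column
```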
6. Avoid using file names for data
The names of data files are useful for quickly seeing what data the file contains. For example datasets about buildings might contain a series of data files for each building such as 'Building1.csv', 'Building2.csv' etc.
However, the file name should not be used to hold data itself, but rather be viewed as a useful aid to understanding what the file contains. This is because analysis programming using the dataset should be able to find everything it needs from the contents of the data files, and not have to rely on the file names. In effect, it should be possible to change the file names to any alternative names without affecting the integrity or structure of the dataset.
In practice this means that it is fine to name a file 'Building1.csv' provided that somewhere inside the contents of the file it is clear that this data belongs to Building1, and the filename alone is not relied upon to convey this information.
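A minimal sketch of this idea (the column names and values are hypothetical): include the building identifier as a column inside the file, so that renaming the file loses no information.

```python
import pandas as pd

# Hypothetical measurements for one building; note the explicit building_id column,
# so the data remains self-describing even if the file is renamed.
df = pd.DataFrame({
    "building_id": ["Building1"] * 3,
    "timestamp": ["2020-01-01 00:00", "2020-01-01 00:05", "2020-01-01 00:10"],
    "air_temperature_degC": [20.1, 20.3, 20.2],
})

df.to_csv("Building1.csv", index=False)
```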
7. Give context to the data
Imagine a study that places an air temperature sensor in a room and records the air temperature for several months. This temperature data could be downloaded and easily shared as an Open Data dataset.
However the temperature data alone would not be particularly useful for other researchers. For them to be able to make use of the data, the context around the measurements would be needed. In this case, the context would be the room the temperature measurements were taken in. What kind of room was it? What was the size of the room? What type of building was it located in?
Providing context is essential for making it possible for others to reuse the data, in particular for other researchers to use the data in different ways than the original study intended.
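As a hedged sketch, this contextual information could be published as a small machine-readable metadata file alongside the measurements (all field names and values below are hypothetical):

```python
import json

# Hypothetical contextual metadata describing where the measurements were taken.
context = {
    "room": {
        "id": "Room1",
        "type": "Office",
        "floor_area_m2": 24.0,
        "building_type": "1970s concrete-frame office building",
        "sensor_location": "Wall-mounted, 1.5 m above floor level",
    }
}

with open("context.json", "w") as f:
    json.dump(context, f, indent=2)
```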
8. Provide analysis examples
It is incredibly useful to provide other users with some examples of how the dataset can be analysed. This immediately gives other users a 'way in' to making use of the dataset and helps them understand the structure of the data and what is possible to find out from the data.
For example, when I first published the REFIT Smart Home dataset I often received queries about how to work with the data. In response I created a GitHub repository with some examples of data analysis using Python and Jupyter Notebooks. This proved very useful in showing others how they might start analysing the dataset.
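As an illustration of the kind of starter example I mean (not taken from that repository; the file and column names here are hypothetical), even a short script that loads the data and plots one variable gives users a way in:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the published CSV data (hypothetical file and column names).
df = pd.read_csv("Building1.csv", parse_dates=["timestamp"])

# Plot air temperature over time as a first look at the data.
df.plot(x="timestamp", y="air_temperature_degC", figsize=(10, 4))
plt.ylabel("Air temperature (°C)")
plt.title("Building1: measured air temperature")
plt.tight_layout()
plt.show()
```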
9. Use unique identifiers
Whenever we use text or words to describe concepts we may introduce ambiguity and make it harder for programming scripts to understand our data.
For example, let's say I have taken some temperature measurements and the units are degrees Centigrade. How should I describe this unit of measurement in my dataset? I could use the text 'degrees Centigrade', 'degC' or just 'C'. As humans we may be able to interpret what this text means, but we can't assume that a computer will be able to.
An alternative is to use a unique identifier for the concept of degrees Centigrade. One such unique identifier has been created by the QUDT organisation and is given by the URL http://qudt.org/vocab/unit/DEG_C. Using this unique identifier in the dataset to represent the concept of degrees Centigrade is unambiguous. In addition, following the URL leads to a webpage with a definition of this measurement unit.
Unique identifiers can be used for concepts such as measurement units and also as identifiers for physical objects that our datasets are about, such as people, buildings etc. This can be a powerful tool for reducing uncertainty and ambiguity in datasets, and enabling common descriptors across different datasets.
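As a brief sketch (the field names are hypothetical; the unit URL is the QUDT identifier mentioned above), a variable description can reference the identifier instead of free text:

```python
import json

# Hypothetical variable description that uses a unique identifier for the unit
# rather than free text such as 'degC' or 'C'.
variable = {
    "name": "air_temperature",
    "unit": "http://qudt.org/vocab/unit/DEG_C",
}

print(json.dumps(variable, indent=2))
```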
10. Consider the FAIR principles
The FAIR principles of Findability, Accessibility, Interoperability, and Reuse have been widely adopted by research organisations and the Open Research movement in recent years as good practice for publishing and sharing datasets.
How exactly they should be implemented within different scientific disciplines is still a matter of much debate, but the principles themselves provide a good set of guidelines to consider when publishing Open Data datasets. A useful approach is to consider each of the FAIR sub-categories and evaluate how far your own dataset goes towards meeting the criteria.
The FAIR principles emphasise unique identifiers, interoperability and focussing on machine-readable data structures. They recommend using "commonly used controlled vocabularies, ontologies, thesauri" and "a good data model (a well-defined framework to describe and structure (meta)data)". This may lead to using more advanced data models to store Open Data datasets such as the RDF data model and the OWL ontology language, both mentioned specifically in the FAIR documentation.
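As a hedged sketch of what a more advanced data model can look like in practice, the snippet below uses the rdflib Python library to describe a single observation as RDF triples, reusing the SOSA/SSN vocabulary and the QUDT unit identifier mentioned earlier. The observation URI, the example namespace and the values are all hypothetical, and the unit is attached with a made-up example property rather than an official one.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Namespaces: SOSA is the core vocabulary of the Semantic Sensor Network Ontology;
# EX is a hypothetical namespace for this example dataset.
SOSA = Namespace("http://www.w3.org/ns/sosa/")
EX = Namespace("http://example.org/dataset/")

g = Graph()
g.bind("sosa", SOSA)
g.bind("ex", EX)

# One hypothetical temperature observation described as RDF triples.
obs = EX["observation/1"]
g.add((obs, RDF.type, SOSA.Observation))
g.add((obs, SOSA.hasFeatureOfInterest, EX["Room1"]))
g.add((obs, SOSA.observedProperty, EX["air_temperature"]))
g.add((obs, SOSA.hasSimpleResult, Literal(21.5, datatype=XSD.double)))
g.add((obs, SOSA.resultTime, Literal("2020-01-01T00:00:00", datatype=XSD.dateTime)))
# Unit referenced by its QUDT identifier via a hypothetical example property.
g.add((obs, EX["unit"], URIRef("http://qudt.org/vocab/unit/DEG_C")))

print(g.serialize(format="turtle"))  # rdflib 6+ returns a string here
```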