Data communities need to come before data catalogues
Barely a week goes by without a tweet calling for a data catalogue for the UK’s energy data. After 20 years working in the energy industry, during which I have seen a number of similar initiatives attempt to catalogue the data held by my own employers, my response to these latest calls is a mix of scepticism and sadness.
To be clear, I would love a good data catalogue. There is a lot of energy data out there, from many different sources, of varying quality. Consumption and generation data, weather data, flow data, price data, all at different locations and timescales. It is incredibly hard work for an industry newcomer to keep track of the meaning of different fields, let alone their limitations.
My worry is that we may not be able to create a data catalogue that meets all these needs. Much of the meaning of energy data is derived from context, which is much harder to document succinctly but is crucial if the data isn’t to be misused. A couple of examples might help illustrate this:
- National Grid publish solar generation data, but which sites are excluded? Is it adjusted for transmission/distribution losses? To what extent is it modelled vs measured? To what extent do these factors matter for an individual user, depending on what they are doing with the data? Will the importance of these factors change over time?
- BMRS publish much of their data by settlement day and half-hourly settlement period. But how do these settlement periods handle the clock change? When the clocks move forward, will they skip settlement periods 3 and 4, or will they skip periods 47 and 48?
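To make the clock-change question concrete, here is a minimal Python sketch that lists the half-hourly periods in one local Europe/London day. It assumes periods are numbered consecutively from local midnight; that is an assumption made for illustration, not a documented BMRS rule, and the function name `settlement_periods` is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")
HALF_HOUR = timedelta(minutes=30)


def settlement_periods(day: date) -> list[tuple[int, datetime]]:
    """Return (period_number, local_start) pairs for one local calendar day.

    Assumption: periods are numbered 1, 2, 3, ... from local midnight,
    with no gaps in the numbering on clock-change days.
    """
    next_day = day + timedelta(days=1)
    # Work in UTC so that each step is a real 30 minutes of elapsed time.
    start_utc = datetime(day.year, day.month, day.day, tzinfo=LONDON).astimezone(timezone.utc)
    end_utc = datetime(next_day.year, next_day.month, next_day.day, tzinfo=LONDON).astimezone(timezone.utc)

    periods = []
    number, current = 1, start_utc
    while current < end_utc:
        periods.append((number, current.astimezone(LONDON)))
        current += HALF_HOUR
        number += 1
    return periods


if __name__ == "__main__":
    for label, day in [("spring", date(2024, 3, 31)), ("normal", date(2024, 6, 1)), ("autumn", date(2024, 10, 27))]:
        periods = settlement_periods(day)
        print(label, day, "periods:", len(periods), "period 3 starts:", periods[2][1].strftime("%H:%M %Z"))
```

Under this assumption the spring clock-change day simply has fewer periods and the autumn one has more, so no period numbers are skipped; instead, period 3 starts at a different local time. Whether a given dataset follows this convention is exactly the kind of context that needs to be confirmed with the publisher or the community.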
One response to these sorts of questions is to say “if it matters, put it in the catalogue”. But that is a lot of documentation, likely overwhelming for any user. It is a massive amount of work for whoever is doing the documentation. It is also difficult for someone who has produced the data to know what other people don’t know. And in other cases, users may disagree on answers to some of the questions.
Related to this, data catalogues often suggest a sense of being the final word on the data. Contributors are often scared to enter information until they are certain, or they hold off until the long-awaited data improvement exercise can take place.
The following diagram illustrates how I see data catalogues aspiring to exist:
In practice, what usually happens is that no one catalogues or documents the data, and/or no one reads or trusts the catalogue.
What I think we need is more of a living platform for documentation and enriched data to develop. We need to get away from the idea of a small set of participants being responsible for all the data generation and documentation, and allow a role for an entire community of data contributors, along the following lines:
There are many examples of this kind of collaboration working effectively, from Wikipedia to online forums to internal company platforms. At a minimum, I’d propose a space for participants to ask and answer questions, and to share suggestions and documentation.
It is entirely possible that a data catalogue would end up being created by a data community, which would be great. Even then, starting with the community would allow the work of creating the catalogue to be shared, let it focus on the content the community deems most useful, and let the catalogue operate alongside community-created guidance and a history of frequently asked questions.