Last week Prime Minister David Cameron announced that more governmental information produced by the UK public sector should be opened to the public. According to the Guardian, this will make the UK's data.gov.uk repository the largest in the world, exceeding the US data.gov and its equivalents in other countries.
I decided two months ago to have a closer look at the emerging trend of governments sharing their data with the public, and yesterday's statement adds to my impression of the importance this process has gained. Several countries and organisations are now releasing data they have generated and aggregated into the public sphere through governmental repositories.
In a letter on transparency and open data addressed to the secretaries of state in the Cabinet, the Prime Minister writes:
"As you know, transparency is at the heart of our agenda for Government. We recognise that transparency and open data can be a powerful tool to help reform public services, foster innovation and empower citizens. We also understand that transparency can be a significant driver of economic activity, with open data increasingly enabling the creation of valuable new services and applications." (The whole document containing a list of public data made available can be found here.)
The wording of this introductory excerpt is similar to that used by President Barack Obama in his memorandum on Transparency and Open Government, sent to the heads of executive departments and agencies. In it he states that government should be transparent, participatory and collaborative, and instructs the addressed departments and agencies to adopt these principles.
The goal of the government data repositories is to make open, high-value data produced in the public sector available to the public, but what do open and high-value mean?
Here it can be useful to make a distinction between public and open data. Open data is data that fulfils the criteria listed below, and it can come from any entity. Public data, however, has two meanings: it can be either data released into the public sphere or data produced and aggregated by the public sector. I will as far as possible keep the distinction clear by spelling out whether the public data is produced by the public sector or released to the public sphere. The flow can also run the other way, when the public produces data that is later used by the public sector, as in FixMyStreet, developed by mySociety.
What is Open Data, and what delineates open from closed data? Is it enough for the data to be accessible for the public to see, or do other factors also play in? To decide what should be considered Open Data we need to set a threshold. Definitions are not neutral, and they bring implications with them.
The Open Knowledge Foundation has defined what should be considered open on its site Open Definition. The Open Knowledge Definition contains an eleven-point list of what characterises open knowledge. This list is influenced by, and hence similar to, the Open Source Definition created by the Open Source Initiative, and it inherits much of the ideology and principles of the Open Source movement, which evolved and manifested itself in computer culture. Whether all, fewer or more criteria need to be fulfilled for an element of knowledge to be considered open is a normative question, but for practical reasons let us accept the Open Knowledge Definition as our definition.
Absence of Technological Restriction
No Discrimination Against Persons or Groups
No Discrimination Against Field of Endeavour
Distribution of Licence
Licence Must Not Be Specific to a Package
Licence Must Not Restrict the Distribution of Other Works
The data has to be accessible by all, and should be published under a licence granting the user the right to redistribute and build upon the data. The data should not be discriminatory, and it should be possible for the user to use it for any purpose the user wants. Richard Stallman, developer of Emacs and the man behind the GNU General Public Licence and the Free Software Foundation, is famous for a quote that summarises what should be considered open by contrasting two meanings of free: free as in free speech, not as in free beer.
An important motivation for the release of data to the public sphere is the idea that data can be a valuable asset in the hands of the public. In Directive 2003/98/EC on the re-use of public sector information, the European Union sets out guidelines asking member states to make public sector information available.
Another important aspect of the release of open governmental data is that it has to be readable by both computers and humans. This may seem mundane, but if data is released as images, or embedded in Portable Document Format documents or Flash, it is not readable by computers, or it takes an unnecessary effort to screen-scrape the data from the site. On the other hand, if data is stored in a binary format it is not directly readable by humans.
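To make the trade-off concrete, here is a minimal sketch that stores the same two records once as CSV text and once in a packed binary layout. The column names and the 36-byte record layout are assumptions made up for this illustration; the two counts are the data-set figures quoted for data.gov.uk further down.

```python
import csv
import io
import struct

# Two example records: (publisher, number of data-sets).
records = [("Department of Health", 1001), ("UK Statistics Authority", 716)]

# Text form: CSV can be opened in any editor and parsed by any language.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["publisher", "datasets"])  # hypothetical column names
writer.writerows(records)
csv_text = buffer.getvalue()

# Binary form: a made-up fixed-width layout (32-byte name + unsigned int).
# Compact for machines, but opaque to a human without the layout description.
RECORD = struct.Struct("<32sI")
binary_blob = b"".join(
    RECORD.pack(name.encode("utf-8"), count) for name, count in records
)


def read_csv(text):
    """Parse the CSV text back into (publisher, count) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["publisher"], int(row["datasets"])) for row in reader]


def read_binary(blob):
    """Parse the packed records; only possible if the layout is known."""
    return [
        (name.rstrip(b"\x00").decode("utf-8"), count)
        for name, count in RECORD.iter_unpack(blob)
    ]


if __name__ == "__main__":
    print(csv_text)      # readable as-is
    print(binary_blob)   # mostly padding bytes and raw integers
```

Both readers recover the same records, but only the CSV version can be understood by a person without extra documentation, and only because its format is an open, widely known one.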
The World Wide Web Consortium (W3C), the organisation that develops web standards and aims to lead the web to its full potential, has suggested several ways of making data available to the public. It encourages governments to enrich their online presence with semantics, meta-data and identifiers, to publish data in open formats and industry standards, especially XML, and to make the information electronically citable.
In addition to XML- and HTML-based files, many of the services rely on comma- or tab-separated lists for static data, and on application programming interfaces for data that is frequently updated. Data sharers are also encouraged to release their data with semantics and relevant meta-data. This is important for placing the data in an understandable context, and for practical reasons such as ensuring more accurate search engine results or improving intelligent applications' understanding of the data. This ties into the idea of a semantic web, where computers are more aware of the semantic content of the data represented. When data is marked up with semantic notation such as the Resource Description Framework or the Web Ontology Language, developers can use data mining techniques to find connections between data-sets by applying computer intelligence.
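As a rough illustration of why shared identifiers matter, the sketch below models two tiny data-sets as subject, predicate, object triples, the basic shape RDF uses, and joins them where one set's object is the other set's subject. It is plain Python rather than real RDF tooling, the join is far simpler than actual data mining, and every identifier and value in it is made up.

```python
# Each statement is a (subject, predicate, object) triple.
# All "ex:" identifiers and values are hypothetical.
hospitals = [
    ("ex:hospital/1", "ex:locatedIn", "ex:area/A"),
    ("ex:hospital/2", "ex:locatedIn", "ex:area/B"),
]
areas = [
    ("ex:area/A", "ex:population", 1000),
    ("ex:area/B", "ex:population", 2500),
]


def link(left, right):
    """Join two triple sets where an object in `left` is a subject in `right`."""
    by_subject = {}
    for subject, predicate, obj in right:
        by_subject.setdefault(subject, []).append((predicate, obj))

    joined = []
    for subject, predicate, obj in left:
        # The join only works because both sets use the same identifier.
        for predicate2, obj2 in by_subject.get(obj, []):
            joined.append((subject, predicate, obj, predicate2, obj2))
    return joined


if __name__ == "__main__":
    for row in link(hospitals, areas):
        print(row)
```

If the two publishers had used free-text area names instead of a common identifier, the same join would need guesswork, which is the practical argument for releasing data with identifiers and semantics.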
A five-star scheme has been developed to rank the semantic-web value of data-sets and their meta-information according to how they have been released on the web. Data gets one star as soon as it is released under an open licence. To get two stars the data has to be made available in a machine-readable, structured format, and if this is a non-proprietary format, e.g. CSV or XML instead of Excel, the set is ranked at three stars. The last two stars are reserved for data that already fulfils the criteria for three stars and is in addition marked up semantically. The fourth star is given for identifying things in the data with open semantic standards such as RDF, and the fifth for placing the data in the context of other people’s data by linking to it (T. Berners-Lee, Linked Data).
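Because each star builds on the previous one, the ranking can be sketched as a walk through an ordered checklist that stops at the first unmet criterion. The class and field names below are hypothetical, chosen only for this example, and the sketch assumes the stars are strictly cumulative.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of how a data-set has been published."""
    open_licence: bool            # star 1
    machine_readable: bool        # star 2
    non_proprietary_format: bool  # star 3, e.g. CSV or XML instead of Excel
    semantic_markup: bool         # star 4, e.g. RDF
    linked_to_other_data: bool    # star 5


def stars(dataset):
    """Count stars in order, stopping at the first criterion not met."""
    criteria = [
        dataset.open_licence,
        dataset.machine_readable,
        dataset.non_proprietary_format,
        dataset.semantic_markup,
        dataset.linked_to_other_data,
    ]
    count = 0
    for met in criteria:
        if not met:
            break
        count += 1
    return count


if __name__ == "__main__":
    excel_sheet = Dataset(True, True, False, False, False)
    print(stars(excel_sheet))
```

An openly licensed Excel sheet stays at two stars under this reading, however well it is documented, until it is also offered in a non-proprietary format.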
Governmental Data Repositories
Data.gov of the United States of America, data.gov.uk of the United Kingdom and data.norge.no of the Norwegian government are just three examples of governmental data repositories aggregating data from the public sector in each country. We will review these three examples later, but first it can be beneficial to mention some of the traits shared by all the examples included in this paper, and also by other repositories.
A naming convention seems to have been established through the adoption of the data prefix followed by the domain of the public services. The repository sites are divided into sections, one where users can find raw data and one where they can find applications made from these data. Both sections contain a search function where the user can find data or applications based on criteria such as format, publisher of the data, and topic. The data can also be sorted by ranking, number of visits and number of downloads.
Data.gov.uk is the data repository for the government of the United Kingdom, and has been formally online since January 2010. The data can be sorted according to which department or agency owns it, and by doing this we find that the Department of Health is the largest contributor with 1001 data-sets, followed by the Department for Communities and Local Government (781 data-sets) and the UK Statistics Authority (716 data-sets). The site has 157 apps registered. The work on the site is overseen by the Transparency Board, where Sir Tim Berners-Lee is one of the members; other members include Dr. Rufus Pollock, who was one of the founders of the Open Knowledge Foundation, and Francis Maude, the Minister for the Cabinet Office. The implementation of data.gov.uk is led by the Transparency and Digital Engagement Team in the Cabinet Office.
The data published on data.gov.uk is licensed under the Open Government Licence. This licence has been developed to make reuse of public data easy, and is maintained by The National Archives.
Data.gov is the data repository of the United States of America. It was opened after a memorandum on transparency and open government was sent to the administrative entities following the inauguration of President Barack Obama in 2009; in it the new president called for an unprecedented level of openness in government. The site was officially opened on 21 May the same year, and when first opened it was released with 47 data-sets. On the two year anniversary of the site. Focus at the top administrative level has been said to be important for the quick release of data. The American government and its agencies have an open attitude to the sharing of data, and the democratic aspect has been emphasised through transparency and accountability.
Data.gov has also released a substantial amount of geographical data in addition to the other data-sets; this data comprises over 390,000 records.
Data.norge.no is the future site of the Norwegian data repository, now hosting a blog and lists of beta data-sets and applications.
The Ministry of Innovation, Administration and Church Affairs administers a blog that lays out the development of the opening of public data. The blog, and some beta features including XX data-sets and applications, can be found at an address following the naming convention already used by the US and UK governments. The blog was opened with a post by Minister Rigmor Aasrud on 19 April 2010.
The Norwegian Government has chosen to develop its own licence for sharing data. This has met with criticism, as a universal licence would have been preferred, but creating a common licence available in the local language can also be a better solution if the alternative is for each local organisation to use a separate, customised licence.
The Norwegian solution focuses less on transparency and accountability, and more on value creation and new innovative and entrepreneurial solutions. The sharing of public data has not received the same attention as in the US and the UK, but it has been listed as a focus area (fellesføring) for the central governmental agencies and directorates.
By May 2011, 16 nations, along with several other governing bodies and organisations ranging from local municipalities to international organisations, had opened similar repositories to the world. The countries that have released public data are mainly located in Europe, North America and Oceania, but some African and South American countries have also opened data.
A list of open data catalogues can be found on datacatalogs.org. This list is curated by open data experts from different branches of government, organisations, and NGOs. The list was launched at the Open Knowledge Conference in June 2011, and contains over 125 references to open data repositories. Among the organisations that have released high-value public data are the World Bank and the United Nations. The site is developed by the Open Knowledge Foundation, the same organisation that developed the Comprehensive Knowledge Archive Network (CKAN), the data catalogue software behind data.gov.uk.
This document is still under development, so if you have any questions, feedback or corrections, please get in touch. The illustration was made by the Sunlight Foundation and is borrowed from their homepage. Please read this blog post for more information.