This is the first in a pair of posts designed to provide a primer on sources of open data. This post focuses on open government data and the next focuses on open data beyond the public sector.
To understand why public sector information is the main source of open data you may wish to read about the economics of open data.
Who is publishing Public Sector Information - Where can I find it?
In 2009 data.gov.uk was established to provide a central catalogue of public sector information, covering over 9,000 sources (as I write this). It provides metadata, that is, it acts as an index of other data rather than a repository itself. The site is built on the Open Knowledge Foundation’s CKAN platform. It provides a SPARQL endpoint for a variety of Linked Open Data resources. In theory, this site should cover most of what is available from Government. Although in the past the coverage has been patchy in parts and duplicated in others, this has largely been resolved over the past few years. The main problem with data.gov.uk is that it’s sometimes best to go direct to the horse’s mouth, particularly when what you’re interested in is a delivery API, not metadata discovery. As such I’ll go into a little more detail about the places you’ll usually find yourself after you’ve explored data.gov.uk.
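Because the catalogue is built on CKAN, it can in principle be searched programmatically as well as through the website. Here is a minimal sketch in Python using CKAN's standard `package_search` action. The base URL is my assumption about where the catalogue exposes that API, and the function names are just illustrative, so check the site's current documentation before relying on any of it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of the catalogue's CKAN action API (not confirmed here).
CKAN_BASE = "https://data.gov.uk/api/3/action"


def build_search_url(query, rows=5, base=CKAN_BASE):
    """Build a CKAN package_search URL for a free-text query."""
    return f"{base}/package_search?{urlencode({'q': query, 'rows': rows})}"


def dataset_titles(response):
    """Pull (name, title) pairs out of a decoded package_search response."""
    if not response.get("success"):
        return []
    return [(d.get("name"), d.get("title")) for d in response["result"]["results"]]


def search(query, rows=5):
    """Fetch and decode search results (this performs a network request)."""
    with urlopen(build_search_url(query, rows)) as resp:
        return dataset_titles(json.load(resp))
```

Calling `search("road traffic")` would return the catalogue entries matching that phrase. Remember that these are metadata records pointing at the data, not the data itself.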
The Office for National Statistics
The Office for National Statistics is the executive office of the UK Statistics Authority. The authority is a dedicated non-ministerial government department with responsibility for assessing official statistics (i.e. “those produced by a government department or persons acting on behalf of the crown”) against its Code of Practice to ensure that only compliant publications are designated National Statistics.
The ONS has long been providing data as both datasets and published reports (books, articles, bulletins) that include some narrative and interpretation. The most recent incarnation of the ONS site has convenient filters for discovering what is available by theme, release date (including forthcoming releases), and geographic scope and precision (i.e. where it covers and with what breakdowns). While it’s naturally tempting to rush to the raw data, you should also bear in mind that the methodological guidance is very important - there’s many a time this has either saved me from making a mistake in interpretation (fixed-capital investment figures used to be apportioned by jobs and so were no more informative than the jobs figures themselves for making regional comparisons) or suggested an alternate ‘experimental’ dataset that offered new insights (e.g. multiple approaches to measuring Gross Value Added).
Of particular note are:
- the ONS’s site for labour market statistics - nomis - which has a RESTful API compliant with the Statistical Data and Metadata eXchange (SDMX) ISO standard. The API offers both discovery and delivery services, URI resolution, and HTML, XML, JSON, and CSV response formats.
- the Neighbourhood Statistics Service (NeSS) - with their Neighbourhood Data Exchange (NDE) API, version 2 of which is now RESTful (I’ve got some Ruby bindings knocking around somewhere if you really want to use the SOAP interface instead).
- the forthcoming ONS API which will operate under a similar principle and is designed with data from the 2011 Census in mind.
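To give a flavour of what a delivery request to nomis might look like, here is a rough sketch of a URL builder in Python. Treat the details as assumptions rather than a definitive reference: the base path, the `.data.<format>` suffix for choosing a response format, and the dataset id and filter values in the usage note are all illustrative and should be checked against the nomis API documentation.

```python
from urllib.parse import urlencode

# Assumed base path for the nomis API; confirm against the nomis documentation.
NOMIS_BASE = "https://www.nomisweb.co.uk/api/v01"

# Response formats mentioned above, as lower-case file-style suffixes.
SUPPORTED_FORMATS = {"html", "xml", "json", "csv"}


def build_data_url(dataset_id, fmt="csv", **filters):
    """Build a delivery URL for one dataset in the requested format.

    `filters` become query-string parameters; they are sorted by name so
    the same request always produces the same URL.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    url = f"{NOMIS_BASE}/dataset/{dataset_id}.data.{fmt}"
    if filters:
        url += "?" + urlencode(sorted(filters.items()))
    return url
```

For example, `build_data_url("NM_1_1", "csv", geography="TYPE480", time="latest")` builds a CSV request for a (hypothetical) dataset id, filtered by a geography type and the latest time period.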
Central Government Departments
As part of their daily work and obligations for reporting, Central Government Departments produce great volumes of data that are publicly available. These departments have gradually been brought out from a heterogeneous collection of sites operating under a department.gov.uk subdomain to a consistent www.gov.uk/department arrangement. This has made it much easier to search across Government for statistics or research and analysis. Indeed a convenient index of statistical data sets is provided.
What’s available? The Big Names in PSI
- For transport, the main data sets are the National Public Transport Access Nodes (NaPTAN), which uniquely identifies public transport access points, and the National Public Transport Data Repository (NPTDR), which provides a snapshot of every GB journey in ATCO-CIF and TransXChange formats. Historically, the NPTDR has been taken annually in October but it is now being provided weekly as the Traveline National Dataset (requires registration for FTP access).
- The Indices of Multiple Deprivation - which rank English neighbourhoods (or Lower-layer Super Output Areas - LSOAs) according to seven dimensions of deprivation. The IMD is often used as a means of allocating resources by need. NB: the publication is irregularly timed and has (historically at least) been subject to methodological changes between years, making it difficult to use this dataset for trend analysis. It does, however, provide an incredible level of geographic precision.
- The Combined Online Information System (COINS) - which is basically the government’s central accounts, all 4.3GB of them!
- The ONS Postcode Directory and National Statistics Postcode Lookup - which provide a lookup for transforming postcodes into other geographic terms including statistical areas (e.g. LSOAs) and geocodes (i.e. Latitude-Longitude / Northings-Eastings).
- Ordnance Survey Open Data including the ‘tiles’ of the map (raster), government administrative boundaries (vector ESRI shapefiles), code-points (a CSV of postcode coordinates etc), and gazetteers of place and road names.
- Legislation.gov.uk publishes all UK legislation in both the original enacted form and with revisions that are made over time. The platform is managed by The National Archives and provides Linked Data access (try adding /data.xml to the end of a URI).
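The /data.xml trick for Legislation.gov.uk is easy to script. Here is a minimal sketch in Python. The helper name is my own, and the example URI in the usage note is only an illustration of the general shape of an item's address.

```python
from urllib.parse import urlsplit, urlunsplit


def to_data_uri(uri, suffix="/data.xml"):
    """Return the data form of a legislation URI by appending the suffix.

    Trailing slashes are trimmed first, and any query string or fragment
    is dropped so the suffix always lands at the end of the path.
    """
    parts = urlsplit(uri)
    path = parts.path.rstrip("/")
    if not path.endswith(suffix):
        path += suffix
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

For example, `to_data_uri("https://www.legislation.gov.uk/ukpga/2010/15/")` gives `https://www.legislation.gov.uk/ukpga/2010/15/data.xml`, which you can then fetch with whatever HTTP client you prefer.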
If you’re interested in more of the top datasets you might like to peruse data.gov.uk’s own list of popular datasets or read on to find out about what’s available outside of the public sector.
Have I missed some major sources or data sets? Let me know in the comments…