University data as #linkeddata

On the LinkedIn Linked Data group, Brian Kelly at UKOLN asked the question: "Which town or city in the UK has the largest proportion of students?"

He has now summarised the responses on his blog.

This is a very interesting question, fraught with definitional problems, and an interesting exploration of the data quality of DBpedia. However, interesting though DBpedia is as a social knowledge database queryable with SPARQL, I don't see that this is really an exploration of Linked Data. I would expect the solution to be based on linking data from disparate, more definitive sources closer to the point of data collection: student numbers from HESA (even if available only as .xls at present), institution names linked to places using the RDFed EduBase data, and then Census data on populations.

Student Numbers

The HESA site has a page on online statistics which leads to a list of products. The first of these, Students and Qualifiers Data Tables, goes to a set of xls and csv tables showing analyses by different factors: subject of study; disability; ethnicity; institution level; qualifications obtained year by year from 1995/6 to 2007/8 (the latest collated). For example, the institutional level data for 2007/8 is available as an xls. We can use elev.at to do the conversion to XML, then a stylesheet to generate semantic XML (and hence RDF?). OK, straightforward.
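For the csv versions of the tables, the conversion step could also be done locally. The following is only a minimal sketch of that idea, not the elev.at route mentioned above: the column headers and the two rows are placeholders invented for illustration, not real HESA headers or figures, and the element names `institutions` and `institution` are my own choice.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text, root_tag="institutions", row_tag="institution"):
    """Turn each CSV row into an XML element with one child per column."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Column headers become element names, so spaces are replaced.
            child = ET.SubElement(item, column.strip().replace(" ", "_"))
            child.text = value.strip()
    return root


# Placeholder rows for illustration only (not real HESA data).
sample = "Institution,Total students\nExample University,1000\nSample College,250\n"
print(ET.tostring(csv_to_xml(sample), encoding="unicode"))
```

A stylesheet (or a further pass in code) would still be needed to turn this generic XML into something semantic.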

University locations


Discovering where Universities are is trickier. Universities are included in the EduBase2 data and hence in the schools RDF in data.gov.uk; for example, the data on Aston can be seen in my prototype RDF browser and in EduBase2.

The schools dataset provides a range of administrative areas (Parliamentary constituency, ONS Census areas) as well as OS eastings/northings and latitude and longitude. It is perhaps not so clear which of these geographic regions is most useful for answering the question, but this too looks doable.

Town Populations

The census provides population data and is readily available online via the ONS. The latest data is from 2001, but counts are available for the areas given in the EduBase2 data, so this looks possible too, within the limits of the data.
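Once the three sources are joined, the calculation itself is simple. Here is a rough sketch, assuming we already have student counts per institution, an institution-to-area lookup and a population per area; all the names and numbers are made-up placeholders, and the function name `student_proportions` is hypothetical.

```python
from collections import defaultdict


def student_proportions(students_by_institution, area_of_institution, population_by_area):
    """Return {area: students / population} for areas with at least one institution."""
    students_by_area = defaultdict(int)
    for institution, count in students_by_institution.items():
        area = area_of_institution.get(institution)
        if area is not None:  # institutions we cannot place are skipped
            students_by_area[area] += count
    return {
        area: students_by_area[area] / population
        for area, population in population_by_area.items()
        if population > 0 and area in students_by_area
    }


# Placeholder figures for illustration only.
students = {"Uni A": 1000, "Uni B": 500, "Uni C": 300}
areas = {"Uni A": "Town X", "Uni B": "Town X", "Uni C": "Town Y"}
populations = {"Town X": 10000, "Town Y": 3000}

proportions = student_proportions(students, areas, populations)
for area in sorted(proportions, key=proportions.get, reverse=True):
    print(area, round(proportions[area], 3))
```

The hard part is not this arithmetic but deciding which area definition to use and getting the institution-to-area lookup right.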

Student Surveys

Although not directly relevant to the question posed, there are additional data sources providing quality measures and student satisfaction.

The Times Good University Guide is one, with individual pages for each University (Aston)

The Complete University Guide

The Guardian University guide (available as a Google Spreadsheet)

Other data sources

For completeness, UCAS should be included as the central clearing house for University applications.  Its site provides more detailed data on each university and its courses.

Of course there are entries in Wikipedia, e.g. Aston University, and hence in DBpedia: Aston University

Linking University data

Linking these various sets of University data is, however, not straightforward.

The HESA tables contain only the institution names, but these are not stable (Aston University used to be called The University of Aston in Birmingham).

EduBase2 gives the institution name and two codes: the UKPRN (UK Provider Reference Number, issued by the UK Register of Learning Providers), which is 10007759 for Aston, and its own code, 133787.

The university is identifiable with the EduBase2 URI:

         http://www.edubase.gov.uk/establishment/summary.xhtml?urn=133787

but UKRLP does not display RESTful URIs thanks to the rewriting of URLs like:

           http://www.ukrlp.co.uk/ukrlp/ukrlp_provider.page_pls_provDetails?x=&pn_p_...

In the RDF, new URIs have been minted based on the EduBase2 code, e.g.

            http://education.data.gov.uk/doc/school/133787

(As an aside, the creators of the latest data.gov.uk RDF dataset on BIS research funding also have entries for universities, such as this page (from the Stromness browser), and have chosen to mint new URIs for universities

           http://education.data.gov.uk/id/institution/H-0108

which at present are not resolvable)

UCAS also has a code for Universities (A80 for Aston) 

          http://www.ucas.ac.uk/students/choosingcourses/choosinguni/instguide/a/a80

with a lot of data (but nothing program-friendly).

The Times Guide has no code and changes the name between pages: "Aston" in the guide, "Aston University Birmingham" on the individual page. The Guardian guide uses "Aston".

In the absence of a common identifier, linking can only be based on fuzzy matching of institutional names.
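To give a feel for what such fuzzy matching involves, here is a small sketch using Python's standard `difflib`. It is one possible approach under my own assumptions rather than a tested linking method: the stop-word list, the 0.6 cutoff and the function names `normalise` and `best_match` are all illustrative choices.

```python
import difflib
import re

# Words that carry little identifying information in institution names
# (an illustrative list, not a definitive one).
STOPWORDS = {"the", "university", "of", "in", "at"}


def normalise(name):
    """Lower-case a name and drop punctuation and stop words."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(w for w in words if w not in STOPWORDS)


def best_match(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if below cutoff."""
    if not candidates:
        return None
    target = normalise(name)
    scored = [
        (difflib.SequenceMatcher(None, target, normalise(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return candidate if score >= cutoff else None


candidates = ["Aston University", "University of Bath"]
print(best_match("Aston", candidates))
print(best_match("Aston University Birmingham", candidates))
```

With these settings "Aston" links to "Aston University", but "Aston University Birmingham" normalises to "aston birmingham" and falls below the cutoff, so it is left unmatched. Lowering the cutoff recovers it at the risk of false matches elsewhere, which is exactly why name matching is a poor substitute for a shared identifier.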

What institutions are included?

To count students we need to identify which institutions to include. However, sources differ in which institutions they include. Just looking at the numbers, without matching the sets, we see a wide range of implicit definitions:

  • HESA 166
  • EduBase2
      • Higher Education 139
      • Further Education 482
  • UCAS 304
  • Times Good University Guide 114
  • Complete University Guide 113
  • Guardian University Guide 117 + 32 minor

Linked Data project

Integrating these disparate datasets represents an interesting challenge in Linked Data. It is tempting to start to scrape and integrate as a private project, just for the challenge, but such an approach would yield not Linked Data but another database requiring re-scraping and recoding as sources changed structure. If Linked Data means anything, it means linking disparate datasets published as close to the source as possible: by UCAS, by HESA, by the Guardian, by EduBase etc. Central to such integration is agreement on a common identifier, say the UKPRN code, perhaps also expressed as a URI within some agreed internet domain.
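As a sketch of what a UKPRN-based URI might look like: the base domain below is a placeholder (example.org), since no domain has been agreed, and the 8-digit check is an assumption based on the Aston example above rather than on a formal specification.

```python
# Placeholder base; a real scheme would need an agreed, stable domain.
BASE = "http://example.org/id/institution/"


def institution_uri(ukprn, base=BASE):
    """Build a URI for an institution from its UKPRN."""
    code = str(ukprn).strip()
    # Assumption: a UKPRN is an 8-digit number, as in the Aston example.
    if not (code.isdigit() and len(code) == 8):
        raise ValueError(f"not a plausible UKPRN: {ukprn!r}")
    return base + code


print(institution_uri(10007759))
```

Each publisher could then attach its own data to the same URI, leaving name matching as a fallback only.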