April 9-12, 2008
Montréal, Québec, Canada

Exploring Museum Collections On-line: The Quantitative Method

Frankie Roberto, Science Museum, United Kingdom


Any museum professional will tell you that collections are at the heart of what a museum is about. With collection sizes typically too large to have everything on display at once, museum Web sites have often been called upon to help provide greater access to museum objects. The approach, though, is typically one of depth rather than breadth, focusing on highlighted objects and curated themes. This can deliver great educational content - but how might you build a Web site to represent the collections of a museum as a whole?

The paper sets out to answer this question by way of original research and experimentation on real data sets of museum objects, obtained from a number of UK museums by way of a Freedom of Information request. These data sources are roughly hewn together - a technical and semantic challenge that's briefly explained - to form a single, enormous database. The result is a prototype Web site employing a fresh approach to viewing museum collections on-line, eschewing details in favour of high-level overviews and visualisations, incorporating user annotations and revealing insights into the histories of museum collecting. The 'who', 'what', 'where' and 'when' of objects are all examined in turn as axes for understanding and grouping collections, with the 'why' left open for interpretation and comments.

Looking forward, the paper asks how museum objects might fit into a 'web of data', where collections from different museums can be compared with each other, and perhaps even with private and personal collections.

Keywords: on-line collections, social software, quantitative analysis, data visualisation


A stereotypic description of what's inside museums might be 'things in glass cases'. Whilst many museums now contain far more than this, delivering an experience that encompasses a whole range of presentational devices to tell compelling stories, the collection and display of historic objects is nonetheless still at the heart of what museums are about.

Something that's well known amongst museum staff, but less well known to the general public, is the scale of museum objects not on public display. It isn't uncommon for 10% or less of a collection to be on display, and the sheer size of reserve collections can be pretty staggering. At the Science Museum, for instance, despite having 60,000 square metres of floor space, there's still a need for two separate off-site object storage facilities. The first is Blythe House, a huge building that used to be HQ for the Post Office Savings scheme, with over 90 rooms of objects. The second is even bigger, an ex-RAF air base, where airplane hangars have been converted into huge warehouse-sized storerooms for museum objects. Together with the museum, these locations hold over 300,000 objects.

As storing vast quantities of objects that nobody knows about could be considered a waste of time (and considerable expense), museums typically embark upon schemes to give greater access to this untapped collection of objects. These can include public tours of the storage facility, or temporary exhibitions operating a kind of squad rotation to put different objects on public display. A third way, though, is to offer some kind of Web-based 'access', which has the advantages of infinite capacity and potentially world-wide reach.

There seem to be two approaches to producing collection-based Web sites, both of which will be familiar to the museum professional. The first approach is to focus upon interpreting a modest number of objects, using curated themes and rich media to tell stories and deliver educational outcomes. Examples of this kind of work from my institution are Ingenious, Making the Modern World, and an upcoming project, internally called 'Sickness and Health'.

The second approach is to transfer a museum's internal collections records into a Web site. This is usually only practical if the records are held in an electronic format, and of course you can only publish the data that you have - and that data may not extend to photographs and well-written descriptions of every object. Nevertheless, the result can be a valuable resource. Usually the primary user interface is a search box, and the primary use is for research purposes. It can be possible to enable a more browsing-based interaction, and even serendipitous experiences (see Chan 2007).

The root of the research that the rest of this paper explores, however, comes from an observation that neither of these approaches gives much of a sense of the totality of a museum's collections. Neither offers much help in answering the question, 'What kind of stuff does this museum generally have?' This isn't, perhaps, the most important question that museum Web sites should be answering. Many people will only be interested in a particular subset of objects, or in learning about a particular topic. However, I think there's value in attempting to represent the breadth of a museum's collections, and doing so might help to reveal some interesting things about a museum's very identity.

Getting Started

A discussion at the University of Leicester kicked off the idea that got me started with this project. An information workshop day held before the UK Museums on the Web Conference 2007 looked at the creation of 'mashups'. Naturally, object databases were obvious candidates for a source of data to 'mash up'. A lively debate ensued, where participants pointed out the difficulties involved in accessing and combining collections data. These difficulties include bad or inaccurate data, incompatible meta-data standards, the use of proprietary and difficult-to-handle data formats, and so on. Even well-funded, multi-institution projects that aim to present a single searchable object repository can be beset with problems.

In a fit of frustration with these apparent difficulties, I suggested that inaccurate data didn't matter for the purposes of a generalised aggregate view of the objects, and that different meta-data standards could be crudely mapped together. The problem of actually getting the data could be solved, I suggested, either by screen-scraping an institution's Web site (a common hacker technique which basically parses data out from an automated crawl of Web pages), or by simply asking a museum for the data under a Freedom of Information request. Whilst someone joked that this might amount to a Denial-of-Service attack on a museum's records department, there was general consensus that this could actually work, and I was encouraged to 'give it a try'.

Fast-forward a couple of months, and I put my money where my mouth was by issuing FOI requests to each of the UK's National Museums (as designated by an Act of Parliament). I didn't expect every museum to comply with the request, but I hoped that a few would, giving me enough data to get started. Meanwhile, I began designing the prototype Web site that would hold this data.

Architectural Design

In designing this prototype Web site, I've used the framework that's emerged from recent thinking about designing for a 'Web of Data'. This concept suggests that Web sites are moving away from pages of content, and towards providing representations of large data sets, exposing these through Web services as well as HTML pages. These services result in a rich set of resources that live beyond a single Web site, and can co-exist alongside other services, giving access to different data.

Tom Coates (Coates, 2006) proposes some key architectural principles for building a service that can aggregate with this 'Web of Data'. One of the first principles he proposes is 'to identify your first order objects, and make them addressable' (via a unique, persistent identifier). Some examples from popular data-driven Web services are 'products' on Amazon, 'photos' on Flickr and 'URLs' on social bookmarking services. For the purposes of this project, the first-order objects are clearly 'museum objects'.

One key attribute which first-order objects need is to be clearly discrete. There's an interesting side-argument here for museum objects, which colleagues remind me are notoriously difficult to count. With a stamp collection, for example, should the individual stamps count as objects, or should the books that contain them count? Whilst there are some interesting metaphysical arguments to be had here, for the purposes of designing a Web site exposing museum objects, we can simply defer to the institution as to what counts as a single object, and what doesn't. Whilst this might not be standardised, the choices made might actually reveal something about the museum - the point of this Web site, after all.

Having identified our first-order objects, and given them a URI (see table below), the next architectural principle to follow is to build 'list views' that can be used to navigate between the first-order objects. For examples of these navigation axes, we can turn to Flickr, which has them in spades, including all of the following: photographer (user), tags, date taken, location, group 'pool', camera model, users who have 'favorited' it, and so on. Note that some of this data is gathered automatically (date taken and camera model), whilst other data is added manually by users (tags and location).

Museum objects can have a wealth of meta-data surrounding them. Indeed, establishing meta-data standards for museum data is something that has long been debated by experts. Whilst this is a valid activity, for the purposes of this research I am not interested in detailed and accurate data for individual objects, but instead in data that can be aggregated and used to provide some insights into the collection as a whole. This data can be crude and sometimes inaccurate, so long as it's available for a reasonably large portion of the data set.

The '5 Ws'

To find these core meta-data axes, I turned to the '5 Ws' of journalism, said to provide the key first bits of information: Who, What, Where, When and Why. 'Who' is a reasonably tricky one for objects, as different people could be involved in an object's design, manufacture, or even as its subject (eg in a photograph). Sometimes all of this data is recorded for museum objects, but I wanted simply to focus on what the object might say about the institution, and so the only 'Who' I'm interested in is the person who collected it. This could be a historic figure (Henry Wellcome), or a curator. In many cases, this data isn't recorded, but where it is, it could prove interesting.

The second W, 'What', is perhaps the most important, but also the most potentially difficult to define. Typically, collections databases may devote paragraphs of content to describing 'what' an object is. In order to show an aggregated view of what objects a museum has, however, data needs to be easily processable. The model I've adopted for the design of this prototype is to gather short key-words, or tags, that describe the object on a most basic level. These might be, for example, 'clock', 'flag', 'coin', and so on. Gathering these keywords may be a challenge, but I hope to be able to extract them in a crude manner from whatever categorisation data is available. If this fails, perhaps there's a social solution (see next section).
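As a minimal sketch of what such 'crude' extraction could look like, the following Python splits a free-text categorisation field into short lowercase tags. The separator set, the sample text and the `extract_tags` name are illustrative assumptions rather than a description of any museum's actual records.

```python
import re

def extract_tags(category_text):
    """Split a free-text categorisation field into short, lowercase tags."""
    # Assumed separators: commas, semicolons, slashes and spaced hyphens.
    parts = re.split(r"[,;/]|\s-\s", category_text)
    tags = []
    for part in parts:
        tag = part.strip().lower()
        if tag and tag not in tags:  # drop blanks and duplicates, keep order
            tags.append(tag)
    return tags

print(extract_tags("Clock; Timekeeping / clock, Horology"))
```

A real data set would probably need a stop-list and some manual tidying of the most frequent tags, but even this level of processing yields something that can be counted and grouped.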

The third W, 'Where', might refer to the place an object is from. I don't want to worry too much about which sense of belonging this might refer to. For our purpose it can simply be where an object was collected from, information that is often part of the data recorded by museums. As places are relatively abstract concepts, to use place as an axis for aggregation, we need to pick a level of granularity that yields a reasonably discrete list. 'City' is a bad choice, as many objects aren't from cities, and so I'll take the easy option of only caring about places at the country level. Whilst country borders and identities aren't without dispute, they're stable enough, and there's a helpful list provided by the ISO, with standardised codes for each country.

The fourth W, 'When', adds a temporal dimension. As museums typically keep objects for a long time, and have historic collections, this is a key one. Timelines are a fairly standard means of navigation on museum object Web sites, and for good reason. In my case, as I'm interested in what a museum's collections might say about the museum, the 'when' I'm interested in is when the object was collected. Not only is this significant in telling the story of a museum's historic activities, but it also tends to be one of the things that routinely gets noted in a museum's records. This may be recorded as a full date, but for aggregation purposes I'm only going to bother with dates at the level of year.
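Reducing dates to the level of year can be sketched in a few lines of Python. The example below assumes that acquisition dates arrive as loosely formatted strings and that any four-digit number between 1500 and 2099 is a plausible year; both the range and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed range of plausible acquisition years: 1500-2099.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

def acquisition_year(raw_date):
    """Return the first plausible four-digit year in a date string, or None."""
    if not raw_date:
        return None
    match = YEAR_PATTERN.search(raw_date)
    return int(match.group(1)) if match else None

def objects_per_year(raw_dates):
    """Count objects by acquisition year, ignoring unparseable dates."""
    years = (acquisition_year(d) for d in raw_dates)
    return Counter(y for y in years if y is not None)

print(objects_per_year(["12/03/1962", "1962-05-01", "circa 1887", "unknown"]))
```

Records whose dates cannot be parsed are simply left out of the count, in keeping with the crude-but-aggregable approach described above.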

The fifth W is 'Why'. This is the trickiest, as the reason for collecting an object might never get recorded. Even if it is, as in the modern museological practice of adopting acquisition policies, it's hard to see a way of exposing this as a means for aggregating a museum's objects. So I'm going to cop out and not bother with this one. To replace it (five is a nice number), we can borrow 'How', which is sometimes lumped in with the five Ws anyway. 'How' in these circumstances can refer to 'how' an object was collected. The exact means by which an object was acquired might be complex and unrecorded, but there are a few different options that we can generalise upon: objects can be acquired through purchase, donation, or loan ('stolen' might be a fourth option, but let's not go there). This is not perhaps the most interesting-sounding meta-data, but it might give us some insights nevertheless.

So, to recap, the axes I've selected (and each gets its own URI; see table below) are Who (a person), What (some tags), Where (a country), When (a year) and How (purchase, donation or loan). Each of these concepts is fairly simple, and they are so widely used that there are standardised ways to represent them in machine-readable formats. People, tags and dates, for example, can all be represented by 'microformats'. This is significant, as microformats aim to 'pave the cow paths' of current behaviours and usage patterns, and represent simple bits of semantic information that can be embedded within normal HTML pages. So if our meta-data correlate with existing microformats, rather than complex Semantic Web-style ontologies, there's more chance that the data will be interoperable, and actually used.
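To make the microformat idea concrete, here is a small Python sketch that renders a tag as a rel-tag link and a collector as a minimal hCard. The `rel="tag"` attribute and the `vcard`, `fn` and `url` class names come from the published microformats; the helper names, the URL paths and the numeric person ID are assumptions made for illustration.

```python
from html import escape
from urllib.parse import quote

def tag_link(tag):
    """Render a tag as a rel-tag microformat link."""
    # rel-tag reads the tag name from the last segment of the href.
    return '<a href="/tags/{}" rel="tag">{}</a>'.format(quote(tag), escape(tag))

def person_card(name, person_id):
    """Render a collector's name as a minimal hCard."""
    return ('<span class="vcard"><a class="fn url" href="/people/{}">{}</a></span>'
            .format(person_id, escape(name)))

print(tag_link("clock"))
print(person_card("Henry Wellcome", 42))
```

The markup is ordinary HTML with a few agreed attribute values, which is what lets it sit inside normal pages without any extra infrastructure.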

URI                    Description
/objects/ID            Object page
/people/ID             'Who' - the people who collected the objects
/tags/TAGNAME          'What' - tags describing the object
/countries/ID          'Where' - the countries the objects were collected from
/time/YEAR             'When' - the year an object was acquired
/acquisition/METHOD    'How' - the means by which an object was acquired

Table 1: URI scheme for a museum object data Web site prototype.
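One way to picture how a single record relates to the scheme in Table 1 is the following Python sketch. The `MuseumObject` fields and the `list_view_uris` helper are hypothetical names chosen for illustration, and the use of two-letter ISO country codes as the country ID is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import quote

@dataclass
class MuseumObject:
    object_id: int
    collector_id: Optional[int] = None              # Who
    tags: List[str] = field(default_factory=list)   # What
    country_code: Optional[str] = None              # Where (ISO code, assumed)
    year_acquired: Optional[int] = None             # When
    acquisition_method: Optional[str] = None        # How

def list_view_uris(obj):
    """Return the Table 1 URIs under which this object should appear."""
    uris = ["/objects/{}".format(obj.object_id)]
    if obj.collector_id is not None:
        uris.append("/people/{}".format(obj.collector_id))
    uris.extend("/tags/{}".format(quote(tag)) for tag in obj.tags)
    if obj.country_code:
        uris.append("/countries/{}".format(obj.country_code))
    if obj.year_acquired is not None:
        uris.append("/time/{}".format(obj.year_acquired))
    if obj.acquisition_method:
        uris.append("/acquisition/{}".format(obj.acquisition_method))
    return uris

coin = MuseumObject(1, tags=["coin"], country_code="IN",
                    year_acquired=1962, acquisition_method="purchase")
print(list_view_uris(coin))
```

Every field other than the ID is optional, so an object with patchy meta-data simply appears in fewer list views.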

Social Interactions

I have so far identified the first-order objects and 'list views' for a museum object Web site, but a final design step is to think about how this 'web of data' might be socialised. Data visualisation doesn't need a social dimension to be interesting, but adding one could help us in a number of ways. The first, most prosaically, would be the augmentation of our data, with users both filling in gaps and adding extra data. The second is as a way of allowing the most interesting observations to surface. The final reason is to simply keep the service alive - socialising around the web of data is something people naturally do, and it will often drive the use of the data, and even the data itself, in interesting and unpredicted ways.

Rather than just guessing at what social elements might work for a 'web of data' around a museum's object collection, we can turn to a useful framework defined by blogger Jyri Engeström (Engeström, 2005), which in some ways builds upon the architectural principles proposed by Coates. Engeström suggests we start by defining what our 'social object' is. This is important, he reasons, because the most engaging interactions happen around objects - photos, films, books, and so on. He contrasts these interactions with the initial social networks - Friendster being the main example - which were focused solely around people and their connections. Here, he says, there was nothing much to do other than to 'add friends'. Once this activity (and calling it an activity is apt, as Jyri quotes another blogger who went so far as to say about Friendster that 'That was the "game" right? He who has the most contacts wins.') dried up, the sites stagnated. Myspace might be an example here, but it has arguably come to find its social object in bands and musicians.

Do museum objects make good 'social objects'? Jyri has posted some thoughts on this too (Engeström, 2007). One key factor is something he describes as an object's "social gravitational pull" - in short, how much people care about it, and how many 'handles' there are to generate discussion and conversation points.

If you look at most of the successful Web services that have been built around the 'Web of Data' model, the 'social object' that the site is built around tends to fit one of two categories. Either it is an object that all users create, as in the photos on Flickr or the trips on Dopplr, or it is an object of culture that few people create, but lots of people are interested in, as in the books on Amazon or the films on IMDB or Netflix. The former group of objects have inherently high social pull, as the people who create them have a natural emotional attachment. The second group of objects are less inherently social, but instead are elevated to a level of social engagement through their position in popular culture. Museum objects don't really fall in either camp. They aren't created by the users, nor do they fall into easily accessible culture.

Museums typically hold some objects which do have high emotional and social appeal, however. These are usually either icons from history (such as the Science Museum's Stephenson's Rocket), or items with a high recognition factor, evoking personal memories and nostalgia (old televisions, washing machines and toys). Across the collection as a whole, though, these objects can be few and far between. An anecdote often quoted at the Science Museum is that we have, somewhere, a collection of several thousand scalpels, each subtly different from the next, but largely similar. Whilst this collection as a whole is interesting, the individual items can often have low social pull.

This creates a problem for any Web site trying to add value to a web of museum object data through social networking and user interaction. A better museum candidate for something to build a social Web service around is exhibitions. These are less granular, and have social 'pull' from the fact that people physically visit them, sometimes even paying for a ticket. A social Web site based around museum exhibitions would be an interesting project to pursue, although it would have to cover all museums in order to have enough data to be interesting.

For this project, though, the focus is firmly on museum objects. A way to get around the problem of the millions of individual objects having low social pull is to de-emphasise the importance of the individual object pages, which will have very little data anyway, and play up the importance of the aggregation pages. If an individual scalpel isn't interesting, then perhaps the collection of scalpels will be. This is somewhat equivalent to the 'groups' feature in Flickr, where a 'pool' of photos can be collected, with discussion and members attached to the collective group of photos.

Aggregation pages, or 'List Views'

Designing a Web service around aggregation pages is hard. Not only do I need the pages for each Who, What, Where, When and How, but I also need pages where these can be combined, so that it's possible to point people at a page of 'objects from the Science Museum collected in 1962 from India', or 'objects from the National Maritime Museum that are coins and were acquired by a purchase'. There are, of course, millions of permutations. These views could simply be accessible through an advanced search, of the type where you select two fields, their values and an 'AND' operator to specify the intersection of these fields. This is a possible, and perhaps even useful, feature, but it would be hard to make it easily usable. It's well-known that 'advanced search' options rarely get used on public Web sites, and anyway, this feature would suffer from the problem that, before searching, a user would have no way of knowing how many results would be in the returned collection.
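The intersection of axes, and the problem of not knowing result sizes in advance, can both be illustrated with a short Python sketch over in-memory records. The record layout, the sample data and the function names are assumptions for illustration; a real site would run equivalent queries against its database.

```python
from collections import Counter

# Hypothetical sample records using the axes described above.
RECORDS = [
    {"museum": "Science Museum", "year": 1962, "country": "IN", "tags": ["clock"]},
    {"museum": "Science Museum", "year": 1962, "country": "FR", "tags": ["coin"]},
    {"museum": "National Maritime Museum", "year": 1962, "country": "IN", "tags": ["coin"]},
]

def matches(record, filters):
    """True if the record has every requested axis value."""
    for axis, wanted in filters.items():
        value = record.get(axis)
        if isinstance(value, (list, set, tuple)):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True

def combine(records, **filters):
    """Return the records at the intersection of the given axes."""
    return [r for r in records if matches(r, filters)]

def refinement_counts(records, axis, **filters):
    """Count how many objects each value of `axis` would return next."""
    counts = Counter()
    for r in combine(records, **filters):
        value = r.get(axis)
        values = value if isinstance(value, (list, set, tuple)) else [value]
        counts.update(v for v in values if v is not None)
    return counts

print(combine(RECORDS, museum="Science Museum", year=1962, country="IN"))
print(refinement_counts(RECORDS, "country", museum="Science Museum"))
```

Showing the output of something like `refinement_counts` next to each link is one way to tell a user how many objects a combined page holds before they visit it.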

A better approach might be to use some kind of sophisticated computing algorithm which identifies properties of objects that seem to follow a pattern, and suggests collections which are possibly interesting. Flickr has experimented with this kind of approach with their 'clusters' feature, where the image tags are analysed for tags that often appear together; they are then presented as a group comprising the photos which are tagged by most of the tags within that cluster. This relatively simple concept can generate some genuinely meaningful groupings. The clusters for the tag 'museum' are, at present, 'art, architecture, sculpture', 'newyork, nyc, newyorkcity', 'paris, france, louvre' and 'london, england, uk' - which together do seem to represent some of the most popular locations and types of museum.
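The starting point for this kind of grouping is a count of which tags appear together. The Python sketch below shows only that first step, using made-up sample data; it is not Flickr's actual algorithm, which is not described here, and the function name is an assumption.

```python
from collections import Counter

def cooccurring_tags(tagged_items, tag, top=3):
    """Return the tags that most often appear alongside `tag`."""
    counts = Counter()
    for tags in tagged_items:
        unique = set(tags)
        if tag in unique:
            counts.update(unique - {tag})  # count every companion tag once
    return [t for t, _ in counts.most_common(top)]

items = [
    ["museum", "paris", "louvre"],
    ["museum", "paris", "france"],
    ["museum", "london"],
    ["beach", "paris"],
]
print(cooccurring_tags(items, "museum", top=1))
```

Turning these counts into clusters would need a further step, such as grouping companion tags that also co-occur strongly with each other.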

Finally, we can look to social actions to help pick out some of the most interesting aggregation pages. There's a good history of this approach. At the simplest level, you can simply watch what people are talking and blogging about elsewhere. Google Earth, for example, is an enormous resource for aerial photography, but much of the world is just boring ocean, desert, or endless suburban sprawl. Nevertheless, there are plenty of fascinating things in there if you know where to look, and helpful bloggers have documented many of them. A feature on The Register, for example, asked people to send in sightings of 'black helicopters', or other secret military equipment (see Haines 2005). The result was staggering - not only were there many spottings of black helicopters, but also spottings of crop circles, fighter planes, nuclear submarines and more. For other examples, we can look at the work of Stamen Design, an agency famed for data visualisations. One of their projects, Trulia Hindsight, plots house building data on an interactive map, showing how cities and towns were formed over time. Again, much of it looks the same, but users are able to point to interesting places, such as where planned towns were rapidly constructed, or where an area had to rebuild homes lost to a fire, and these places are then exposed to users as good starting points. This kind of model could work for a collection of museum objects.

Having established what our social objects are, and how social activity will be focused around the aggregation pages, it's worth thinking briefly about what these social interactions might be. Engeström suggests that a key principle is to 'define your verbs'. These are often to 'add' something (a photo, a comment, a tag), or might be to share, to save, to recommend... For a Web site about museum objects, the kinds of things we might want our users to do might be 'to bookmark' a collection, either within the site or on an external service, 'to comment' on some objects, and perhaps 'to correct' information, if they believe it to be wrong.


Now that I've outlined the framework for how museum object data might be presented in a Web site, I'll return to the practicalities of producing a prototype for demonstration. At the time of writing, I've received object data resulting from the Freedom of Information requests from four museums. The data arrived in different formats, with different types of meta-data, requiring some work to compile these sources together.

One list of objects came as a folder full of Word documents, each containing a table with the object data - not the easiest format to import into a database, but a bit of copying and pasting into a spreadsheet, then exporting as a comma-separated-values file, made it more or less machine readable. Another practical problem came from a museum which sent me their object data as a huge, 1GB+ XML file. I couldn't even open the file without my machine crashing, and so the solution was to run a command to split the file into a few hundred smaller files that were easier to work with. One reason to outline these issues is to make the point that it doesn't matter too much which format a museum makes its data available in. Some formats are easier to use than others, but the determined developer can usually find a way, and so there's little excuse in practical terms for not making this data available.
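For readers facing a similarly oversized XML file, a streaming parser avoids loading the whole document into memory. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`; it is an alternative illustration rather than the command used for this project, and the `object` element name and batch size are assumptions about the file's structure.

```python
import io
import xml.etree.ElementTree as ET

def iter_record_batches(source, record_tag, batch_size):
    """Stream a large XML file and yield lists of serialised records."""
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            batch.append(ET.tostring(elem, encoding="unicode").strip())
            elem.clear()  # release the memory held by this record's contents
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

sample = io.BytesIO(b"<objects><object id='1'/><object id='2'/><object id='3'/></objects>")
for number, batch in enumerate(iter_record_batches(sample, "object", 2), start=1):
    print(number, len(batch))
```

Each batch could then be wrapped in a root element and written to its own numbered file, giving a few hundred small files in place of one unmanageable one.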

The second challenge I've faced in pulling together data for a prototype is not having data available in a structured format. One data set, for example, had a 'location' listed for a good proportion of the objects, but these locations were listed in a free text field rather than containing a reference to country or town. Even at a country level, countries were referred to in different ways ('Republic of France' vs 'France', etc). Going through every record and picking out which country the location referred to would have been a huge job, so the solution here is to take a pragmatic approach. Rather than going through every record, we can just generate a list of all the unique location strings, order them by how frequently they occur, and then start at the top. In this way, within an hour or so I was able to convert most of the locations into structured references to a country. Locations which only occurred once or twice were discarded. This means having slightly less data, but it's a worthwhile compromise that quickly gives some useful data to start with.
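The frequency-ordered approach described above can be sketched in Python as follows. The `location` field name, the sample records and the hand-built mapping are assumptions for illustration; the two-letter codes are ISO 3166-1 country codes.

```python
from collections import Counter

def location_frequencies(records, field="location"):
    """List unique location strings, most frequent first."""
    counts = Counter(
        r[field].strip() for r in records if r.get(field) and r[field].strip()
    )
    return counts.most_common()

def apply_country_mapping(records, mapping, field="location"):
    """Attach a country code wherever the location string has been mapped."""
    for r in records:
        key = (r.get(field) or "").strip()
        r["country"] = mapping.get(key)  # None for unmapped (rare) strings
    return records

records = [
    {"location": "France"},
    {"location": "Republic of France"},
    {"location": "France "},
    {"location": ""},
    {"location": "Paris"},
]
print(location_frequencies(records))

# Mapping built by hand, working down the list from the most frequent string.
mapping = {"France": "FR", "Republic of France": "FR"}
apply_country_mapping(records, mapping)
```

Because the mapping is filled in from the top of the list, each minute of manual work covers the largest possible number of records, and the long tail of rare strings is left unmapped.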

For construction of the prototype Web site, I've used the Web framework Ruby on Rails, which allows data-driven Web sites to be quickly built and rapidly iterated. To generate graphs and charts which help to visualise the data, I've used the Google Charts API, which lets you simply pass variables to Google and receive back an image file containing your chart. To illustrate where objects come from in the world, I've developed a custom Flash animation which takes a dynamically-created XML file and colours in the countries of the world accordingly. None of these is particularly complex, and I'm sure there's a lot more that could be done in the way of clever visualisations, but I think this is a good enough start to illustrate the ideas that this paper introduces.
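The 'pass variables, receive an image' model amounts to building a URL. The Python sketch below shows the general shape; the endpoint and the `cht`, `chs` and `chd` parameter names are given as I understand the Chart API of the period and should be checked against its documentation, and the function name and scaling to a 0-100 range are assumptions of this example.

```python
from urllib.parse import urlencode

# Assumed endpoint for the chart service; verify before relying on it.
CHART_ENDPOINT = "http://chart.apis.google.com/chart"

def line_chart_url(values, width=400, height=200):
    """Build a chart image URL for a non-empty series of counts."""
    peak = max(values) or 1                      # avoid dividing by zero
    scaled = [round(100 * v / peak) for v in values]
    params = {
        "cht": "lc",                             # chart type: line chart
        "chs": "{}x{}".format(width, height),    # image size in pixels
        "chd": "t:" + ",".join(str(v) for v in scaled),  # text-encoded data
    }
    return CHART_ENDPOINT + "?" + urlencode(params, safe=":,")

print(line_chart_url([5, 10, 20]))
```

The resulting URL can be placed directly in an image tag, so no charting code has to run on the site itself.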

Initial Observations

Having constructed this framework and begun to import museum object data into it, I am already able to make some initial observations. First, holes in the information available are fairly apparent. Whilst there is data available for hundreds of thousands of objects from one museum, the proportion of these that have both an acquired year and a country of origin is fairly small, making it harder to visualise where and when the objects entered the collections. These data holes are unfortunate, and emphasise the importance of a well-kept set of object records.

Despite the holes, though, there's already enough data to allow some useful and interesting assessments to be made. A simple graph showing how many objects were acquired each year for different museums is instantly quite appealing, and the trends are easy to spot. Likewise, a map showing where in the world a museum's objects have been acquired from is also interesting - and for one museum, shows a clear bias towards coastal countries.

One interesting, and perhaps obvious, flaw to point out, though, is that these visualisations treat all objects as equal. That is, for the purposes of the map visualisation, a count is simply kept of how many objects have been acquired from each country. Countries with a higher count are shaded in a darker colour. This gives a reasonable account of the proportion of the collection that comes from different countries, but can easily be skewed. Two hundred small coins end up counting for far more than a single large train. In some ways, this is fine, as the collection of coins is in some ways bigger than the single train. However, it may be interesting to explore ways of weighting collections based on other factors, such as physical size, or perhaps even some measure of 'importance' (how this could be measured, I leave as an exercise to the reader). Physical dimensions are often stored in object records, but parsing this data into a machine-readable format would require a fair amount of work.
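The difference between counting and weighting can be shown with a short Python sketch. The `volume_m3` field, the five shade levels and the function names are hypothetical; they stand in for whatever measure of size or importance might be available.

```python
from collections import defaultdict

def country_totals(records, weight=None):
    """Sum objects per country; `weight` optionally scores each record."""
    totals = defaultdict(float)
    for r in records:
        if r.get("country"):
            totals[r["country"]] += weight(r) if weight else 1.0
    return dict(totals)

def shade_levels(totals, levels=5):
    """Map each country's total to a shade from 1 (light) to `levels` (dark)."""
    peak = max(totals.values(), default=0)
    if peak <= 0:
        return {}
    return {c: max(1, round(levels * t / peak)) for c, t in totals.items()}

# Two hundred small coins from one country, one large train from another.
coins = [{"country": "IN", "volume_m3": 0.00001}] * 200
train = [{"country": "GB", "volume_m3": 60.0}]
objects = coins + train

by_count = shade_levels(country_totals(objects))
by_volume = shade_levels(country_totals(objects, weight=lambda r: r["volume_m3"]))
print(by_count, by_volume)
```

With a simple count the coins dominate the map, whereas weighting by volume reverses the picture; neither is more correct, but each tells a different story about the collection.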

Finally, it's worth making the point that the objects in this prototype are all described in text form only. It's typically thought fairly critical to include photos of objects when representing them on-line, and taking these photos can consume a huge amount of time, effort and money. Because this paper examines an approach for representing a huge volume of objects (a 'quantitative approach') and focuses on the aggregation pages, though, the absence of images has less of an impact. If images were to be added, it perhaps would be less useful to have photos of the individual objects; instead, photos could be added to loosely represent the collections of objects - types of objects, objects from a particular country, etc. If this were the case, there would be less need to make sure that images were paired up precisely with the objects photographed in them, and so you could allow users to contribute photos - such as those taken in the museum.


The purpose of this paper is not to describe a finished piece of work; rather, it is to explore a direction in which museum object data could travel. The core of this idea is to open up the data that museums hold about their objects, and to expose the data in interesting ways that can be used to make observations about a museum's history. The work of creating the visualisations and overviews doesn't even have to be done by the museum itself - if the data is made public (or indeed, officially requested), interested third parties can interpret the data themselves. This spirit of openness allows museum objects to form part of the 'web of data' that is now emerging, and holds great potential to engage the public in the debate about the history and future of museum collections.

Whilst the most famous Web services have so far been driven by user-created content, there is a growing trend to place public information in the 'Web of data'. Examples from the UK include the excellent work of MySociety, which has moved political information such as the voting records of MPs and the debates from parliament into an open Web environment. There is no reason why museum data cannot exist here too.

Looking forward, once it is possible to compare collections from different museums, there is also the scope to look at private collections - of which the principal source may be eBay, where millions of collectable objects are acquired every week (indeed, sometimes by museums).

I'd like to conclude by pointing readers towards the on-line manifestation of the research that this paper is based on. It can be found at (from April 2008). The prototype is, of course, in perpetual 'beta', and your feedback is very welcome.


I'd like to thank all of the museums which graciously responded to my Freedom of Information requests for their object collections information, particularly those who took the time to export and send me their raw data. I'd also like to thank my colleague Anne Prugnon, who helped me with the Flash map code, and my colleagues Daniel Evans and Mike Ellis for their general support.


Chan, S. (2007). “Tagging and Searching – Serendipity and museum collection databases”. In J. Trant and D. Bearman (eds). Museums and the Web 2007: Proceedings. Toronto: Archives & Museum Informatics. Published March 31, 2007.

Coates, T. (2006). “Native to a Web of Data”. Presentation, first delivered at Future of Web Apps 2006, London. Retrieved January 31, 2007.

Engeström, J. (2005, April 13). Why some social network services work and others don't — Or: the case for object-centered sociality. Blog post. Retrieved January 31, 2007.

Engeström, J. (2007, April 13). What makes a good social object. Blog post. Retrieved January 31, 2007.

Haines, L. (2005, October 14). Google Earth: the black helicopters have landed. The Register. Retrieved January 31, 2007.

Cite as:

Roberto, F., Exploring Museum Collections On-line: The Quantitative Method, in J. Trant and D. Bearman (eds.). Museums and the Web 2008: Proceedings, Toronto: Archives & Museum Informatics. Published March 31, 2008. Consulted roberto/roberto.html