April 9-12, 2008
Montréal, Québec, Canada

Uniting The Shanty Towns - Data Combining Across Multiple Institutions

Sebastian Chan, Powerhouse Museum, Sydney, Australia

Abstract

This paper reports on the early phases of a pilot project which is mashing up datasets from several museums with data from other government departments. Can we liberate museum data and combine it with other live external data sets to create new value and use? If so, what kinds of meaning can we help people make? What do we need to do to the underlying museum data? What are the technologies that allow us to aggregate this data long after it was created for other purposes?

Keywords: mashup, datasets, reuse, museum data

Introduction

In the cultural sector we are cautious, careful, meticulous. This is a virtue - it ensures that our physical collections are preserved for future generations. But when applied to all areas of the business, this same virtue inhibits our ability to make the most of emerging opportunities - especially in digital space - in a timely manner.

Consumers now have a large range of information choices available to them on-line, and the longer museums wait to make their rich content widely accessible, the more difficult it will be for it to have maximum reach. Access now means far more than simply making content available on a corporate Web site. Content now must be usable and have some degree of personalisation for the behavioural mores of different customer groups. Looking into the future, it will also need to be portable and accessible via many different devices and in new ways.

Usability, personalisation and portability are difficult, both technically and in terms of policy.

Our Web sites have been built over long periods of time and on an incremental, project-by-project basis. The result is that most of our Web sites architecturally resemble a lively shanty town - functioning, vibrant, and teeming with life, but with no coherent design aesthetic at the structural level.

Traditionally the approach to Web shanty-towns has been to impose a top-down 'redesign' solution - usually a content management system. Whilst a content-management system can make significant inroads into 'better management' especially in museums where Web development skills have been outsourced and Web teams are non-existent or made up entirely of producers and editors, it is never a permanent solution. Even after institutions make the move to a centralised content-management system, there is inevitably important content left in legacy microsites - content that is too difficult to port and too important to take offline.

We have recently seen, with the disruptive impact of emerging social Web technologies, that content-management systems with their traditional 'enterprise' focus can make agile responses to these new technologies impossible. As a result some large organisations with technical development teams, like San Francisco's Exploratorium and Sydney’s Powerhouse Museum, have preferred to deploy content-management systems in limited ways across parts of their sites to retain flexibility and agility.

Taken individually, these different approaches to site design and information architecture can work well for their host institutions, but taken collectively, they make cross-institutional search and data combining very complex.

Is it possible to build cross-institutional initiatives without needing to compromise the institutions' own approaches? Is a bottom-up, consumer-focused approach to data access possible without top-down standardisation?

About NSW

In 2007 the New South Wales (NSW) State Government commissioned the Powerhouse Museum to build and pilot a series of lightweight prototypes to explore the potential value of data held in the cultural sector when combined with data from other areas of government. Known as About NSW, this project uses an agile development methodology to build a number of cross-sector initiatives in a short time. Amongst these are a pilot for a calendaring application, an education resource cataloguing and exposure application, and a geo-location-aware federated collection search.

About NSW is working with pre-existing publicly available data from the state's major cultural institutions - Powerhouse Museum, the Historic Houses Trust of NSW, the Art Gallery of NSW, the State Library of NSW, the Australian Museum, the NSW Heritage Office - as well as other NSW Government departments. The project is also exploring connections with complementary data held by third party repositories: Libraries Australia, Picture Australia, People Australia, Collections Australia Network, the Database of Australian Artists Online, and the Dictionary of Sydney.

This paper explores some of the About NSW prototypes and suggests that the cultural sector can find new value in old content through recombination with other datasets, and reach new audiences through better data visualisation. This can be a highly productive approach for organisations with large investments in rich content in times when funding for the creation of 'new' content is scarce.

How Popular Is Our Existing Content?

Despite impressive self-reported on-line visitation figures, cultural sector Web sites represent a tiny fraction of total Internet usage.

In Australia the largest cultural sector organisations - museums, galleries and libraries - are government funded, and their Internet presences tend, at least historically, to be in the .gov namespace. As such, ISP-side measurement company Hitwise Australia (http://www.hitwise.com.au) counts their traffic in the 'government' category.

A look at Australian Internet usage for the week ending January 19, 2008 showed that government Web sites as a whole represented 4.815% of Australian Web sites visited, and 2.319% of all Web sites visited. The most popular government Web site in Australia is the Bureau of Meteorology which commands over a 25% share of traffic to government Web sites (or 1.2% of all Web sites).

The top 50 NSW State Government Web sites represent only 0.64% of the total content viewed by Australians. Of this small percentage of total traffic, less than 4% went to the major cultural institutions. The Powerhouse Museum recorded 0.84%, the Australian Museum 1.16%, the Art Gallery of NSW 0.40%, Taronga Zoo 0.55%, whilst the Historic Houses Trust did not make the Top 50.
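The arithmetic behind these figures can be checked with a short sketch. It assumes that the institutional percentages are shares of the traffic to the top 50 NSW Government Web sites, and that this share can simply be multiplied by the 0.64% figure to approximate a share of all content viewed; the variable names are illustrative only.

```python
# Shares of traffic to the top 50 NSW Government Web sites, as quoted above (percent).
cultural_shares = {
    "Powerhouse Museum": 0.84,
    "Australian Museum": 1.16,
    "Art Gallery of NSW": 0.40,
    "Taronga Zoo": 0.55,
}

# Share of total content viewed by Australians taken by the top 50 sites (percent).
top50_share_of_all_content = 0.64

# Combined share of the top-50 traffic taken by the four listed institutions.
combined = sum(cultural_shares.values())

# Implied share of all content viewed, ASSUMING the two percentages multiply.
implied_share_of_all = top50_share_of_all_content * combined / 100

print(f"combined share of top-50 traffic: {combined:.2f}%")
print(f"implied share of all content viewed: {implied_share_of_all:.4f}%")
```

On these assumptions the four listed institutions together account for about 2.95% of the top-50 traffic, which is roughly 0.019% of all content viewed.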

The most popular NSW Government Web sites were, unsurprisingly, the combined public transport network of buses, trains and ferries (~17%), the Roads and Traffic Authority (~10%) which provides motor vehicle registration services, and NSW Lotteries (~7%) which publishes lottery results.

These figures should not be surprising.

The low representation of cultural sector Web sites is due to the nature of the Internet as a whole. The majority of Internet activity revolves around communication, search, IT companies, entertainment, shopping, news, sport, gambling and adult services.

Cultural sector Web sites focus on attracting visitors to their physical sites. This focus has limited the exposure and impact of much of the other rich and diverse content which these organisations generate. This rich content could tap into rich veins of traffic around entertainment and social communication. With physical visitation limited by geography, tourism patterns and fluctuations in exhibition content, there is a cap to the total on-line visitation reachable by operating cultural sector Web sites purely with a focus on physical visitation.

Museums struggle with strategies for converting 'casual Internet visitors' to 'physical visitors'. Unpublished internal research shows that even as late as 2006, newspaper advertising and word-of-mouth overwhelmingly dominate as primary communication channels with key audiences. At the same time, the pressure to increase 'self-generated' revenue in the short-term and reduce dependence on government funding does little to encourage the exploration and exposure of other content types for which revenue models are still in their infancy.

For those who have experimented, the 'long tail' of museum content already provides a significant traffic driver, especially for the larger pool of international audiences in North America and Europe. This is heavily reliant upon collection-related content being made available; being presented in a highly usable form; and being optimised for search engine exposure. Whilst both the Powerhouse Museum (with high on-line interest in fashion, numismatics and transportation) and especially the Australian Museum (with high on-line interest in spiders, fish and Australian animals) exploit their content long tail effectively with international audiences, neither has found a suitable way of converting this traffic to revenue in any significant way except through encouraging physical visitation - which assumes these international visitors are also near-term tourists.

The Other Side Of Traffic

The institutions already analyse the behaviour and intentions of traffic that gets to their Web sites. But they don’t have the resources to look at the traffic that might best match their content but never reaches their site. Much like local real world audiences who are looking for somewhere to take the family but never think to go to a museum, these audiences need to be approached and guided to appropriate content. So, where do they go instead?

Given the low intentional usage of government Web sites for information seeking, the About NSW team examined the behaviour of local audiences in selecting Web sites with 'educational' or 'reference' content. Not surprisingly, because of the dominance of search engine results, Wikipedia (http://www.wikipedia.org) represents around 25% of this traffic for Australian audiences. This is partly a result of the way that Hitwise measures traffic - conflating the users who visit Wikipedia to look up Britney Spears with those who are looking up kangaroos (although arguably both are 'educational' in different ways).

Given that Wikipedia pages rank very highly in Google search results, and that Wikipedia is very likely to be the first port of call for general knowledge questions, then regardless of your opinion of Wikipedia (and what teachers and librarians might discourage), the real questions are: how much of the state's cultural and historical content is represented in Wikipedia, and how popular is it? By answering these, it becomes possible to use Wikipedia as a reasonably reliable microscope through which to examine the 'demand' side of Internet content (Spoerri, 2007).

By taking data from a local Wikipedia mirror we were able to datamine the prevalence of relevant state-related people, places, historical events and 'things' held in Wikipedia at the time of the snapshot. Further, we were able to examine their relative popularity and extrapolate this to produce a popularity index. WikiCharts (http://tools.wikimedia.de/~leon/stats/wikicharts/) can be used to perform a similar analysis but without the depth of data-mining.

Mapping the most popular and relevant topics in Wikipedia to the content held by the state institutions makes it possible to generate a list of high-demand content. This list is based on an approximation of user demand rather than on the traffic that currently reaches each institution. However, acting on this information requires editorial choices to be made.
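A minimal sketch of this mapping step follows. The paper does not specify how the popularity index is calculated, so the scaling used here (views relative to the most viewed article in the snapshot) is an assumption, as are the function names, the institution-to-topic structure and the view counts, which are invented for illustration.

```python
from collections import defaultdict


def popularity_index(page_views):
    """Scale raw view counts to 0-100 relative to the most viewed article."""
    top = max(page_views.values(), default=0)
    if top == 0:
        return {title: 0.0 for title in page_views}
    return {title: 100.0 * views / top for title, views in page_views.items()}


def high_demand_content(page_views, holdings):
    """Rank topics by popularity and list the institutions holding matching content.

    holdings maps an institution name to the set of topics it has content about.
    Topics that no institution holds are dropped from the result.
    """
    index = popularity_index(page_views)
    holders = defaultdict(list)
    for institution, topics in holdings.items():
        for topic in topics:
            if topic in index:
                holders[topic].append(institution)
    ranked = sorted(holders, key=lambda topic: index[topic], reverse=True)
    return [(topic, round(index[topic], 1), sorted(holders[topic])) for topic in ranked]


# Invented figures for illustration only; these are not real Wikipedia statistics.
views = {"Sydney Harbour Bridge": 4000, "Funnel-web spider": 2000, "Britney Spears": 9000}
holdings = {
    "Powerhouse Museum": {"Sydney Harbour Bridge"},
    "Australian Museum": {"Funnel-web spider"},
}

for row in high_demand_content(views, holdings):
    print(row)
```

The output is only a ranked candidate list; deciding which of these topics to act on remains an editorial choice.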

Obtaining Data

Once choices are made about which content best matches latent demand, the problem becomes one of obtaining data.

There are two main problems with obtaining data dynamically. The first is that on the shanty-town Web sites, data are not uniformly formatted or structured. The second is that on Web sites with content-management systems, often data in a usable export format such as XML require 'additional development' or the purchase of 'additional modules'. Even if data were to be exportable, the common problems of field mapping would usually still inhibit data sharing.

The About NSW solution has been a very practical one - a 'best guess'. Acknowledging that regardless of policy directives, cultural sector data will inevitably be presented in different ways, and the same types of things described in different ways, we have been using screen-scraping and complex HTML parsing. Data are scraped from the relevant sections of partner Web sites using Mechanize, and then the messy HTML is parsed with Beautiful Soup. With some custom configuration and regular expression scripting, these tools do a 'good enough' job of harvesting semi-structured information from most Web sites and storing it in a structured database.
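The parse-then-pattern-match idea can be sketched as follows. This is not the project's code: to keep the example runnable without third-party packages, Python's standard-library HTMLParser stands in for Beautiful Soup, and the page is a string rather than something fetched with Mechanize. The markup, the 'event' class name and the event titles are invented.

```python
import re
from html.parser import HTMLParser


class EventListParser(HTMLParser):
    """Collect the text of every <li class="event"> element, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._collecting = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        self._flush()  # an unclosed <li> ends where the next <li> starts
        classes = (dict(attrs).get("class") or "").split()
        self._collecting = "event" in classes

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush()

    def handle_data(self, data):
        if self._collecting:
            self._chunks.append(data)

    def _flush(self):
        if self._collecting:
            text = " ".join("".join(self._chunks).split())  # collapse whitespace
            if text:
                self.events.append(text)
        self._collecting = False
        self._chunks = []


# Matches "<title> - <day> <month> <year>" or "<title>: <day> <month> <year>".
EVENT_RE = re.compile(
    r"^(?P<title>.+?)\s*[-:]\s*(?P<date>\d{1,2}\s+[A-Za-z]+\s+\d{4})$"
)


def scrape_events(html):
    """Return a list of {'title', 'date'} records harvested from messy HTML."""
    parser = EventListParser()
    parser.feed(html)
    parser.close()
    records = []
    for text in parser.events:
        match = EVENT_RE.match(text)
        if match:
            records.append({"title": match["title"], "date": match["date"]})
    return records


# Invented markup: note the first <li> is never closed.
SAMPLE_HTML = """
<ul>
  <li class="event"><b>Example exhibition</b> - 4 December 2008
  <li class="event">Kids robot workshop: <i>12 April 2008</i></li>
  <li class="nav">Home</li>
</ul>
"""

print(scrape_events(SAMPLE_HTML))
```

In practice each partner site would need its own selector and pattern, which is the 'custom configuration' referred to above.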

An Example - Scraped Calendars

One of our key prototypes is a cross-sector calendaring service which aims to unite the various site calendars that exist across different cultural sector agencies. Ideally agencies would agree to have similarly formatted RSS feeds from their event and exhibition calendars; however the reality is that none of the major cultural institutions in NSW have an RSS feed of events and exhibitions. Some, including the Powerhouse Museum, do not even store events or exhibitions in a database.

Worse still, each agency describes events and exhibitions in different ways. Some have events with sub-events (for example, a festival with its own start and end dates that is made up of sub-events); some have events that recur in non-standard ways; others even change the way they present date and time data about their events. None use microformats within their events calendar code, and HTML code is of varying quality.

However, for promotional purposes each agency wants to have its calendar spread widely to newspaper portals (for example Citysearch: http://www.citysearch.com.au) and other on-line aggregators who target niche audiences (for example Kidspot: http://www.kidspot.com.au, MyTickets: http://www.mytickets.com.au, Upcoming.org: http://upcoming.yahoo.com). Agencies manually contact these portals by emailing publicity contacts, or by simply waiting for these third party sites to come and collect the data from them. This is sub-optimal from a publicity perspective as well as from a technical perspective.

Each agency would benefit from being able to have an RSS feed of events as well as supporting iCal, yet none has the capacity or motivation to deploy this as a high priority.

By scraping and parsing their existing content, however, About NSW is able to deliver RSS and iCal structured data back to the institutions, as well as providing a backend administrative toolset for agencies to 'clean up' their scraped data if they so wish. These RSS and iCal feeds can then be aggregated more easily by commercial content portals, as well as used by the institutions for other purposes.
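As an illustration of the return feed, the sketch below turns harvested event records into a minimal iCalendar document. The record fields (uid, title, start, end, origin_url, harvested), the PRODID value and the example URL are assumptions made for this example, not the project's schema, and long lines are not folded as a production feed would require.

```python
from datetime import date, datetime, timedelta


def ical_escape(text):
    """Escape characters that have special meaning in iCalendar text values."""
    return (
        text.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def event_to_vevent(event):
    """Render one harvested all-day event as a list of VEVENT lines."""
    # For all-day events DTEND is exclusive, so add one day to the last day.
    end_exclusive = event["end"] + timedelta(days=1)
    return [
        "BEGIN:VEVENT",
        f"UID:{event['uid']}",
        f"DTSTAMP:{event['harvested']:%Y%m%dT%H%M%SZ}",
        f"DTSTART;VALUE=DATE:{event['start']:%Y%m%d}",
        f"DTEND;VALUE=DATE:{end_exclusive:%Y%m%d}",
        f"SUMMARY:{ical_escape(event['title'])}",
        f"URL:{event['origin_url']}",
        "END:VEVENT",
    ]


def events_to_ical(events):
    """Wrap the events in a VCALENDAR, using the CRLF line endings iCalendar expects."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//harvested events//EN"]
    for event in events:
        lines.extend(event_to_vevent(event))
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


# Hypothetical harvested record; the URL and identifiers are placeholders.
sample_event = {
    "uid": "event-1@example.org",
    "title": "Robots, rockets; and more",
    "start": date(2008, 4, 12),
    "end": date(2008, 4, 13),
    "origin_url": "http://www.example.org/events/1",
    "harvested": datetime(2008, 3, 1, 9, 30, 0),  # assumed to be UTC
}

print(events_to_ical([sample_event]))
```

Keeping the origin URL in each entry means that anyone consuming the feed can still be sent back to the source institution.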

Although each institution has categorised its events in different ways for its own sites, the data can be harmonised externally on About NSW where it is reorganised on stripped down criteria for a general audience. Because each harvested event carries an 'origin URL', users can click through to the partner Web site for more detail and event specifics.
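External harmonisation of this kind can be as simple as a lookup table maintained outside the partner sites. The sketch below is illustrative only: the institution names, category labels and shared category set are invented, and the real criteria used on About NSW are not described here.

```python
# Hypothetical mapping from each partner's own labels to a shared, stripped-down set.
CATEGORY_MAP = {
    ("Museum A", "school holiday program"): "families",
    ("Museum A", "temporary exhibition"): "exhibitions",
    ("Gallery B", "kids & families"): "families",
    ("Gallery B", "talks and lectures"): "talks",
}


def harmonise(event):
    """Return a copy of a harvested event with its category mapped to the shared set.

    Unknown labels fall back to 'other' rather than being discarded, and the
    origin URL is carried through so users can click back to the partner site.
    """
    key = (event["institution"], event["category"].strip().lower())
    return {
        "title": event["title"],
        "institution": event["institution"],
        "category": CATEGORY_MAP.get(key, "other"),
        "origin_url": event["origin_url"],
    }


harvested = [
    {"title": "Holiday robots", "institution": "Museum A",
     "category": "School Holiday Program", "origin_url": "http://www.example.org/a/1"},
    {"title": "Family Sunday", "institution": "Gallery B",
     "category": "Kids & Families ", "origin_url": "http://www.example.org/b/7"},
    {"title": "Late opening", "institution": "Gallery B",
     "category": "After hours", "origin_url": "http://www.example.org/b/9"},
]

for event in harvested:
    print(harmonise(event))
```

Because the mapping lives with the aggregator, each institution can keep categorising events in its own way on its own site.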

In the future it is possible that About NSW will provide a calendar widget effectively offering agencies the ability to use About NSW to store all their calendar-based content and then deploy this to their own site, eliminating the need for scraping.

This 'opt-in centralisation' approach offers the benefit of a 'try-before-you-centralise' approach as well as the interim benefits of a return feed of well-structured, microformat-ready data obtained from their existing mess. It also provides a useful test case for a whole-of-government implementation.

Another Example - Education Resource Finder

Also ripe for content aggregation are teacher and education resources. Each cultural sector Web site creates hundreds of these as HTML pages, PDFs and microsites. They catalogue them in different ways, call them different names, and expect time-poor teachers and answer-hungry students to blunder their way through them to find what they are looking for.

In the education sector [through Education Network Australia (EDNA – http://www.edna.org.au) and in NSW the Department of Education and Training's Teacher and Learning Exchange (TALE – http://www.tale.edu.au)], education specialists and educators themselves have built sophisticated searchable databases of relevant content. Whilst EDNA can be accessed by students and teachers from their homes, TALE is predominantly a walled-garden Intranet system and cannot.

About NSW is building another aggregator and cataloguing tool for these resources, and connecting them up across the institutions around user-centric topics and themes. Resources are being connected with visit trails - encouraging real-world excursions as well as content access. Through exposing them in this way they can be better indexed for niche audiences and also exposed to wider audiences.

Consumer-Oriented Location-Aware Data Access

Adrian Holovaty quickly rose to fame for Chicago Crime (http://www.chicagocrime.org), a project that made 'usable' a mess of data provided by the Chicago Police Department (http://gis.chicagopolice.org/). Whilst the Chicago PD had a statutory responsibility to release data about crimes, it had no responsibility to make it 'user-friendly'. One of the project’s great successes is that it doesn't presume a 'correct' way of filtering crime data. Users can browse crimes by type, date, location or any permutation thereof. It was also possible to use a Google Maps interface to map out a route through a neighbourhood (for example, the daily path you might walk to the station, or to do the shopping) and see the frequency and type of crimes that had occurred along that particular route.

Unfortunately it is rare to find the public release of this kind of data being mandated in the way that Chicago PD was. Another project, this time by Stamen Design, called Oakland Crimespotting (http://oakland.crimespotting.org/), tried a similar approach to crime in Oakland, only to find that the government department sites from which it was scraping data cut their feed. Stamen is now working with Oakland City to restore access to data.

Most democratic governments already release enormous amounts of data to their publics, but rarely is it in a consumer-friendly form. This is often not intentional, but the result of the large and sometimes archaic enterprise systems in which this data is stored.

The cultural sector should know its audiences well. Museum Web sites undergo rigorous evaluation and emerge redesigned on a regular basis. But the cultural sector is not always the best judge of how other non-cultural sector audiences might benefit from its datasets. Nor is there enough research into how users might wish to traverse collections from multiple institutions using an artist's name, a particular date or a place as their guide.

There are good examples from the library world of ways that enterprise data can be recontextualised and given new life. Much in the same way that some library catalogues use the Amazon database (http://aws.amazon.com/) to pull in images and related works for library patrons, commercial projects like Library Thing (http://www.librarything.com) now connect to library catalogues to make it easier for their customers to form 'collections'.

An Example - Federated Location-Centric Search, MySuburb

For the About NSW project pilots, we chose to focus on making a cross-sector collection search location-aware. The broad concept behind the MySuburb prototype is similar to that used by Holovaty's most recent commercial project, Everyblock (http://www.everyblock.com). Where Everyblock is a location-sensitive news aggregator which shows all the news related to a 'place', MySuburb uses the same start point - where you are - to begin a federated search of cultural collections. Similarly, with the growth of location-aware mobile devices, institutions in the cultural sector need to explore the issues involved in making their data location-aware so that they can best prepare for future delivery methods.

There are several complex problems needing to be solved to make this work in a customer-centric and seamless manner.

First, very few cultural collections are geo-tagged. If they are, then they are tagged in different ways. Taking the Powerhouse Museum's own collection, we discovered that out of nearly 70,000 collection records publicly available on our Web site, only 17,000 had any data in a set of fields relating to 'place'. Of those 17,000 with data, there were only 1,100 unique places. Because places are not recorded in the Powerhouse collection management system with a controlled vocabulary, there is great inconsistency in how place data is recorded. Some objects have a granularity down to the suburb level whilst most are only down to the city or even state level.

Further, in the Powerhouse collection there is a wealth of 'unstructured' place-related data - just not in the 'place' field. Instead, places, even down to the street level, are stored in context in rich statements of significance or in production notes.

Over at the Australian Museum, species data are publicly available based on collection locations - not a good indicator of the locations of living populations. And at the Art Gallery of NSW, artists have limited public location data attached to them, but not to the works that they have produced throughout their career. In contrast, the NSW Heritage Office has street addresses of all of the registered properties which have heritage value.

Cleaning Up Place Data

For those objects that have structured place data, we need to perform some cleaning up tasks before determining their latitude and longitude. By harnessing several publicly available Web services, About NSW is able to make better sense of incomplete place data. First, data are scraped from existing fields and fed through the Flickr API (http://www.flickr.com/services/api/), which returns the possible locations with parent data as structured XML. In this way we are able to turn 'Sydney' into 'Sydney, New South Wales, Australia'. Flickr does not offer a latitude and longitude resolution service, so the Geonames service (http://www.geonames.org/export/) is used to convert this improved location data into latitude and longitude co-ordinates. These co-ordinates can then be plotted on Google Maps and used to calculate distances, so that other content within a specified radius can be shown.
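Once records carry co-ordinates, the radius calculation is straightforward. The following sketch uses the haversine great-circle formula, which is one common way of doing this and is not necessarily what the project uses; the record structure is hypothetical and the co-ordinates are approximate values included only for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(centre, records, radius_km):
    """Return the records within radius_km of centre, nearest first."""
    lat0, lon0 = centre
    hits = []
    for record in records:
        distance = haversine_km(lat0, lon0, record["lat"], record["lon"])
        if distance <= radius_km:
            hits.append((distance, record))
    hits.sort(key=lambda pair: pair[0])
    return [record for _, record in hits]


# Approximate co-ordinates, for illustration only.
records = [
    {"place": "Newcastle", "lat": -32.93, "lon": 151.78},
    {"place": "Parramatta", "lat": -33.815, "lon": 151.00},
    {"place": "Sydney", "lat": -33.87, "lon": 151.21},
]

nearby = within_radius((-33.87, 151.21), records, radius_km=25)
print([record["place"] for record in nearby])
```

A record geocoded only to city or state level will still be placed at a single point, so coarse place data can make an object appear closer to the user than it really is.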

For objects with unstructured place data that exists only in large blocks of text, the process is more complex. We are first piloting with the Powerhouse collection a process which involves the parsing of the descriptive fields in object records for 'possible' place matches, based on regular expressions. Once processed, we are then able to compare each possible place with a list of 'known' places. Currently we use the NSW Department of Lands place database, the Geographical Names Register of NSW (http://www.gnb.nsw.gov.au/name_search) which contains over 80,000 placenames including historical data and alternative namings. In the future it would be possible to use the same methodology to include other lists of placenames such as indigenous placenames, as well as contested maps and territories.
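The two-stage process described above can be sketched as follows. The regular expression, the matching strategy and the three-entry placename list are all assumptions standing in for the project's own expressions and for the Geographical Names Register; the sample text is invented.

```python
import re

# Candidate places: runs of capitalised words such as "Granville" or "Broken Hill".
CANDIDATE_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def match_run(words, known):
    """Find known placenames inside one run of capitalised words, longest first."""
    matches = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest span starting at i
            key = " ".join(words[i:j]).lower()
            if key in known:
                matches.append(known[key])
                i = j
                break
        else:
            i += 1  # no placename starts at this word
    return matches


def find_places(text, gazetteer):
    """Return known placenames mentioned in free text, in order of first appearance."""
    known = {name.lower(): name for name in gazetteer}
    found = []
    for candidate in CANDIDATE_RE.finditer(text):
        for place in match_run(candidate.group(0).split(), known):
            if place not in found:
                found.append(place)
    return found


# A tiny stand-in for a list of 'known' places.
gazetteer = ["Broken Hill", "Granville", "Parramatta", "Ultimo"]

# Invented text in the style of a production note.
notes = "Made in Granville and later used near Broken Hill. Donated by a resident of Ultimo."

print(find_places(notes, gazetteer))
```

Matches found this way are only 'possible' places: a surname or company name that happens to match a placename would also be returned, so the results still need review before being written back to a record.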

This newly cleaned structured data can also be returned to the participating institutions to improve their own datasets within their own organisations. Improved geo data are of enormous benefit to institutions beginning to conceptualise their own location-aware projects and will allow them to be better ready for location-sensitive data delivery (Bearman and Geber, 2007).

Cross-Linking Cultural Content On A Map

The project team is well aware that simply making cultural content available on a map interface is not enough. To tap into other audiences, the project is experimenting with mapping other publicly available data around known location-centric interests - news content, demographics, real estate information - tapping practical concerns and using these as alternative entry-points into layers of cultural content.

These datasets are visualised on a combination navigation map and choropleth.

Once the data become available, users will be able to visit MySuburb and map heritage buildings, discover famous historical figures and events from their suburb, discover flora and fauna in the local environment, and then use all the data as starting points to explore cultural collections.

Overlaid across this data can be other demographic content.

We are initially experimenting with baby name data from NSW Registry of Births, Deaths and Marriages. This information is very popular and is publicly released as a table of popular names each year on the Births, Deaths and Marriages site (http://bdm.nsw.gov.au/popularBabyNames.htm). We have built a simple tool to allow it to be explored interactively, and it will soon be mappable - showing popular birthnames by year and place - revealing the patterns of community naming through time and space. We are exploring whether names can then also be used as a casual entry point to cultural collections; for example, 'show me all the work held in the Art Gallery by Australian artists named William' from my suburb.

Conclusion

About NSW provides a case study for the potential of cultural sector data. Freed from the usability problems of enterprise systems and connected to other similar data held elsewhere, About NSW is enabling wider exposure of the state’s collection data, events calendars and educational resources.

Whilst traditional approaches would mandate a complex process of top-down standardisation for known audiences, the About NSW approach is a bottom-up process that is more practical and agile, and less restrictive and cumbersome. The project acknowledges that it does not know how users will eventually end up using the data made available, but instead attempts to make the data available in as many formats as possible and with as little impact on the source institution as possible.

The project is still very much in its infancy. Future research is required to assess its success or failure, as well as detailed investigative work to find out how data end up being utilised.

Acknowledgements

This paper would not have been possible without the work of the extended project team: Paul McCarthy (NSW Department of Commerce), Renae Mason (Powerhouse Museum), Daniel McKinlay (Powerhouse Museum), Dr Greg Turner, Luke Metcalfe, and Adam Ullman.

References

Bearman, D., and K. Geber. “Enhancing the Role of Cultural Heritage Institutions through New Media: Transformational Agendas and Projects”. In International Cultural Heritage Informatics Meeting (ICHIM07): Proceedings. Ed. J. Trant and D. Bearman. Toronto: Archives & Museum Informatics, 2007. Published September 30, 2007 at http://www.archimuse.com/ichim07/papers/bearman/bearman.html

Spoerri, A. “What is Popular on Wikipedia and Why?” In First Monday, volume 12, number 4 (April 2007). http://firstmonday.org/issues/issue12_4/spoerri2/index.html


Cite as:

Chan, S., Uniting The Shanty Towns - Data Combining Across Multiple Institutions, in J. Trant and D. Bearman (eds.). Museums and the Web 2008: Proceedings, Toronto: Archives & Museum Informatics. Published March 31, 2008. Consulted http://www.archimuse.com/mw2008/papers/chan/chan.html