Archives & Museum Informatics: MW99

Published: March 1999.

How Forcible are Right Words! *: Overview of Applications and Interfaces Incorporating the Getty Vocabularies

Patricia Harpring, Getty Information Institute, USA

Introduction

The three Getty vocabularies - the Art & Architecture Thesaurus®, the Union List of Artist Names®, and the Getty Thesaurus of Geographic Names™ - were initially used by museums, libraries, and archives to control terminology in catalog entries. In the last several years, they have increasingly been used to provide access across disparate data sets in networked environments. This paper will provide an overview of the various ways in which the Getty vocabularies may be used in specific implementations. How may the vocabularies be integrated into user interfaces? What are the data elements and structure of the vocabularies that enhance retrieval? What problems arise in using vocabularies in Web applications? This paper will explore these and other issues that arise in implementing the Getty vocabularies.

What Are the Getty Vocabularies?

The Getty vocabularies are collections of names and other information about people, places, and things in the realm of art and cultural heritage, linked together to show relevant relationships. The vocabularies are thesauri intended to supply scholarly information about - and to be sources of terminology for cataloging and retrieving records about - visual arts and cultural heritage.

The Art & Architecture Thesaurus® (AAT) is a thesaurus of terms and other information used to describe and catalog art objects, architecture, decorative arts, archival and textual materials, images, and material culture. The temporal range of the AAT stretches from antiquity to the present. The Union List of Artist Names® (ULAN) contains names, nationalities, dates, roles, and other biographical information about identified individuals or groups of individuals working together in the conception or production of visual arts and architecture. Coverage in ULAN is global, although currently there is a preponderance of Western European and North American "makers." The scope of the ULAN ranges from antiquity to the present. The Getty Thesaurus of Geographic Names™ (TGN) is a vocabulary composed of names, place types, coordinates, and other information about geographic places. The TGN currently focuses primarily on the modern world, although there is a growing number of records describing archaeological sites and other historical places.

Figure 1: Some of the elements of an AAT record.

Figure 2: Some of the elements of a ULAN record.

Figure 3: Some of the elements of a TGN record.

Vocabularies are often necessary to catalog and gain access to automated information because the names used to refer to a given person, place, or thing may be different in various languages and may change over time. Also, there is often confusion regarding which term refers to which concept. The Getty vocabularies collocate various names that refer to the same concept, and link concepts to broader and narrower concepts through hierarchical relationships, in addition to providing scholarly, historical, and other information about the person, place, or thing.

Information in the Getty vocabularies is contributed by Getty projects and outside contributors. Some of the current contributors to the Getty vocabularies include bibliographic and documentation projects, such as the Bibliography of the History of Art (BHA), the Getty Provenance Index, the Getty Vocabulary Program, various projects in the J. Paul Getty Museum, the Getty Research Institute, and the Getty Conservation Institute, the Canadian Centre for Architecture, the Frick Art Reference Library, the Smithsonian Museum of African Art, the Mystic Seaport Museum, and the Conservation Department of the Harry Ransom Humanities Research Center at the University of Texas at Austin.

The Getty has been involved in building vocabularies since the mid-1980s. The role of the Getty in the compilation of the vocabularies has been primarily to provide content. However, the Getty has also been involved in a limited number of research projects to demonstrate the utility of the vocabularies as search assistants. As the Getty vocabularies continue to be more widely distributed, future experimentation in this area will likely also include broader involvement by industry leaders in the field of search-and-retrieval technology.

How Are the Getty Vocabularies Used?

The Getty vocabularies may be used for three purposes: 1) as "knowledge bases" for researchers wishing to learn about the concepts they describe; 2) to supply vocabulary for catalog records for art and cultural heritage; and 3) to supply names (including variant spellings, names in various languages, and historical names) for use in retrieval tools to gain access to art and cultural heritage information across different resources in digital form.

Knowledge Bases: In their role in providing information for researchers and other interested users, the vocabularies are typically accessed through the Web. For example, the Getty vocabularies are currently released in Web applications hosted at the Getty (generally known to the vocabulary users as "browsers"). They are used by various Getty projects and many other institutions for research and to aid in making catalog records. The general public also uses the vocabularies on the Web heavily. Judging from the IP addresses of the users and from their correspondence to us, the queries come mainly from universities, libraries, and other scholarly and academic researchers. Some queries are from individuals doing personal research on, for example, genealogy or in planning a trip. At minimum in such "browser" implementations, users should be allowed to search by spelling a term or to explore for terms by navigating through the hierarchies.

Querying by spelling the name or term is the method used most often to access information in the vocabularies. It is helpful if users can access vocabulary records by truncating names or by looking for individual words (keywords) in a name, for example, finding "Gentileschi, Artemisia" in ULAN by looking for the truncated "Gentiles*" (where the asterisk is a wildcard), or finding "Champlain, Lake" in TGN by looking for keywords "Lake AND Champlain."
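The two query styles described above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of the Getty browsers: the sample list, the function names, and the whole-word matching rule are all assumptions made for the example.

```python
import fnmatch
import re

# Illustrative sample only; real vocabulary records hold far more than a name.
NAMES = [
    "Gentileschi, Artemisia",
    "Gentileschi, Orazio",
    "Champlain, Lake",
    "Geneva, Lake",
]


def wildcard_search(pattern, names):
    """Return names matching a pattern in which '*' stands for any characters."""
    lowered = pattern.lower()
    return [n for n in names if fnmatch.fnmatchcase(n.lower(), lowered)]


def keyword_search(keywords, names):
    """Return names that contain every keyword as a whole word (Boolean AND)."""
    wanted = {k.lower() for k in keywords}
    return [n for n in names if wanted <= set(re.findall(r"\w+", n.lower()))]
```

Here `wildcard_search("Gentiles*", NAMES)` finds both Gentileschi entries, and `keyword_search(["Lake", "Champlain"], NAMES)` finds only "Champlain, Lake", regardless of the order in which the keywords are given.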

Figure 4: ULAN Web application, http://www.gii.getty.edu/ulan_browser (accessed January 14, 1999). A search by name is truncated with a wildcard.

To provide greater access to the information in the vocabularies, additional search criteria are desirable. Narrowing the search to a particular part of the hierarchy would be helpful. For example, in the TGN there are more than 480 places with "Washington" as a component of the name. Narrowing a search to a particular state in the US would allow users to retrieve a more manageable result set, thus making it easier for them to find the particular Washington that interests them. In the ULAN, it would be helpful to narrow queries by the nationality or ranges of life dates for an artist. Other information in the vocabulary record could also provide useful access, for example ranges of geographic coordinates in the TGN or the text of the scope notes in the AAT.
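Narrowing a name search to one branch of the hierarchy might look like the following sketch. The `Place` structure, the four sample records, and the `within` parameter are assumptions made for illustration; they do not reflect the actual TGN record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    place_type: str
    parents: tuple  # broader places, nearest first (simplified)


# A handful of sample records standing in for the 480+ "Washington" places.
PLACES = [
    Place("Washington", "inhabited place", ("Pennsylvania", "United States")),
    Place("Washington", "inhabited place", ("Georgia", "United States")),
    Place("Washington", "state", ("United States",)),
    Place("Mount Washington", "mountain", ("New Hampshire", "United States")),
]


def search_places(name_part, within=None):
    """Find places by name, optionally restricted to those under a broader place."""
    hits = [p for p in PLACES if name_part.lower() in p.name.lower()]
    if within is not None:
        hits = [p for p in hits if within in p.parents]
    return hits
```

An unrestricted call, `search_places("Washington")`, returns all four sample records; adding `within="New Hampshire"` reduces the result to the single mountain.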

Another issue in querying the vocabularies is how to show the information once it is retrieved. Since the vocabularies are very rich and complex, decisions must be made regarding how to display the information without confusing or overwhelming the user. The Getty browsers typically present the information in three ways. An initial results list includes a brief reference to each concept (for TGN, a "preferred" name, place type, and hierarchical context). From here, the user can either view the full record for the concept, or view the concept in the full hierarchical display. Displays are designed with the goal of presenting as much information as necessary in a clear and coherent way.

Figure 5: TGN Web application, http://www.gii.getty.edu/tgn_browser (accessed January 14, 1999), results list. Scrollable results list shows names of places that met the search criteria, plus enough information to identify each place.

Figure 6: TGN Web application, full record display. User may scroll down to see coordinates, notes, names, place types, and bibliography for the TGN record. The display is arranged with the information typically most sought after at the top of the screen.

Figure 7: TGN Web application, hierarchical display. User may view the place in the full hierarchy and browse the hierarchy from there. Since the hierarchy is so dense, not all levels of the hierarchy are displayed at once.

To enhance and supplement information in the vocabularies, links could be made to other relevant resources on the Web. For example, linking TGN records to maps or ULAN records to examples of an artist's work would be useful.

Cataloging Aids: In their role as aids to catalogers, the Getty vocabularies have long been used in hard copy to control vocabulary; this is particularly true of the oldest vocabulary, the AAT. In recent years, the vocabularies have been integrated into collection management systems to allow easier access. Since the goal of a cataloger is to find the most appropriate term for his use, a vocabulary browser is generally a component of such a system (e.g., in The Museum System by Gallery Systems).

Figure 8: Screen from the Museum System by Gallery Systems (information is available at http://gallerysystems.com, accessed January 15, 1999), TGN hierarchical display. The display shows the hierarchical position of the target place on the left and its variant names on the right.

Since the audience of a collection management vocabulary browser is narrower than that of a general use browser, the issues are slightly different. In a general use browser, the developers cannot accurately anticipate all the questions the general public may want to ask of the database. Therefore, the main issues usually have to do with how to provide the most versatile access to the data without making an overly complicated interface. By contrast, users of a collection management or other cataloging system are more likely to ask the same sorts of questions of the vocabulary at predictable places in the object's catalog record. In order to increase speed, efficiency, and consistency in cataloging, implementors may find it useful to limit the number of terms available at any given point in the catalog record. For example, in a system that incorporates the AAT and is used to catalog an art collection, for the field that captures the object type, users are likely to want terms from the AAT Objects Facet - for example, "painting," "photograph," "drawing," or "altarpiece." It may be desirable to point the cataloger to pertinent parts of the AAT, perhaps restrict or prohibit access to irrelevant hierarchies (for example, People, Organizations, and Events), or even to build a pick list of appropriate AAT terms.

Another issue that arises in the application of vocabularies in collection management systems is the fact that a particular vocabulary term needed by a cataloger may not be included in the Getty vocabularies. First of all, a museum may have objects in its collection that are not strictly classified as "art," and thus terms to describe them may be outside the scope of the Getty vocabularies. For example, medical terms such as "intravenous" or "typhoid" are outside the scope of the AAT, but may be important to indexing an object in a particular museum. The names of historical or mythical people, events, and iconography (stories represented in art works) are generally necessary for indexing art objects, but are outside the scope of the Getty vocabularies. Therefore, an implementor may want to provide access to ICONCLASS and other relevant sources of additional vocabulary.

Furthermore, even for concepts within the scope of the Getty vocabularies, some may be missing because these vocabularies are compilations of terms used by the contributors; they are not comprehensive (although they grow thanks to contributions). Therefore, implementors will probably need to provide users with a way of adding additional terms for their local use. Given that many users may want to then contribute pertinent terms to the Getty vocabularies, the implementors may wish to supply users with a way of extracting candidate terms and submitting them to the Getty.

Search assistants: The vocabularies are increasingly being used in search engines to gain access to materials that exist in different databases (or even in the same database) and may have been indexed using different terms to refer to the same concept. The first step in an implementation of a vocabulary as a search assistant is to allow the user to locate appropriate terms. Ideally, the information in the vocabularies would be accessible by multiple criteria, as is desirable for a browser or collection management system. Once the correct vocabulary record is located by this method, the next step is to utilize the most important elements of that record, especially the names and the related concepts, for application in a retrieval tool. For example, the TGN supplies the various names for Lisbon, Portugal, that can be gathered and used in a query for data about that city: "Lisboa," "Lisbon," "Lisbonne," "Lissabon," "Olissibona," "Ulixbone," "Luzbona," "Lixbuna," "Felicitas Julia," and "Olisipo." Likewise, the AAT provides variant forms of terms (e.g., "kylikes," "kylix" and "cylices"), and the ULAN provides names in various languages and name changes (e.g., for the painter Bartolomeo Bulgarini, the variants "Bartolomeo Bolgarini," "Bulgarini da Siena, Bartolommeo," "Lorenzetti, Ugolino," and "Master of the Ovile Madonna").
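As a rough sketch of this step, the names gathered from one vocabulary record can be tested against documents that were indexed under different names. The Lisbon names are those listed above; the dictionary layout, the sample document titles, and the simple substring matching are assumptions for illustration only.

```python
# Names for one place, as listed in the text; the dict layout is hypothetical.
VARIANTS = {
    "Lisbon": [
        "Lisboa", "Lisbon", "Lisbonne", "Lissabon", "Olissibona",
        "Ulixbone", "Luzbona", "Lixbuna", "Felicitas Julia", "Olisipo",
    ],
}


def find_documents(concept, documents):
    """Return documents that mention any known name for the concept."""
    names = [n.lower() for n in VARIANTS.get(concept, [concept])]
    return [d for d in documents if any(n in d.lower() for n in names)]


# Made-up titles, each indexed in a different language.
DOCUMENTS = [
    "Vue de Lisbonne, gravure",
    "Ansicht von Lissabon",
    "View of Porto",
]
```

With these samples, `find_documents("Lisbon", DOCUMENTS)` returns the French and German titles even though neither contains the English name.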

The vocabularies may also be used to suggest possible terminology to an end-user who may not know the name of a particular concept. The thesaural relationships of a vocabulary may be used to gather children, or to suggest parents, siblings, or related concepts. For example, a user interested in the wall paintings known as "frescoes" may also be interested in "sinopie," the drawings under the fresco; the AAT suggests this as a related concept. A user interested in Tuscany, Italy, may want to look for the names of any town in Tuscany; the TGN provides a list of these towns. A user interested in the architect Le Corbusier may also be interested in information about his teachers, and the ULAN suggests the name of Charles L'Eplattenier.
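The following sketch shows one way such suggestions could be derived from thesaural links. The hierarchy is deliberately simplified (intermediate levels are omitted), and the two dictionaries and all function names are assumptions rather than the structure of the Getty data.

```python
# Narrower concept -> broader concept (simplified; intermediate levels omitted).
BROADER = {
    "Firenze": "Tuscany",
    "Siena": "Tuscany",
    "Pisa": "Tuscany",
    "Tuscany": "Italy",
    "Italy": "Europe",
}

# Non-hierarchical ("related concept") links drawn from the examples above.
RELATED = {
    "frescoes": ["sinopie"],
    "Le Corbusier": ["L'Eplattenier, Charles"],
}


def children(concept):
    """Concepts directly beneath the given concept."""
    return sorted(c for c, p in BROADER.items() if p == concept)


def ancestors(concept):
    """Broader concepts, from the nearest parent up to the top of the hierarchy."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain


def siblings(concept):
    """Other concepts that share the same parent."""
    parent = BROADER.get(concept)
    if parent is None:
        return []
    return sorted(c for c, p in BROADER.items() if p == parent and c != concept)
```

A user starting from "Tuscany" could be offered `children("Tuscany")`; one starting from "frescoes" could be offered `RELATED["frescoes"]`.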

Application of the Getty vocabularies (or any vocabulary) in search assistants for art and cultural heritage databases is relatively new, but lessons can be learned from current examples. For instance, an early demonstration project was "a.k.a.," designed at the Getty to provide access across various Getty databases. The concept was refined and expanded in ARThur, in which access to databases is provided by vocabularies and by image comparison. These interfaces are currently used to provide access to various sets of databases, including those in "Faces of L.A." and "American Strategy."

"a.k.a." (standing for "Also Known As") experimented with strategies for gathering and manipulating terms from Getty vocabularies, and using the terms to broaden or narrow searches across databases on the Web. Issues included how to make the concept of using vocabulary in this way comprehensible to the non-expert end-user. Users were generally accustomed to simply thinking of a term, typing it, and launching a search on the Web; allowing them to first interact with a vocabulary in order to choose terms for a search was often confusing.

At certain points in the early "a.k.a." implementation, the broadening of searches by term variants was done automatically for the end-users, thus trying to satisfy those who do not want to bother with interacting with a vocabulary. In the example below, the user first chose which databases he wanted to query, and then could 1) use no vocabulary, 2) use vocabulary in a hidden or automatic way, or 3) query the vocabulary and choose the particular terms that he wished to apply to the search.

Figure 9: Screen from an early version of "a.k.a." (this version is no longer supported). User was provided with pull-down menus to choose 1) which databases to query and 2) which vocabulary (if any) to use to enhance the search.

When vocabularies are employed in an automatic way, issues include how many related terms to include in the query. Should only synonyms and language equivalents be included? Should "see also" references be included? Should narrower hierarchical concepts be included? While all of these related terms could potentially be useful when applied in a query, there is the danger that too many terms will be employed and thus make the search too time-consuming and the result set too large. In order to maintain acceptable retrieval speed, compromises were made as appropriate to the characteristics of each vocabulary's content: The AAT synonyms and related terms were included, but the "children" were not; the ULAN "project-preferred" synonyms were included, but the full list of variant names was not.

Many commercial Web browsers that use thesauri do so transparently, broadening searches without interaction by the user. While the interface may thus be simple and easy to use, the disadvantage, of course, is that the end-user is denied control of the search criteria. More accurate results are achieved in "a.k.a." when the user hand-picks which vocabulary terms to incorporate in the search.

The advantage is illustrated in the latest version of "a.k.a." In order to make the vocabulary component of the search easier for users, this new version includes fewer screens and easier instructions. Access to all three vocabularies is available on the same screen, whereas in the older version the user was taken to separate screens for each vocabulary.

Figure 10: "a.k.a." http://www.gii.getty.edu/aka (accessed January 15, 1999), revised implementation. User is asked to choose which databases to query, and may go to a vocabulary window.

The application of vocabularies is also more powerful in the newer version of "a.k.a." because the user can link terms from multiple vocabularies with Boolean operators, whereas in the earlier version the user could query with only one vocabulary at a time. Therefore, for example, a user interested in finding a particular type of ancient Egyptian funerary sculpture that is in Cairo can now combine synonyms for "Cairo" from the TGN with synonyms for "ushabti" from the AAT in this statement: ("Al-Qahirah" OR "El Qâhira" OR "Cairo" OR "Le Caire" OR "Kairo") AND ("ushabti" OR "shabti" OR "shawabti" OR "ushabtis" OR "ushabtiu"). Then these terms may be used to query across various databases.
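A Boolean statement of this shape can be assembled mechanically from the chosen term lists. The sketch below is an assumption about how that might be done, not the "a.k.a." code: `build_query` produces the query string, and `matches` applies the same logic to a piece of text using plain substring tests.

```python
CAIRO = ["Al-Qahirah", "El Qâhira", "Cairo", "Le Caire", "Kairo"]
USHABTI = ["ushabti", "shabti", "shawabti", "ushabtis", "ushabtiu"]


def build_query(*groups):
    """OR the terms within each group, then AND the groups together."""
    parts = ["(" + " OR ".join('"%s"' % term for term in group) + ")" for group in groups]
    return " AND ".join(parts)


def matches(text, *groups):
    """True if the text contains at least one term from every group."""
    lowered = text.lower()
    return all(any(term.lower() in lowered for term in group) for group in groups)
```

`build_query(CAIRO, USHABTI)` reproduces the statement quoted above. Because `matches` looks anywhere in the text, it accepts any page on which a term from each group happens to occur, whether or not the two are actually related.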

Figure 11: "a.k.a.", vocabulary screen. User is able to pick multiple terms from multiple vocabularies, linked by Boolean operators, and apply these terms to a query across several databases.

Continuing Problems with Retrieval. The outstanding issue that plagues the new "a.k.a." implementation is the same one that affects most searches on the Web: No matter how skillfully terms are gathered, when they are then used to retrieve indiscriminately from texts, results may be unexpected. For the query above, one could retrieve a "ushabti" in Boston because there is a mummy case from Cairo on the same Web page. Often the results are even more disappointing: for example, the word "painter" is part of many terms in all three vocabularies - an occupation can be "painter" in the AAT, an artist's name can be "Painter" in the ULAN, and towns in TGN may be named "Painter." If one searches using that term on a general Web browser, more than 500,000 pages are retrieved, and many have nothing to do with fine art; they include pages for a realtor named "Painter," discussions of environmental problems on Mount Painter, and advertisements for clip art in a product called "Painter."

The most obvious way to improve results is to use the vocabulary terms to search across databases, but limit the search to a single, common field in all the databases. This approach could be used in local or controlled environments, such as at the Canadian Heritage Information Network (CHIN), where object records from disparate databases are gathered and mapped to a common format. In a situation like this, where each museum may have existing, established standards and conventions, using the AAT - in the case of CHIN, supplemented with French-language equivalents - to help gain access to the material is clearly advantageous. Furthermore, if the data is in a controlled environment and fielded, more accurate results could be ensured by performing queries only on particular fields (e.g., querying on "painter" could be restricted to an artists' role field). In addition, results could be gathered across all fields, but sorted by individual or related fields (e.g., as at CHIN, where they are grouped under user-friendly labels such as "What, Who, When, Where, How").
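The difference between an unrestricted query and a fielded one can be seen in a small sketch. The three records, their field names, and both functions are hypothetical; they are not drawn from CHIN or any real data set.

```python
# Hypothetical object records already mapped to a common set of fields.
RECORDS = [
    {"id": 1, "title": "Self-Portrait", "artist_role": "painter",
     "description": "Oil on canvas."},
    {"id": 2, "title": "View of Mount Painter", "artist_role": "photographer",
     "description": "Landscape photograph."},
    {"id": 3, "title": "Studio Scene", "artist_role": "draftsman",
     "description": "A painter at work in a studio."},
]


def search(term, records, field=None):
    """Return ids of records containing the term, in one field or in any field."""
    term = term.lower()
    hits = []
    for rec in records:
        fields = [field] if field else [k for k in rec if k != "id"]
        if any(term in str(rec.get(f, "")).lower() for f in fields):
            hits.append(rec["id"])
    return hits


def group_by_field(term, records):
    """Search all fields, but report the hits grouped by the field that matched."""
    groups = {}
    for rec in records:
        for field, value in rec.items():
            if field != "id" and term.lower() in str(value).lower():
                groups.setdefault(field, []).append(rec["id"])
    return groups
```

Searching every field for "painter" returns all three records, while `search("painter", RECORDS, field="artist_role")` returns only the first. `group_by_field` keeps the broad result but tells the user where each match was found.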

Figure 12: Canadian Heritage Information Network (http://www.chin.gc.ca/Site_Index/e_site_index.html, accessed January 15, 1999). Users may consult the AAT to gather terms, which are then applied across several CHIN data sets, and the results are grouped into meaningful sets based upon the data field in which the terms were found.

Directing queries at specific information is, of course, more difficult when the data is not fielded. If more Web pages were tagged with metadata in meaningful ways, this could begin to allow queries to be more focused, even in an open environment. Many museums are beginning to experiment with Dublin Core; these metadata tags could allow vocabulary terms to be targeted at particular areas of a Web document. For example, the Dublin Core was recently used to index a "virtual" exhibition at the Getty, and plans are being developed to employ the Dublin Core in other online resources throughout the programs of the J. Paul Getty Trust.

In addition to enhancing access across disparate databases, the Getty vocabularies have also been helpful in providing access to information in narrowly focused databases, as in the interface to the Census of Antique Art and Architecture Known to the Renaissance. However, an issue arises in such implementations that is similar to one found in implementations for collection management systems: Much of the terminology in the vocabulary may be outside the scope of the database, and thus the terms would retrieve nothing. One way of dealing with this issue, used in the Census interface, is to offer a list of keywords gathered from the data itself. Since the Census database has a prescribed focus of art and architecture from classical antiquity and the Renaissance, and since terminology was controlled at the stage of data entry, this list of keywords is small enough not to overwhelm the user who wishes to browse through it. Another way of dealing with this issue would be to limit queries on the AAT to only relevant hierarchies, or to link the Census keyword concepts to corresponding AAT records.

Figure 13: Census of Antique Art and Architecture Known to the Renaissance interface, designed by Systems Planning (the Census database is only available online to authorized users; however, a demonstration of the interface is available at http://www.systemsplanning.com, accessed January 15, 1999). The interface allows users to query the AAT for terminology or to pick from a list of keywords gathered from the Census data.

A vocabulary can be useful in a narrowly focused database by retrieving concepts based on information other than the term itself. For example, in the Census implementation (which uses the vocabulary component developed for "a.k.a.") the user may search the AAT scope notes as well as the terms themselves. A user who wants to find information on Roman hot baths but does not know the correct term can enter "hot bath" and retrieve "caldaria" (and variants "caldarium," "calidaria," and "calidarium") because the scope note contains the keywords "hot AND baths" ("The vapor baths or hot plunges in Roman baths"); the appropriate terms can then be passed to the Census interface and used to search the database for records containing these criteria: "caldaria" OR "caldarium" OR "calidaria" OR "calidarium."
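Searching scope notes rather than terms can be sketched as follows. The "caldaria" note is the one quoted above; the second record's note is placeholder text, and the record layout and function name are assumptions made for the example.

```python
import re

AAT_SAMPLE = [
    {"terms": ["caldaria", "caldarium", "calidaria", "calidarium"],
     "scope_note": "The vapor baths or hot plunges in Roman baths"},
    # Placeholder note text, not the real AAT scope note for this term.
    {"terms": ["frescoes"],
     "scope_note": "Placeholder note text, for illustration only"},
]


def search_scope_notes(keywords, records):
    """Return the terms of every record whose scope note contains all keywords."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for rec in records:
        words = set(re.findall(r"\w+", rec["scope_note"].lower()))
        if wanted <= words:
            hits.extend(rec["terms"])
    return hits
```

`search_scope_notes(["hot", "baths"], AAT_SAMPLE)` returns the four "caldaria" variants, ready to be joined with OR. This sketch matches whole words only, so the singular "bath" would find nothing; a real interface would presumably need truncation or stemming to bridge that gap.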

Figure 14: The AAT as accessed from the Census interface. User may search AAT scope notes for "hot AND bath" and choose appropriate terms for querying the Census data, e.g., synonyms and variant parts of speech: "caldaria," "caldarium," "calidaria," and "calidarium."

In addition to all the issues discussed above, a number of other problems can surround an implementation of vocabularies on the Web. Among these is the fact that terms are less meaningful when taken out of context. For example, there are many homonyms in geographic information; thus, querying on a common name from TGN, such as "Paris," may retrieve too many results. If there were a way to add the broader context to the query, this could narrow the results (e.g., by adding one or more of the "parents" of Paris, which are Europe, France, and Île-de-France). However, even though TGN will supply the names of the parents of Paris, these parents' names are probably not included in the databases targeted for the query. Likewise, ULAN or AAT terms taken out of context can retrieve incorrect results. For example, in AAT, "Edo" is the name both of an African style and a Japanese style; "stretcher" is a furniture component, a masonry unit, and equipment for mounting and framing; these terms exist as separate concepts in different parts of the AAT hierarchies. In a local or controlled environment, the unique numeric identifier for a concept could provide a link between the material being queried and the vocabulary used to aid retrieval (for example, the seven-digit number 7008038 is the unique identifier of Paris, France in TGN). For instance, the identifier could be placed in the object record by a cataloger, and could thus be linked to the vocabulary and provide extremely accurate retrieval.
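A minimal sketch of identifier-based linking follows. The identifier 7008038 for Paris, France is the one given above; the second identifier is invented for the example, as are the object records and function names.

```python
# Vocabulary records keyed by unique identifier.
PLACES = {
    7008038: {"name": "Paris", "parents": ["Île-de-France", "France", "Europe"]},
    # Made-up identifier for a second place with the same name.
    9999999: {"name": "Paris", "parents": ["Texas", "United States"]},
}

# Hypothetical object records in which a cataloger stored the identifier.
OBJECTS = [
    {"title": "Street scene", "place_id": 7008038},
    {"title": "County fair photograph", "place_id": 9999999},
]


def objects_by_name(name):
    """Name-based retrieval: every place sharing the name is a match."""
    ids = {pid for pid, rec in PLACES.items() if rec["name"] == name}
    return [o["title"] for o in OBJECTS if o["place_id"] in ids]


def objects_by_id(place_id):
    """Identifier-based retrieval: only the intended concept is a match."""
    return [o["title"] for o in OBJECTS if o["place_id"] == place_id]
```

`objects_by_name("Paris")` returns both objects, whereas `objects_by_id(7008038)` returns only the one cataloged for Paris, France.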

Further potential issues with using the vocabularies in implementations can arise from the forms of the terms themselves. All of the usual problems of retrieval by words apply here, such as failure to retrieve based on differences in case, punctuation, diacritics, etc. These issues are overcome in the Getty vocabulary browsers by creating normalized strings for retrieval (e.g., all names are translated into lower case, and spaces, punctuation and diacritics are removed). Other issues are more specific to the vocabularies themselves. For example, the homographs in AAT are distinguished from each other by parenthetical qualifiers, such as "drum (column component)" and "drum (membranophone)." If these terms are taken verbatim and applied in a broad query on the Web, the results will be limited, because relatively few resources will include both "drum" and "membranophone" on a page discussing drums, and none is likely to have the exact term "drum (membranophone)," parentheses and all. Also, in TGN and ULAN names may be recorded in inverted order, but the resources being queried may list the name in natural order. For example, in ULAN the seventeenth-century architect's name is recorded as "Wren, Christopher" and "Wren, Sir Christopher"; however, a museum Web page is likely to list him as "Christopher Wren." A retrieval interface can start to solve this by looking for keywords "Wren AND Christopher" on the same page. However, one would also retrieve a page about non-native common wrens, Troglodytes parvulus, on the islands of St. Christopher-Nevis. In order to avoid the kind of imprecise retrieval that may result from that method, another solution has been applied to the ULAN: In the ULAN browser, algorithms were applied to parse the ULAN artists' names and globally create a new set of variant names in natural order, to be used for retrieval.
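The three term-form adjustments discussed above can be approximated as follows. These functions are naive stand-ins written for illustration, not the algorithms used in the Getty browsers; in particular, `natural_order` simply splits on the first comma and would mishandle names that need more careful parsing.

```python
import re
import unicodedata


def normalize(name):
    """Lower-case a name and drop spaces, punctuation, and diacritics."""
    decomposed = unicodedata.normalize("NFKD", name)
    kept = [ch for ch in decomposed
            if ch.isalnum() and not unicodedata.combining(ch)]
    return "".join(kept).lower()


def strip_qualifier(term):
    """Remove a trailing parenthetical qualifier from a term."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", term)


def natural_order(name):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    if "," not in name:
        return name
    last, rest = name.split(",", 1)
    return f"{rest.strip()} {last.strip()}"
```

With these helpers, "El Qâhira" and "el qahira" normalize to the same string, "drum (membranophone)" becomes "drum", and "Wren, Sir Christopher" becomes "Sir Christopher Wren". Note that stripping the qualifier discards exactly the information that distinguished the homographs, so it trades precision for recall.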

Conclusion

Although many issues must be considered in order to successfully integrate vocabularies in Web searching assistants and other implementations, vocabularies provide the terminology and hierarchical structures to allow users to retrieve information across disparate databases and in various languages. The Getty vocabularies can provide critical variant names and thesaural relationships that allow improved retrieval of art and cultural heritage information. It is clear that vocabularies are the key to navigating and retrieving meaningful results from the massive amount of largely inchoate information now potentially available in digital form.