Learning From Links: Content And Link Network Analysis
Fabio C. Gouveia, Fundação Oswaldo Cruz (Fiocruz); and Eleonora Kurtenbach, Universidade Federal do Rio de Janeiro (UFRJ), Brazil
One of today's most widely used search engines, Google, is primarily based on the PageRank methodology described by Brin and Page (1998). As part of this methodology of indexing pages, there are some features that can be of great help to those looking to evaluate the impact of a Web site on the Internet. The network of links pointing to a particular URL (its link network) can be obtained with a simple search query. In this paper, we present a methodology for analysing the link networks and content of Web sites.
First of all, we must consider what it means to evaluate a Web site based on the results of a search engine. It must be clear that the absolute number of links a Web site has should not be compared directly. A more qualitative analysis reveals the network of relationships the Web site has developed by means of its communication strategies and the quality of the information it offers.
Using one of the features included in Google, the list of links to a Web site can be retrieved and analysed. Structural pages and links from the same institution should be discarded from the list, and the remaining links separated and analysed. With this valuable data ready, a qualitative examination can be conducted in order to understand the relationship of the Web site to its virtual community. A general evaluation of the kind of information the Web site offers must also be conducted, in order to compare it with the kind of links the Web site attracts. The content and the link network of the Web site can then be analysed together: the content represents the main objectives of the Web site, while the link network shows the result of the museum's approach to the Web.
Keywords: Search Engine, Information Retrieval, Link Network, Museums, World Wide Web
One of the first things we teach a new Internet user is how to search for and find specific information. It is already known that over 85% of Internet traffic is driven, directly or indirectly, by sites dedicated to providing links in response to search queries. Each of these sites uses a methodology of indexation, and we can divide them into two categories: directories and search engines (Hu et al., 2001).
Directories are built on a database containing sites evaluated by users. Each Web site analysed is placed in a category according to evaluation criteria established by the directory. The authority that evaluates the Web site is thus its users, following a system of guidelines. The results depend on the directory's rules and system of indexation.
Search engines create their databases automatically, using programs called “spiders”. The result is a much larger database than the one obtained by directories. The resulting database must then be indexed using some algorithm to sort the search results. One of the first methodologies used by search engines to better sort results was the use of metadata. Metadata are special pieces of information that the content producer places in hidden text, containing data such as keywords and a description of the page content. These metadata are then given priority when the pages' content is indexed. With this approach, the authority that defines the relevance of the site is the content producer. On the other hand, the author of the content must include those pieces of data if he wants the pages to be correctly indexed and easily found. The responsibility for the efficiency of the indexation process is then in the hands of the experts.
Bowen (1999) stresses the importance of including metadata in the pages developed on the sites, as well as registering this material in directories and search engines. Despite this, several museum Web sites do not use metadata, nor do they register these items in the search sites, ignoring these resources and their importance.
More recently, some search engines have started to use a new concept of classification, establishing a new paradigm for Internet search. This methodology takes into account the link network of a particular page in order to rank it in the search results. With this methodology, a new displacement of the authority that defines the indexation of a page occurs. The legitimacy of the content is established by the pages that refer to it, not by an arbitrarily established authority (directories) or by the content producer's knowledge of metadata insertion (the first search engines). A page that is frequently linked to thus has greater weight and is more easily found than a page that other pages link to less frequently.
PageRank and Google
A couple of years after the publication in 1998 of a paper by Sergey Brin and Lawrence Page, two graduate students at Stanford University, their search engine had become by far the most widely used worldwide. They named it Google, a common spelling of googol, or 10^100, representing the enormous amount of data the system should be able to deal with. One of the facts that counted most in its success was the superior quality of its results compared to those of other search engines at the time. Although there may have been changes to the algorithm since, PageRank is still at its core.
According to Brin & Page (1998), the PageRank system helps bring order to search results, based on the idea that a page should have a higher PageRank if many pages point to it, or if some high-PageRank pages point to it.
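The original paper expresses this idea with the following formula, where page A has inbound links from pages T1…Tn, C(T) is the number of links going out of page T, and d is a damping factor (Brin & Page suggest d = 0.85):

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```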
Google (http://www.google.com) is also a great tool for finding images related to a particular term or idea, as it includes the system presented by one of the first search engines, the World Wide Web Worm, developed by Oliver McBryan in 1994. It also uses differences in the font size and style of a text, giving higher weight to certain words found on a page. One of the objectives of implementing this type of algorithm is to prevent the use of exploits to change the order of results retrieved during a search.
For a brief explanation of the PageRank algorithm, consider it as a model of user behaviour. If a hypothetical surfer clicks randomly on links without regard to page content, there is a certain probability that he will visit a given page. The algorithm also considers that the surfer will, from time to time, jump to a random page in the database. With this approach, the link popularity of a page is taken into account: the probability is derived from the popularity of the pages linking to it and the number of outgoing links each of those pages has.
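This random-surfer model can be sketched as a simple power iteration. The four-page graph and the page names below are illustrative assumptions, not Google's data; the damping factor of 0.85 is the value suggested by Brin & Page (1998):

```python
# A minimal sketch of the random-surfer model behind PageRank,
# computed by power iteration. The graph and damping value are
# illustrative assumptions.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}                 # start uniformly
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}   # random jump
        for page, outlinks in links.items():
            if not outlinks:                           # dangling page:
                for p in pages:                        # spread evenly
                    new_rank[p] += d * rank[page] / n
            else:
                share = d * rank[page] / len(outlinks)
                for target in outlinks:                # follow a link
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web in which most pages link to "museum"
web = {
    "museum": ["exhibits"],
    "exhibits": ["museum"],
    "guide": ["museum"],
    "news": ["museum", "exhibits"],
}
ranks = pagerank(web)
```

In this toy graph, "museum" ends up with the highest rank because three of the four pages link to it, which is exactly the intuition the algorithm formalizes.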
Google's PageRank algorithm is thus based on the fact that the more a site is referenced by other sites, the greater its relative relevance. Also, if a site has high relevance, each of its links pointing to other sites will carry greater weight. These ideas allow a user to get more accurate search results than on other search sites.
Searching for Links
The vast database that Google maintains can also be queried using some special search tags. One of them is the “link:” tag, which gives the list of pages in the database that point to a particular universal resource locator (URL, or Web address). This feature can be of great help to those looking to evaluate the impact of their Web site on the Internet. The resulting list is the link network of that particular URL.
In order to retrieve that network, it is necessary to do a search on Google using the tag “link:” followed by the address of the Web site. The first piece of information obtained is the number of pages that link to that URL; this figure should be treated with caution and never used as a final number. The list of pages pointing to the URL should then be consolidated; for this it is better to change the Google preferences to show the maximum number of results per page (100 results).
The number of links that point to your Web site, as an absolute value, is just an initial result. For instance, on Web sites of science museums around the world, these results can vary from fewer than 100 to more than 8,000 pages. You may also find on this list more than one page that is part of the structure of your own site, or that belongs to the department or institution your museum is part of. These pages should be removed later from the list.
To obtain the list of links pointing to a particular URL, the steps are as follows:
1. Open the Google preferences page and set the number of results per page to the maximum (100 results).
2. Search for “link:” followed by the address of the Web site.
3. Note the total number of pages reported, treating it only as a first approximation.
4. Copy the results from each page into a single consolidated list.
5. Repeat the search for any other domains or addresses the Web site uses.
After that, it is time to make your first evaluation of the link network of your Web site. To do that you must know thoroughly the content that is being offered on-line and the objectives of the creation and promotion of the Web site.
Link Network Analysis
According to Bowen (2000), the most important reason that museums should have a site on the Internet is to construct a set of virtual visitors. This set must have the same importance as the set of real visitors. Considering that, an analysis of the link network of a Web site could give some information about this virtual community and the kind of relationships the Web site is constructing.
It is very important to take into account the quality of the links the site is receiving. A more qualitative analysis reveals the network of relationships the Web site has developed by means of its communication strategies and the quality of the information it offers.
A panorama of the link network can be drawn to evaluate whether it was created by institutional communication processes (cross-references of content from similar institutions and catalogues) or by experiences of visiting the site.
After a look at the list of links, we must separate and count the structural pages and the links from the same institution. Web sites with fewer than 100 links may find that the majority are of this type. Next, we must separate and count the links from news items and other temporary pages that will soon expire. Normally these relate to recent activities announced on the museum Web site, or recent comments about museum events. This number can be a good barometer of the museum's ability to communicate new activities, but it must not be considered a network of long-term relationships with the Web site. You can check the Google results on a regular basis to track the news published about a particular Web site.
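This triage of the link list can be sketched as a small script that separates internal/structural pages, likely-temporary news pages, and the remaining external links. The domain names and URL patterns below are hypothetical examples; they would have to be adapted to your own institution and language:

```python
# A sketch of the link-list triage described above. OWN_DOMAINS and
# NEWS_HINTS are illustrative assumptions, not a definitive rule set.

from urllib.parse import urlparse

OWN_DOMAINS = {"museum.example.org", "university.example.org"}  # hypothetical
NEWS_HINTS = ("news", "press", "events", "calendar")            # hypothetical

def classify(url):
    parts = urlparse(url)
    host = parts.netloc.lower()
    # Same institution: structural link, to be discarded from the network
    if any(host == d or host.endswith("." + d) for d in OWN_DOMAINS):
        return "structural"
    # Temporary news/event pages: counted separately as a barometer
    path = parts.path.lower()
    if any(hint in host or hint in path for hint in NEWS_HINTS):
        return "news"
    # Everything else: candidate long-term relationship
    return "network"

links = [
    "http://museum.example.org/about.html",
    "http://daily-news.example.com/press/item123.html",
    "http://teachers-guide.example.net/resources.html",
]
groups = {}
for url in links:
    groups.setdefault(classify(url), []).append(url)
```

The "network" group is the list that then goes forward to the qualitative examination.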
With the resulting list we can then perform a qualitative examination in order to understand the relationship of the Web site to its virtual community. Keep in mind that you must cross-reference this with a general evaluation of the kind of information the Web site offers. For instance, museum Web sites that have pages dedicated to educational activities should expect to appear on guide sites for teachers. The content and the link network of the Web site should be analysed together, considering that the content represents the main objectives of the Web site, while the link network shows the result of the museum's approach to the Web.
In order to have some basis for that evaluation, consider comparing the results of other Web sites that serve similar communities. Museums of the same kind, with similar communities, can be expected to have similar networks in terms of complexity and quality. The larger the museum, the larger the expected network. Different languages represent different communities, and nearly the same can be said of different countries. For instance, a Brazilian Web site with Portuguese content will be largely limited to its region. Content in English is more capable of producing a greater network, since English pages are the majority of those on-line. The results for a Web site will depend on its country, language and area of activity.
Remember that search engines can't index all the pages on the Web. Some pages are not available to the search engines, for several reasons, including metadata that ask the spider not to index a particular page. It is important to search using all the different domains the Web site uses. For instance, a Web site that has both a .museum domain and a .com domain should be searched under both. Nevertheless, you should consider that a special feature of your site may be bringing people to it directly. This is more common when you have virtual exhibitions or special pages dedicated to particular visitors. Educational sites may reference an educator's area directly, so you have to consider searching using these URLs as well.
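The set of searches suggested here (one per domain and one per special entry page) can be generated mechanically. The addresses below are hypothetical, and the `num`/`start` URL parameters reflect Google's result-page interface at the time of writing, so treat them as assumptions:

```python
# A sketch that builds the "link:" searches described above, plus the
# result-page URLs needed to consolidate a long list (100 per page).
# All addresses are hypothetical; num/start are assumed Google params.

from urllib.parse import quote

def link_queries(addresses):
    """One "link:" query per address, as plain search strings."""
    return ["link:" + addr for addr in addresses]

def result_pages(query, total, per_page=100):
    """URLs of the successive result pages to copy and consolidate."""
    base = "http://www.google.com/search?q=" + quote(query)
    return [base + "&num=%d&start=%d" % (per_page, start)
            for start in range(0, total, per_page)]

addresses = [
    "www.example.museum",            # .museum domain
    "www.example.com",               # .com domain of the same site
    "www.example.museum/teachers",   # educator's area linked directly
]
queries = link_queries(addresses)
```

For a site reporting, say, 250 inbound links, three result pages of 100 would have to be collected and merged.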
Another interesting check is whether there are pages that still reference your site by an old address. This will only work if the old page does not carry a redirect and the link is not broken; otherwise Google consolidates the two pages into one result, or ignores the page.
Brin & Page (1998) state that Google is not just a search engine. It is also a research tool, with its data being already collected and used by researchers in a wide range of applications.
Search engine optimization is quite important, and can bring more visits and help build a better network of links to the site. It is also important to make your Web site friendly to the search engines in order to obtain better results. Among the characteristics your Web site should have are content available as text that the search spiders can read and, last, accurate metadata on its pages.
On the last aspect, Brin & Page (1998) state that metadata efforts have largely failed within Web search engines, referring to their usage as the core data for ranking search results. For the metadata initiative to have worked, content producers would have needed to use metadata consistently, and all of it would have needed to be honest. On the other hand, we must stress that we should keep accurate metadata on our pages, as they can help people find the site, even if this effort does not guarantee that the page will be better placed in the search results. We suggest the use of a content management system (CMS) to address this issue with little effort for the content producer, and with a well-indexed Web site as a result. Honeysett (2000) presents an analysis of the use of CMS tools in the creation of a museum Web site.
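As an illustration of the CMS approach, a template function could emit an accurate title and metadata for every page from a structured record, so the content producer never edits hidden text by hand. The record fields and example values below are hypothetical:

```python
# A sketch of a CMS template emitting per-page metadata from a record.
# The "page" fields and values are hypothetical examples.

from html import escape

def meta_tags(page):
    """Render the <head> fragment for one page record."""
    return "\n".join([
        "<title>%s</title>" % escape(page["title"]),
        '<meta name="description" content="%s">'
            % escape(page["description"], quote=True),
        '<meta name="keywords" content="%s">'
            % escape(", ".join(page["keywords"]), quote=True),
    ])

page = {
    "title": "Dinosaur Hall - Example Museum",
    "description": "Guided tour of the dinosaur fossil collection.",
    "keywords": ["dinosaurs", "fossils", "museum"],
}
tags = meta_tags(page)
```

Because the metadata come from the same record that produces the visible content, they stay honest and consistent, which is precisely the failure mode Brin & Page identified in hand-written metadata.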
A good strategy for keeping track of the new links that have been made to your Web site is to use the Google Alert services (http://www.googlealert.com). It is a free service and is very easy to use.
Thelwall (2003) finds that the PageRank information retrieval algorithm alone is not capable of finding the most important Web page in a particular Web site. Sullivan (2004) reports a case of Google bombing on Search Engine Watch (http://www.searchenginewatch.com), criticizing this search engine. We note that both analyses deal with the search results for a particular search string. In the methodology presented here, what is extracted is the network of links pointing to a particular URL, so the situations reported cannot compromise it.
We think that Google, or any future search engine with similar capabilities, should be part of the routine of a Web master in order to keep the Web site in good shape.
Brin, S., & L. Page (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30, 107-117.
Bowen, J. (1999). Time for Renovations: A Survey of Museum Web Sites. In Museums and the Web 1999. Ed. D. Bearman & J. Trant, pp. 163-172. Pittsburgh: Archives & Museum Informatics.
Bowen, J. (2000). The virtual museum. Museum International 205, v.51, n.1. Paris: UNESCO.
Honeysett, N. (2000). Content Management for a Content-Rich Web Site. In: Museums and The Web 2000. Ed. D. Bearman & J. Trant. Pittsburgh, USA. Consulted February 01, 2002. http://www.archimuse.com/mw2002/papers/honeysett/honeysett.html
Hu, W.C., Y. Chen, M.S. Schmalz, & G.X. Ritter (2001). An Overview of the World Wide Web Search Technologies. In Proceedings of the 5th World Multiconference on Systemics, Cybernetics and Informatics, SCI 2001, Orlando, Florida, July 22-25, 2001.
Thelwall, M. (2003). Can Google's PageRank be used to find the most important academic Web pages? Journal of Documentation 59, 205-217.
Sullivan, D. (2004). Google's (and Inktomi's) Miserable Failure. Published January 6, 2004. Consulted January 12, 2004. http://www.searchenginewatch.com/sereport/article.php/3296101