Web Archives: What Can They Do?

Why Am I Writing about Web Archives?

My reason for writing about Web archiving is quite straightforward. I started an internship with a Web archiving project a week ago, and I have been reading the literature about it. Although I was once involved in a heritage digitization project, Web archiving is different from what I did before. The downside is that I have to start from the beginning. The upside is that it is exciting to learn something new and expand my knowledge of heritage studies. I think I am getting drawn into the Web archiving world, because the more I read and learn about this field, the more often I find myself saying WOW!

What Is Web Archiving?

Before answering that, we need to know the Web’s characteristics. According to Julien Masanès, the first characteristic is the Web’s cardinality, that is, how many instances of each piece of content exist; the second is the Web as an active publishing system; and the last is the Web as a global cultural artefact, with its hypermedia and open-publishing nature.[1] The Web’s content is vulnerable because producers, who control the publishing system, can edit it or take it down at any time. Publishing on the Web is possible from any place connected to the Internet. Most Web pages are short-lived, so archiving them before they vanish is one solution to this problem.

There are arguments against Web archiving, which Julien Masanès groups into three categories: those that question the quality of the Web’s content, those that consider the Web self-preserving, and those that assume archiving the Web is impossible. He argues that these objections do not hold and that Web archiving is a feasible way to preserve the cultural heritage and social memory of today. Because Web archiving differs from traditional archiving, new methods and approaches have been adopted for Web preservation. Web archives fit into the current Internet infrastructure and use the same protocols and standards for organizing information and providing access to it.[2] They provide a Web memory that is part of the Web itself and limit the negative impact of the short-lived nature of Web publishing.[3]

Three acquisition methods are used for Web archiving: client-side archiving, transaction archiving, and server-side archiving. Client-side archiving, the most common of the three, relies on crawlers to capture websites. How much the crawlers capture and how deep they go depend on the technology and the selection policy that each initiative uses.
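To make the idea of crawl depth and selection policy more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the link graph, the `crawl` function and the `same_site` policy are all hypothetical, and no real website is fetched. A production crawler does far more than this.

```python
from collections import deque

# Hypothetical link graph standing in for a live website: page -> pages it links to.
LINKS = {
    "example.org/": ["example.org/about", "example.org/news", "other.net/"],
    "example.org/about": ["example.org/team"],
    "example.org/news": ["example.org/news/2017"],
    "example.org/team": [],
    "example.org/news/2017": ["example.org/news/2017/week"],
    "example.org/news/2017/week": [],
    "other.net/": [],
}


def crawl(seed, max_depth, in_scope):
    """Breadth-first crawl from `seed`, following links at most `max_depth` hops."""
    captured = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        captured.append(page)  # a real crawler would fetch and store the page here
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in LINKS.get(page, []):
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return captured


def same_site(url):
    # Hypothetical selection policy: only capture pages on example.org.
    return url.startswith("example.org/")


shallow = crawl("example.org/", 1, same_site)
deep = crawl("example.org/", 3, same_site)
print(len(shallow), "pages captured at depth 1")
print(len(deep), "pages captured at depth 3")
```

With the same seed, the shallow crawl misses the deeper news pages, and the policy keeps the crawler away from the external site altogether. Both choices shape what ends up in the archive.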

The Usage of Web Archives: What Can They Do?

Using Web archives implies a module built on an existing Web archive that allows access in much the same way as we access the current Web, with the additional dimension of time.[4] There are tools and techniques that support access to Web archives and the analysis of their content. The most popular way to access Web archives is the Internet Archive’s Wayback Machine. Additional methods for access and analysis add value to a Web archive. Extracting information from Web archives is called Web mining. The information extracted from a Web collection includes Web pages, metadata, usage data, and infrastructure data. This kind of information can be used to analyse the Web and inform its future development. It can also be used to research community networks and to plan network infrastructure. The list of possible uses goes on.
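The "additional dimension of time" is easy to see in the way the Wayback Machine is addressed: the address of an archived page contains both the original URL and a timestamp. Below is a small sketch of that pattern. The helper name `wayback_url` is my own, and the sketch only builds the address without contacting the service. As far as I understand, the Wayback Machine then shows the capture closest to the requested moment.

```python
from datetime import datetime


def wayback_url(url, when):
    """Build a Wayback Machine replay address for `url` near the moment `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"https://web.archive.org/web/{timestamp}/{url}"


# example.org is a placeholder site, and the date is arbitrary.
link = wayback_url("http://example.org/", datetime(2017, 6, 14, 12, 0, 0))
print(link)
```

Changing only the date in the call gives a different point in the history of the same page, which is what separates browsing an archive from browsing the live Web.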

Recently, the International Internet Preservation Consortium (IIPC) published a blog post titled “Archives Unleashed at the British Library: Study of gender distribution in National Olympic Committees”, written by the IIPC programme and communications officer. It is one example of what we can do with Web archives. The post records in detail the whole process of analysing Web archives from start to finish. It documents not only the project’s selection process but also its analytical process. Documentation is necessary for Web archives. As time goes by, the original context of a Web archive is lost, and without documentation future researchers will have no clue why it was archived. This blog post is helpful for seeing how to develop a Web collection.

The project was carried out by the IIPC Content Development Group in 2017 for Web Archiving Week. They chose to focus on the Web collection of National Olympic & Paralympic Committees, with the research question “What is the gender distribution of National Olympic Committees?” It is worth noting that they changed their research question to “What is the gender distribution of English speaking national Olympic Committees?” because a number of the downloaded files were corrupted and could not be processed by their analysis tools. To answer this question, they used Warcbase, Stanford Named Entity Recognizer (NER), and OpenGenderTracking to analyse the information they extracted from the Web collection. They presented their end result with graphs and proposed alternative research questions that show other ways of mining Web collections.
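To give a rough feeling for the last step of such an analysis, here is a toy sketch in Python. It is not the project's code and does not use Warcbase, Stanford NER or OpenGenderTracking. It assumes that person names have already been extracted, and it replaces the gender tool with a tiny lookup table. Every name and every table entry below is made up for illustration.

```python
from collections import Counter

# Hypothetical first-name lookup; real tools rely on much larger name datasets.
NAME_GENDER = {"maria": "female", "anna": "female", "john": "male", "peter": "male"}

# Hypothetical person names, as an entity recogniser might return them.
people = ["Maria Silva", "John Smith", "Peter Jones", "Anna Lee", "Sam Taylor"]


def guess_gender(full_name):
    """Look up the first name; anything not in the table counts as 'unknown'."""
    first = full_name.split()[0].lower()
    return NAME_GENDER.get(first, "unknown")


distribution = Counter(guess_gender(name) for name in people)
for gender, count in distribution.items():
    print(gender, count)
```

Even this toy version shows why the "unknown" group matters: names that the lookup cannot classify have to be reported, not silently dropped, or the distribution will look more certain than it is.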

Not the End Yet…

Although this is the end of this blog post, what I have learned about Web archiving so far is that Web archives have great potential as a resource for studying our society.

[1] Julien Masanès, “Web Archiving: Issues and Methods,” in Web Archiving, ed. Julien Masanès (Berlin: Springer, 2006), 1-53.

[2] Ibidem.

[3] Ibidem.

[4] Andreas Aschenbrenner and Andreas Rauber, “Mining Web Collections,” in Web Archiving, ed. Julien Masanès (Berlin: Springer, 2006), 153-176.

By: kittylin