Web Archiving: Capturing the Human Experience

Assigning Value recently spoke with Trevor Alvord, Curator of 21st-century Mormonism and Western Americana at Brigham Young University. Trevor was tasked with creating BYU’s web archive and has been growing it since his arrival in 2012. He also chairs SAA’s Web Archiving Roundtable Steering Committee.

In this interview, Trevor uses a number of web archiving terms that may be unfamiliar to some readers. The good people at the web archiving service Archive-It have created a glossary that should help clarify them. For example, when Trevor mentions crawling, he is talking about the use of specialized software to capture web content for future access and use.


Why should archivists consider websites when selecting documentation?

Trevor Alvord: Try not to think of web archiving as so static that only websites or blogs should be captured. Nowadays many web archivists are devoting an equal amount of time to social media. By the end of 2014, there were an estimated 3 billion people on earth using the Internet on a daily basis, or about 40% of the world’s population. Ten years ago, it was only 1 billion. In the United States that percentage increases dramatically to about 87%, or 270 million people. This means (especially in America) that it is more likely than not that the individual or event any archive is seeking to document has some type of presence online. I believe that an archive or collection derives its greatest value from its completeness; thus, web archiving should be a natural extension of any archivist’s tool belt as they attempt to piece together a holistic puzzle.

What are some unique considerations we should remember when appraising web content versus other formats?

TA: I like to say that web archiving is preserving the future of history, meaning that archivists, when web archiving, often have to play fortune-teller when appraising a site. We do not have the luxury of looking back into history to appraise the historic value of a site or blog post. The appraisal has to happen in real time, with archivists using their best judgment as they assess value. Also, the popular saying, “everything posted online is there forever,” is not necessarily true. According to the Internet Archive, which has been in the web preservation game for almost twenty years, the average life span of a webpage is about 44 days; just 44 days before that specific URL rots into oblivion. The Internet is highly ephemeral, in part due to the constant push for new and updated information. Think about your own Internet usage: how quickly have you stopped visiting a blog or website that is not updated on some type of regular schedule? Eventually, you and probably everyone else cleared that site from your feed. Kodak Gallery, a popular online photo storage and sharing site in the 2000s, shut down in 2012. Some users, generally those outside the United States, lost all of their photos in the shutdown. This is more common than one would like to think. We are now at a point where many are predicting the fall of the Facebook Empire. If that does happen in five or ten years, what will happen to all of those videos, images, grammatically correct posts, or, more importantly, the social movements that have been spawned on Facebook, and the events and history that it unwittingly documents every day? Certainly there are aspects worth saving, but there is no guarantee that Facebook will do it. We need to act fast, but not so fast that we frantically gather everything. This is where appraising value will really help us.

What are some examples of appraisal criteria used when appraising websites?

TA: Honestly, the biggest criterion is how well a site fits your collecting scope. I run into a problem at BYU because I have about 30,000 blogs so far that fit my collecting scope, and not enough space to crawl them all. That could mean that our collecting guidelines at BYU are too ambiguous, but we have an entire religious movement to document. I found several aggregator sites of Mormon-themed blogs and started crawling all of the blogs listed. Sites such as http://www.ldsblogs.org/ have certain requirements for being listed, such as frequent postings and thoughtful content. To me, this was as good as a recommendation. Additionally, I often look for “hidden” recommendations: if a posting gets a lot of circulation on social media, or is mentioned in a local or national publication (e.g., a newspaper or magazine), I will crawl that site. I also explore whom these bloggers are discussing, linking to in posts, or listing on their blogrolls. Slowly, I have been able to expand my selection based on this type of appraisal technique.

How can archivists implement some of these criteria, especially if they have limited technology skills and/or support?

TA: The technology skills are not as critical as one might think. Those skills or knowledge will always help, but web archiving can be accomplished by anyone. The Internet Archive’s web archiving service, Archive-It, is designed to fit archives of all types and budgets, from small Lone Arranger shops to large repositories. There is no longer any excuse for archivists to ignore web archiving.

Are there some guidelines available to help us decide how frequently to capture the same websites?

TA: To some extent, web archiving is the ‘Wild, Wild West.’ I try to base crawl frequency on how often a site is updated. For example, I run a daily crawl on a Google News feed for Mormonism because the keyword “Mormon” picks up articles frequently. Using the same feed, I crawl the keywords “The Church of Jesus Christ of Latter-day Saints” only once a week because it returns fewer results. The same goes for the BYU.edu home page and the BYU news service called Y News. They are the most frequently crawled of all the BYU.edu sites because they change most often.

Traditionally, BYU has collected LDS missionary history heavily. The typical length of missionary service for a Mormon young adult is between 18 and 24 months. This is an area I wanted to document well through the web archive, and it is now one of our strongest collections. If I find a blog or site documenting an individual’s mission experience after that 24-month period of time, I will crawl it once. However, if the individual’s blog is found before they have served that full 24 months, then the blog will be slated for an annual crawl and reviewed in one year.
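The missionary-blog rule described above could be sketched in Python roughly as follows. This is only an illustration, not BYU's actual workflow: the function names, the labels "one-time" and "annual", and the assumption that the mission start date is known when the blog is found are all hypothetical.

```python
from datetime import date

# Upper bound on typical missionary service, as stated in the interview.
MISSION_MONTHS = 24


def whole_months_between(start: date, end: date) -> int:
    """Count complete calendar months from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def mission_blog_schedule(mission_start: date, found_on: date) -> str:
    """Hypothetical helper: pick a crawl schedule for a mission blog.

    'one-time' if the blog is found after the 24-month window has passed,
    'annual' (to be reviewed in a year) if the mission may still be under way.
    """
    if whole_months_between(mission_start, found_on) >= MISSION_MONTHS:
        return "one-time"
    return "annual"
```

For instance, under these assumptions a blog found nine months into a mission would be slated for an annual crawl, while one found 25 months after the start date would be crawled once.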

The Mormon blogosphere is often affectionately and humorously referred to as the Bloggernacle. The term is a reference to the famous Salt Lake City Tabernacle built in 1867. Most sites within the Bloggernacle are peer contributed—multiple authors posting weekly or daily. Sites such as http://timesandseasons.org/, http://www.juvenileinstructor.org/, and http://www.motleyvision.org/ are culturally significant, so I crawl them more frequently.

What other web resources can help archivists develop best practices in web archiving?

TA: One of the best web resources for web archiving is the weekly web archiving roundup put out by the Web Archiving Roundtable. It covers nearly everything in the world of web archiving. In fact, Cliff Hight (interviewer and A & A section member) is one of the main, talented archivists behind pooling these great references together on a weekly basis. I cannot emphasize enough the value of this site as a helpful resource.

You have developed a suggest-a-site page at BYU. How has this option helped build your holdings?

TA: Unfortunately, the page has not been used as much as I would like. During the fall, I plan to make a push to get it linked on as many Bloggernacle sites as I can, and hopefully use it to make inroads with several faculty groups at BYU.

Do you have any final words you would like to share about archiving web content?

TA: The Internet has become so much more than a way of passing information from one source to another. The documentation of human existence, our humanity, good or bad, is unfolding online. The raw information of the Internet is wrapped with culture, expression, philosophy, and life, as documented in our Facebook posts, YouTube videos, and Flickr and Instagram photos. We live in a time of great opportunity for knowledge and culture, and it could all disappear in a moment. Like the example of Kodak shutting down its online image repository, imagine if YouTube shut down tomorrow. Some other platform would likely rise to take its place, but how many videos would be lost due to a single corporate decision? Yes, we would lose thousands of hours of cat videos, but also countless, priceless moments capturing births, historic events, and even individuals’ final thoughts and experiences. Documenting and preserving content on the Internet is a fundamental need and must not be dismissed as trivial or insignificant.

Traditionally, archives have been in the business of collecting on the human experience. The Internet has not changed that mission; it has simply shifted the medium through which that experience is recorded. Journals and diaries are now blogs, photo albums have become Flickr and Facebook pages, and websites are the new documentary manuscripts. Web archiving is critical and valuable, not for the traditional or ephemeral nature of the content being sought for preservation, nor for the scope or depth of the material, but because the everyday human conversation is available to the archivist for the first time in the history of archival practice. It is unbelievable to think what new scholarship will emerge or how history will be shaped by this new medium. Imagine being able to examine the general reaction of the Romans to the assassination of Caesar, or to contrast the opinions of 15th-century Spanish nobility with those of the peasant class in reaction to the discovery of a new world.

In the same way, the Internet has opened up a seemingly limitless amount of knowledge to the masses, while also giving a voice to the masses. For better or for worse, this voice exists, and while not every YouTube comment has redeeming historical value, the idea that history will no longer be controlled by a limited few is a remarkable reality. Web archiving will help make this possible. However, no single entity can capture everything. Hopefully, we at BYU are doing our part by capturing what we have historically documented so well: Mormonism. Preserving the Internet should be the charge of every library, archives, and cultural heritage institution. Together, little by little (or more like byte by byte), we can all contribute to preserving the greatest source of human knowledge and experience that has ever existed. If you or your institution has not yet begun a web archiving program, I encourage you to do so, preferably within the next 44 days.


If you would like more information, please feel free to contact anyone on the Web Archiving Roundtable Steering Committee. We would love to help you.

Happy crawling!


Cliff Hight is an at-large member of the Acquisitions and Appraisal Section Steering Committee, participates on the Web Archiving Roundtable Best Practices/Toolbox Committee, and is a member of the Dictionary Working Group. He is the university archivist at Kansas State University and holds an MSIS in archives and records administration and an MA in history from the University at Albany, State University of New York.
