heritrix

What is Heritrix?

Heritrix is an open-source, web-scale, archival-quality web crawler developed and maintained by the Internet Archive. The name “Heritrix” (sometimes misspelled as heratrix, heritix, heretix, or heratix) is an archaic word for “heiress,” chosen because the crawler aims to collect and preserve digital artifacts of our culture for future researchers and generations. As the Internet Archive’s primary web crawling tool, Heritrix systematically browses the World Wide Web to create comprehensive snapshots of web content that might otherwise be lost to time.

The crawler was first developed in the early 2000s and has gone through multiple iterations, with the most current version being Heritrix3. It identifies itself in server logs with user agent strings following patterns like Mozilla/5.0 (compatible; heritrix/3.1.1 +http://archive.org) or simply heritrix/[VERSION] (+[OPERATOR_URL]). This identification allows website administrators to recognize the crawler and contact its operators if needed.
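As a rough illustration of spotting these strings in access logs, the Python sketch below pulls the version and operator URL out of a user agent value. The function name parse_heritrix_ua and the regular expression are assumptions made for this example and are not part of Heritrix. Operators can customize the string, and any client can imitate it, so a match is a hint and not proof of identity.

```python
import re

# Assumed pattern: "heritrix/<version>", optionally followed by "+<operator URL>",
# with or without parentheses around the URL.
_HERITRIX_RE = re.compile(
    r"heritrix/(?P<version>[\w.\-]+)(?:\s*\(?\+(?P<url>[^)\s]+)\)?)?",
    re.IGNORECASE,
)


def parse_heritrix_ua(user_agent):
    """Return (version, operator_url) if the string looks like Heritrix, else None.

    operator_url is None when the user agent carries no "+URL" part.
    """
    match = _HERITRIX_RE.search(user_agent)
    if match is None:
        return None
    return match.group("version"), match.group("url")
```

Calling parse_heritrix_ua on the first pattern quoted above would yield the version "3.1.1" and the operator URL "http://archive.org", while a user agent that does not mention heritrix returns None.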

Heritrix is designed with a focus on completeness and preservation rather than real-time indexing. It uses a sophisticated crawling architecture that respects robots.txt directives and maintains politeness policies to minimize server load. The crawler stores collected content in Web ARChive (WARC) files, which preserve not just the content but also HTTP headers and metadata essential for accurate digital archiving. These archives ultimately power services like the Wayback Machine, which allows users to view historical versions of websites.

Why is Heritrix crawling my site?

Heritrix crawls websites primarily to preserve their content for historical record. If you’re seeing Heritrix in your logs, your site is likely being archived as part of the Internet Archive’s mission to create a digital library of Internet sites and cultural artifacts in digital form. This is a non-commercial, preservation-focused activity.

The crawler typically attempts to capture entire websites, including text, images, JavaScript, CSS, and other elements that make up the complete user experience. It may visit your site periodically, with frequency depending on your site’s visibility, how often it changes, and its perceived cultural or historical significance. Some sites might be crawled daily (news outlets, government sites), while others might be visited monthly or quarterly.

Heritrix crawling is generally considered authorized under fair use principles for archival purposes. The Internet Archive responds to requests for content removal when appropriate, balancing preservation goals with privacy and copyright considerations.

What is the purpose of Heritrix?

Heritrix serves the crucial function of digital preservation in an era where web content is constantly changing or disappearing. It powers the Internet Archive’s Wayback Machine, which has preserved billions of web pages dating back to the 1990s. This service provides researchers, historians, journalists, and the general public with access to historical web content that might otherwise be lost.

The data collected by Heritrix is used to:

Create a historical record of the web
Support academic research into internet history and evolution
Provide evidence for legal and journalistic investigations
Preserve cultural heritage that exists primarily or exclusively online
Maintain access to information that may be removed or altered over time

For website owners, Heritrix provides the benefit of historical preservation, ensuring their content remains accessible even if their site changes or goes offline. This can be particularly valuable for documenting organizational history, preserving creative works, or maintaining access to deprecated documentation.

How do I block Heritrix?

Heritrix is designed to respect the robots.txt standard, making it straightforward to control its access to your site. If you wish to block Heritrix from crawling your entire site, you can add the following to your robots.txt file:

User-agent: heritrix
Disallow: /

If you only want to block access to specific sections of your site, you can use more targeted directives:

User-agent: heritrix
Disallow: /private/
Disallow: /temporary/
Allow: /
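
Before deploying rules like these, it can help to check how a parser reads them. The sketch below feeds the targeted rules above to Python's standard-library urllib.robotparser and asks whether a given URL may be fetched by the "heritrix" user agent. The helper name heritrix_can_fetch and the example.com URLs are assumptions for illustration. Python's parser applies rules in file order, which is not necessarily identical to how Heritrix itself evaluates robots.txt, so treat this as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# The targeted rules from the example above.
ROBOTS_TXT = """\
User-agent: heritrix
Disallow: /private/
Disallow: /temporary/
Allow: /
"""


def heritrix_can_fetch(url, robots_txt=ROBOTS_TXT):
    """Report whether the given robots.txt text lets 'heritrix' fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("heritrix", url)
```

With these rules, heritrix_can_fetch("https://example.com/index.html") should come back True, while URLs under /private/ or /temporary/ should come back False.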

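Note that robots.txt rules only affect future crawls; they do not remove pages that have already been archived. The Internet Archive also honors specific exclusion requests through its removal form. If content from your site has already been archived and you want it removed, you can contact the Internet Archive directly through its website.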

Keep in mind that blocking Heritrix means your site won’t be preserved in the Internet Archive’s Wayback Machine, potentially limiting your site’s historical record. For many site owners, a balanced approach is to allow archiving of public content while blocking sensitive or temporary areas. This preserves your digital legacy while respecting privacy and resource concerns. The Internet Archive generally aims to be a good web citizen, implementing politeness policies that limit request rates to avoid overwhelming servers.
