Arquivo-web-crawler
What is Arquivo-web-crawler?
Arquivo-web-crawler is a specialized web archiving bot operated by Arquivo.pt, Portugal's national web archive service. It is a preservation-focused crawler designed to capture and store web content for historical and research purposes. The crawler is built on the open-source Heritrix web archiving platform and identifies itself in server logs with the user agent string Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling).
Unlike commercial search engine crawlers that prioritize text and structured data, Arquivo-web-crawler aims to retrieve complete web pages including all assets (CSS, JavaScript, images) to ensure accurate historical rendering of content. This comprehensive approach enables the faithful recreation of web pages as they existed at specific points in time. The crawler operates on a periodic schedule, with visitation frequency varying based on a website's perceived historical and cultural significance.
Why is Arquivo-web-crawler crawling my site?
Arquivo-web-crawler visits websites as part of Portugal's digital preservation initiative. If you're seeing this crawler in your logs, your site has been identified as content worth preserving for historical and research purposes. The crawler is particularly interested in websites with cultural, educational, or historical significance, especially those within the Portuguese web sphere.
High-traffic or government-domain sites may receive daily crawls, while lesser-known domains might be archived monthly or quarterly. This prioritization reflects Arquivo.pt's mission to preserve culturally significant content while efficiently allocating resources. The crawler's visits are authorized as part of legitimate web archiving efforts, similar to those conducted by other national libraries and the Internet Archive.
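If you want to confirm these visits yourself, a minimal sketch of scanning server logs for this user agent might look like the following. The sample lines are hypothetical and assume the common Apache/Nginx "combined" log format; adjust the matching for whatever format your server actually writes.

```python
# Hypothetical sample lines in combined log format (assumption: your server
# uses this format; the second line is a non-crawler request for contrast).
LOG_LINES = [
    '1.2.3.4 - - [01/May/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 1234 '
    '"-" "Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling)"',
    '5.6.7.8 - - [01/May/2024:10:00:01 +0000] "GET /about HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0"',
]

def count_arquivo_hits(lines):
    """Count requests whose user-agent field mentions Arquivo-web-crawler."""
    return sum(1 for line in lines if "Arquivo-web-crawler" in line)

print(count_arquivo_hits(LOG_LINES))  # prints 1
```

A plain substring match is enough here because the crawler's token is distinctive; stricter parsing of the quoted user-agent field would avoid false positives from referrer strings.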
What is the purpose of Arquivo-web-crawler?
Arquivo-web-crawler supports Arquivo.pt's mission to preserve the Portuguese web for future generations. The crawler collects web content that would otherwise disappear as websites change or go offline, creating a historical record of the internet's evolution. This preservation work serves researchers, historians, journalists, and the general public who need access to past versions of web content.
The data collected is stored in standardized WARC (Web ARChive) files and made publicly accessible through Arquivo.pt's web interface. This allows users to "travel back in time" and view websites as they appeared in the past. For website owners, this service provides a form of digital preservation without requiring any action on their part. Your content becomes part of a permanent historical record, potentially increasing its long-term impact and accessibility.
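A WARC record is just a text header block followed by a payload, which makes the format easy to inspect. The sketch below parses the headers of a single hypothetical, uncompressed record; real archive files are typically gzip-compressed and contain many records, so in practice a dedicated library such as warcio is the better tool.

```python
# Hypothetical uncompressed WARC 1.0 record, for illustration only.
RAW_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.pt/\r\n"
    "WARC-Date: 2024-05-01T10:00:00Z\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, WARC!\n"
)

def parse_warc_headers(raw: str) -> dict:
    """Split a WARC record into its named header fields (version line excluded)."""
    header_block, _, _body = raw.partition("\r\n\r\n")
    lines = header_block.split("\r\n")
    assert lines[0].startswith("WARC/")  # version line, e.g. WARC/1.0
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name] = value
    return headers

headers = parse_warc_headers(RAW_RECORD)
print(headers["WARC-Target-URI"])  # prints https://example.pt/
```

The WARC-Target-URI and WARC-Date fields are what allow an archive's replay interface to reconstruct "the page as it appeared on a given date".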
How do I block Arquivo-web-crawler?
Arquivo-web-crawler respects the Robots Exclusion Protocol (robots.txt), making it straightforward to control its access to your website. To completely block the crawler from accessing your site, add the following directives to your robots.txt file:
User-agent: Arquivo-web-crawler
Disallow: /
If you want to allow archiving of most content while protecting specific sections, you can use more targeted directives:
User-agent: Arquivo-web-crawler
Disallow: /private-directory/
Disallow: /sensitive-data/
Allow: /
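You can sanity-check directives like these before deploying them. A minimal sketch using Python's standard-library robots.txt parser, with the targeted rules from above (the paths are the same hypothetical examples):

```python
from urllib.robotparser import RobotFileParser

# The targeted robots.txt rules shown above.
ROBOTS_TXT = """\
User-agent: Arquivo-web-crawler
Disallow: /private-directory/
Disallow: /sensitive-data/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ua = "Arquivo-web-crawler"
print(parser.can_fetch(ua, "https://example.pt/articles/history"))     # prints True
print(parser.can_fetch(ua, "https://example.pt/private-directory/x"))  # prints False
```

Note that urllib's parser applies rules in file order, so the Disallow lines take effect before the catch-all Allow; real crawlers may instead use longest-match precedence, which gives the same result for these rules.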
For page-level exclusion, you can also use the ROBOTS meta tag in your HTML:
<meta name="ROBOTS" content="NOARCHIVE, NOFOLLOW" />
The NOARCHIVE directive prevents archival storage of the page, while NOFOLLOW stops the crawler from following its links. Arquivo.pt honors the NOARCHIVE directive, which some other crawlers might ignore.
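To audit which of your pages carry such a tag, a small sketch with Python's standard-library HTML parser (the sample page is hypothetical):

```python
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Collect directive values from <meta name="robots"> tags (case-insensitive)."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "robots":
                self.directives.extend(
                    d.strip().upper() for d in attrs.get("content", "").split(",")
                )

# Hypothetical page using the tag shown above.
PAGE = '<html><head><meta name="ROBOTS" content="NOARCHIVE, NOFOLLOW" /></head></html>'

scanner = RobotsMetaScanner()
scanner.feed(PAGE)
print("NOARCHIVE" in scanner.directives)  # prints True
```

Running a scanner like this across your templates confirms that the tag actually reaches the rendered pages, since directives placed outside <head> may be ignored.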
Before blocking, consider that allowing archiving contributes to digital preservation and historical documentation. Blocking the crawler means your content won't be preserved in Portugal's national web archive, potentially limiting its historical impact and future accessibility. If you have privacy concerns, Arquivo.pt adheres to EU GDPR requirements and provides takedown procedures for removal requests after crawling.
Operated by: Arquivo.pt
Type: Content archiver
Documentation: https://arquivo.pt/faq-crawling
Obeys directives: Yes (robots.txt)
User Agent: Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling)