Applebot-Extended
What is Applebot-Extended?
Applebot-Extended is a specialized web crawler operated by Apple that was introduced in June 2024 alongside Apple Intelligence at the Worldwide Developers Conference (WWDC). This crawler serves as a secondary agent to the original Applebot, with a specific focus on identifying web content that can be used for training Apple's generative artificial intelligence (AI) models, particularly their large language models (LLMs).
As a web crawler, Applebot-Extended works differently from traditional crawlers. Rather than independently crawling websites, it evaluates content already indexed by the primary Applebot to determine if that content is eligible for inclusion in AI training datasets. This approach creates a separation between content indexing (handled by the original Applebot) and content utilization for AI training (managed by Applebot-Extended).
When Applebot-Extended appears in your server logs, it identifies itself with the user agent string Applebot-Extended. This distinguishes it from the standard Applebot crawler that Apple uses for services like Siri and Spotlight search.
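If you want to spot these requests in your own logs, a minimal sketch in Python might look like the following. The log line, its combined log format, and the exact shape of the user agent string are illustrative assumptions; check your server's actual log format and Apple's documentation for the authoritative token.

```python
import re

# A hypothetical access log line in combined format (for illustration only)
log_line = (
    '203.0.113.7 - - [12/Aug/2024:10:00:00 +0000] '
    '"GET /article.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Applebot-Extended)"'
)

def is_applebot_extended(line: str) -> bool:
    """Return True if the log line's user-agent field mentions Applebot-Extended."""
    # In the combined log format, the user agent is the last quoted field.
    quoted_fields = re.findall(r'"([^"]*)"', line)
    return bool(quoted_fields) and "Applebot-Extended" in quoted_fields[-1]

print(is_applebot_extended(log_line))  # True
```

Matching on the last quoted field (rather than anywhere in the line) avoids false positives from URLs or referrers that happen to contain the string.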
The key characteristic of Applebot-Extended is its post-crawling filtering function. It doesn't directly crawl websites but instead acts as a secondary evaluation mechanism for content already crawled by the primary Applebot.
Why is Applebot-Extended crawling my site?
Applebot-Extended is visiting your site to evaluate if your publicly available web content is suitable for training Apple's AI models. It's specifically looking for high-quality, text-dense content that could enhance the performance of Apple's foundation models.
Since Applebot-Extended works in conjunction with the primary Applebot, the frequency of visits depends on how often the original Applebot crawls your site. It's not crawling independently but rather assessing content that has already been indexed.
The crawler is triggered when Apple's systems determine that your content might be valuable for AI training purposes. This is legitimate crawling activity from Apple: it respects standard web protocols and provides a mechanism for opting out.
What is the purpose of Applebot-Extended?
The primary purpose of Applebot-Extended is to identify and collect suitable web content that can be used to train and improve Apple's generative AI models. These models power features under the umbrella of "Apple Intelligence," including advanced AI-driven tools in developer environments, services, and consumer-facing products.
The data collected by Applebot-Extended supplements other sources like licensed materials and proprietary datasets that Apple uses to train its foundation models. By focusing on high-quality web content such as academic publications, news articles, and technical documentation, Apple aims to enhance its AI capabilities while minimizing reliance on user-generated or private data.
For website owners, the value proposition is indirect. While there's no immediate benefit from being crawled by Applebot-Extended, your content could potentially contribute to improving AI systems that millions of Apple users interact with daily.
How do I block Applebot-Extended?
Applebot-Extended respects standard robots.txt directives, making it straightforward to control or block. If you wish to prevent Apple from using your website's content for AI training while still allowing it to be indexed for search purposes, you can add the following directive to your robots.txt file:
User-agent: Applebot-Extended
Disallow: /
This directive specifically blocks Applebot-Extended while still allowing the standard Applebot to crawl your site. This means your content can still appear in Apple's search results and services like Siri, but won't be used for training Apple's AI models.
You can also choose to block specific sections of your site rather than the entire domain. For example, to block only a private directory:
User-agent: Applebot-Extended
Disallow: /private/
Blocking Applebot-Extended has no negative impact on your site's visibility in Apple's search services since the primary Applebot is still allowed to crawl and index your content. The only consequence is that your content won't contribute to the training of Apple's AI models.
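You can sanity-check this behavior before deploying a robots.txt change. The sketch below uses Python's standard urllib.robotparser to confirm that the directive above denies Applebot-Extended while leaving the standard Applebot unaffected; the URL path is just an example.

```python
from urllib import robotparser

# The robots.txt rules described above: block only Applebot-Extended
robots_txt = """\
User-agent: Applebot-Extended
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Applebot-Extended matches the group and is denied everywhere...
print(rp.can_fetch("Applebot-Extended", "/article.html"))  # False

# ...while Applebot matches no group and falls through to "allowed".
print(rp.can_fetch("Applebot", "/article.html"))  # True
```

Because there is no `User-agent: *` group in this file, every other crawler, including the standard Applebot, remains unrestricted by default.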