Google-Extended

What is Google-Extended?

Google-Extended is a specialized control mechanism created by Google that allows website owners to manage whether their content can be used for training Google's AI models. Introduced in September 2023, Google-Extended is not a traditional web crawler but rather a "standalone product token" within Google's crawler ecosystem.

Unlike Googlebot (which indexes content for search results), Google-Extended operates as a directive mechanism within the Robots Exclusion Protocol (REP). It doesn't independently crawl websites but works alongside Google's existing crawling infrastructure to signal whether a site's content may be used for AI training purposes.

Google-Extended specifically governs content access for Google Gemini (formerly Bard) and Vertex AI generative APIs. When Google's systems crawl a website, they check the robots.txt file for Google-Extended directives to determine if the content can be used for training these AI systems.

In server logs, you won't see a distinct user agent string for Google-Extended since it doesn't crawl independently. Instead, the standard Googlebot identifiers appear in logs, while the Google-Extended token in robots.txt determines how that crawled content can be used.
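
One way to see this for yourself is to count user agent tokens in an access log. The sketch below is only an illustration: the log lines are made-up samples in the common "combined" format, the IP addresses come from the reserved documentation range, and `count_tokens` is a hypothetical helper. Only the Googlebot user agent string follows the form Google publishes.

```python
import re

# Made-up sample lines in combined log format (illustration only).
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /article HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.11 - - [10/Oct/2023:13:56:02 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.50 - - [10/Oct/2023:14:01:17 +0000] "GET / HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) Firefox/118.0"',
]

# In combined log format the user agent is the last quoted field on the line.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_tokens(lines, tokens=("Googlebot", "Google-Extended")):
    """Count how many log lines carry each token in the user agent field."""
    counts = {token: 0 for token in tokens}
    for line in lines:
        match = LAST_QUOTED_FIELD.search(line)
        user_agent = match.group(1).lower() if match else ""
        for token in tokens:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    print(count_tokens(SAMPLE_LOG))
```

On real logs you would expect the same pattern as in this sample: requests identified as Googlebot, and no requests identified as Google-Extended.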

Why is Google-Extended crawling my site?

Google-Extended isn't actually crawling your site separately from Google's regular crawling activities. When Google's regular crawlers (like Googlebot) visit your site, they're collecting content that may be used for different purposes, including search indexing and potentially AI training.

The Google-Extended mechanism simply provides you with control over whether content that's already being crawled can be used specifically for training Google's AI models. If you haven't explicitly blocked Google-Extended in your robots.txt file, Google may use the content crawled from your site to improve its AI systems.

This happens automatically as part of Google's normal crawling. How often Google's crawlers visit depends on your site's crawl budget, which reflects factors such as how often your content changes, how popular your pages are, and how well your server handles crawl requests.

What is the purpose of Google-Extended?

Google-Extended serves as an opt-out mechanism that gives publishers control over how their content is used in Google's AI ecosystem. Its primary purpose is to let website owners decide whether content Google has crawled from their site may be used in training Google's AI models.

Specifically, Google-Extended controls whether content can be used to train and improve:

  • Google Gemini (Google's conversational AI assistant)
  • Vertex AI generative APIs (Google's developer platform for building AI applications)

This mechanism was created in response to growing concerns from publishers about how their content is being used to train AI systems. By providing this control, Google aims to be transparent about its data collection practices and respect the wishes of content creators.

Importantly, Google-Extended only affects AI training permissions and has no impact on a site's inclusion or ranking in Google Search. This means website owners can opt out of AI training without worrying about negative consequences for their search visibility.

How do I block Google-Extended?

If you don't want your content to be used for training Google's AI models, you can block Google-Extended in your robots.txt file.

To block Google-Extended, add these lines to your robots.txt file:

User-agent: Google-Extended
Disallow: /

This directive tells Google not to use any content from your site for training the AI models behind Gemini and the Vertex AI generative APIs.
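
To sanity-check the rule before deploying it, you can run it through a robots.txt parser. The following is a minimal sketch using Python's standard-library `urllib.robotparser`. That parser is not Google's own implementation, so treat the result as an approximation; `is_allowed` and the example URL are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from this section, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if `token` is allowed to use `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    url = "https://example.com/articles/post.html"  # hypothetical URL
    print("Google-Extended allowed:", is_allowed(ROBOTS_TXT, "Google-Extended", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

With these rules the parser reports Google-Extended as disallowed and Googlebot as allowed, which matches the intent: AI training is opted out while search crawling is left alone.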

If you want to be more selective, you can block specific directories or files:

User-agent: Google-Extended
Disallow: /private-content/
Disallow: /research-papers/
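
The selective rules can be checked the same way. This is a sketch under the same assumptions as before: Python's standard-library `urllib.robotparser` stands in for Google's parser, and `training_allowed` and the example paths under `example.com` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The selective rules from this section.
SELECTIVE_RULES = """\
User-agent: Google-Extended
Disallow: /private-content/
Disallow: /research-papers/
"""


def training_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the rules leave `url` open to the Google-Extended token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Google-Extended", url)


if __name__ == "__main__":
    # Hypothetical URLs: two under blocked directories, one elsewhere.
    for path in ("/private-content/report.html", "/research-papers/2023.pdf", "/blog/post.html"):
        url = "https://example.com" + path
        print(path, "->", "allowed" if training_allowed(SELECTIVE_RULES, url) else "blocked")
```

Paths under the two listed directories come back as blocked, while everything else on the site remains available for AI training.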

Remember that blocking Google-Extended only prevents your content from being used for AI training. It does not affect how Google indexes your site or how your pages rank, so your content will still appear in Google Search as usual.

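Google respects the robots.txt directives for Google-Extended, so this method is sufficient for controlling how your content is used. Server-level measures such as IP blocking or user-agent blocking are unnecessary, and they cannot target Google-Extended anyway: because it has no user agent string of its own, such rules would block Googlebot itself and could hurt your visibility in Google Search.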

Operated by: Google
Type: Data collector
AI model training: Used to train AI or LLMs
Acts on behalf of user: No, operates independently of any user action
Obeys directives: Yes, obeys robots.txt rules
User agent: Uses standard Googlebot user agent strings