Your robots.txt file may be hiding your site from Gemini, and you would never know it from your search rankings. Google runs two distinct crawler families, and a single misplaced directive will quietly cut you off from one without affecting the other.
Two Crawler Families, One Domain
Google operates a family of crawlers[1]:
- Googlebot indexes pages for Google Search
- Google-Extended controls whether content is used for AI training and Gemini's real-time answers[2]
- AdsBot evaluates landing page quality for Google Ads
- APIs-Google serves Google APIs and internal products
Google-Extended was introduced in September 2023 as a standalone product token that controls AI training access independently of search indexing.
How They Differ
Googlebot crawls for search relevance, while Google-Extended governs comprehension: it controls whether your content may be used to train Gemini's models and to ground Gemini's real-time answers. The key distinction is that they respond to different robots.txt directives. A rule targeting Googlebot does not apply to Google-Extended, and vice versa.
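That independence is easy to check with Python's standard-library robots.txt parser. A minimal sketch, assuming a hypothetical robots.txt that restricts only Googlebot (`example.com` is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: only Googlebot is restricted.
rules = """
User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The Googlebot group does not apply to Google-Extended,
# so Google-Extended may still fetch the restricted path.
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))        # False
print(parser.can_fetch("Google-Extended", "https://example.com/private/page"))  # True
```

Because robots.txt groups are matched per user-agent token, the `Disallow` under `User-agent: Googlebot` is invisible to every other crawler.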
How to Control Access
Block only Google-Extended (keep search, opt out of AI):
```
User-agent: Google-Extended
Disallow: /
```
Block all (removes from both search and AI):
```
User-agent: *
Disallow: /
```
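The effect of the first configuration can be verified with Python's standard-library parser. A sketch, assuming `example.com` as a placeholder site:

```python
from urllib.robotparser import RobotFileParser

# The "block only Google-Extended" configuration from above.
robots_txt = "User-agent: Google-Extended\nDisallow: /"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Search crawling is untouched; AI access is cut off.
print(parser.can_fetch("Googlebot", "https://example.com/"))        # True
print(parser.can_fetch("Google-Extended", "https://example.com/"))  # False
```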
Common Mistakes
- Accidentally blocking Google-Extended via a catch-all `User-agent: *` rule without realizing it
- Assuming that blocking Google-Extended only affects training, when it also affects Gemini's real-time answers
- Relying on `noindex` to opt out of AI: `noindex` controls search indexing, not crawling, and does not block Google-Extended
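The catch-all mistake is easy to reproduce: a `User-agent: *` group blocks Google-Extended even though it is never named. A minimal sketch using Python's standard-library parser; the `ai_access_report` helper, sample paths, and `example.com` domain are hypothetical:

```python
from urllib.robotparser import RobotFileParser

def ai_access_report(robots_txt: str, paths=("/", "/blog/")):
    """Report which sample paths Google-Extended may still fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {p: parser.can_fetch("Google-Extended", "https://example.com" + p)
            for p in paths}

# A catch-all rule: Google-Extended is blocked without ever being named.
report = ai_access_report("User-agent: *\nDisallow: /")
print(report)  # {'/': False, '/blog/': False}
```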
How Scanner Helps
Scanner audits your robots.txt for unintentional Google-Extended blocks. See also How Agent Crawlers Work.