The Technical GEO Setup Guide: Schema, Robots.txt, and AI Crawler Config

Zach Chmael
Head of Marketing
5 minutes

In This Article
Copy-paste JSON-LD schema templates, robots.txt configs for every AI crawler, and the full technical checklist. Implementation guide, not theory.
TL;DR
🤖 Layer 1 — Robots.txt: Allow all AI crawlers (GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, Google-Extended). Copy-paste config included. 15 minutes.
📋 Layer 2 — Schema: Organization + Article + FAQPage JSON-LD. Sites with complete Tier 1 schema see ~40% more AI Overview appearances. Copy-paste templates for all three included. 2–4 hours initial setup.
⚡ Layer 3 — Performance: FCP under 0.4s = 3x citation probability. Compress images, remove unused JS, enable CDN. 1–4 hours.
📄 llms.txt: Emerging standard for AI communication. Template included. 15 minutes. Low-risk, potential upside.
✅ Full checklist: 30+ items across robots.txt, schema, performance, and additional technical. Run it once, maintain monthly.
🔧 Averi handles per-page technical GEO (Article schema, FAQ schema, content structure) automatically during publishing.

Zach Chmael
CMO, Averi
"We built Averi around the exact workflow we've used to scale our web traffic over 6000% in the last 6 months."
Most GEO guides tell you to "implement schema markup" and "allow AI crawlers" without showing you exactly what to implement.
They describe the what. This guide provides the how — with copy-paste code you can deploy today.
Sites with complete Tier 1 schema see approximately 40% more AI Overview appearances.
Pages with schema markup are 2.8x more likely to be cited by ChatGPT.
Pages with FCP under 0.4 seconds are 3x more likely to be cited.
These aren't content improvements. They're infrastructure improvements that take a few hours to implement and benefit every page on your site permanently.
This guide covers three layers: AI crawler access (robots.txt), structured data (JSON-LD schema), and site performance (speed and technical health). Each section includes the exact code, where to place it, and how to verify it's working.
This is part of the Definitive Guide to Generative Engine Optimization (GEO). The pillar covers the full GEO framework.
This piece is the technical implementation layer.

Layer 1: AI Crawler Access (Robots.txt Configuration)
If AI crawlers can't access your content, they can't cite it. This is the most common technical GEO failure — and the easiest to fix.
AI Crawlers in 2026
Each AI platform operates one or more dedicated web crawlers. These crawlers function independently from Googlebot and Bingbot.
Allowing search engine crawlers does not automatically allow AI crawlers. They must be permitted separately.
| Crawler | User-Agent String | Platform | Purpose |
|---|---|---|---|
| GPTBot | `GPTBot` | OpenAI | ChatGPT training + search |
| OAI-SearchBot | `OAI-SearchBot` | OpenAI | ChatGPT Search (live retrieval) |
| ChatGPT-User | `ChatGPT-User` | OpenAI | ChatGPT browse mode |
| Google-Extended | `Google-Extended` | Google | Gemini / AI training |
| PerplexityBot | `PerplexityBot` | Perplexity | Perplexity search |
| ClaudeBot | `ClaudeBot` | Anthropic | Claude |
| Bytespider | `Bytespider` | ByteDance | TikTok AI |
| CCBot | `CCBot` | Common Crawl | Used by many AI systems |
| Amazonbot | `Amazonbot` | Amazon | Alexa / Amazon AI |
| FacebookBot | `FacebookBot` | Meta | Meta AI |
| Applebot-Extended | `Applebot-Extended` | Apple | Apple Intelligence |
The Recommended Robots.txt Configuration
For startups pursuing GEO, allow all AI crawlers. The citation benefit outweighs the content access concern.
Replace yourdomain.com with your actual domain.
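A baseline configuration along these lines allows every crawler in the table above. The sitemap URL is a placeholder; extend the list as new crawlers appear.

```text
# robots.txt: allow AI crawlers for GEO
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: FacebookBot
Allow: /

User-agent: Applebot-Extended
Allow: /

# Default rule for all other crawlers
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

Keep any existing `Disallow` rules for admin or staging paths; the `Allow: /` directives above only need to cover the content directories you want cited.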
How to Implement
WordPress: Edit the robots.txt file through your SEO plugin (Yoast → Tools → File Editor, or RankMath → General Settings → Edit robots.txt). Or edit the file directly at your site root via FTP/SFTP.
Webflow: Go to your project settings → SEO tab → Custom robots.txt. Paste the full configuration. Publish.
Framer: Add a robots.txt file through your site settings. Framer supports custom robots.txt content.
How to Verify
After updating, verify with these steps:
Visit yourdomain.com/robots.txt in your browser. Confirm the file displays correctly.
In Google Search Console → Settings → Crawl stats → Open report, check for crawl errors.
Use Search Console's robots.txt report (Settings → robots.txt) to confirm Google fetches and parses the file without errors. (The standalone robots.txt Tester from the old interface has been retired.)
The "Should I Block AI Crawlers?" Decision
Some publishers block AI crawlers to prevent training data scraping. This makes sense for large media companies protecting subscription content. For startups building visibility, blocking AI crawlers means:
ChatGPT can't cite your content (GPTBot/OAI-SearchBot blocked)
Perplexity can't cite your content (PerplexityBot blocked)
Google's AI features can't draw from your content (Google-Extended blocked)
ChatGPT drives 87.4% of all AI referral traffic. Blocking GPTBot eliminates your visibility in the dominant AI discovery channel.
For startups, the trade-off is clear: allow everything.

Layer 2: Structured Data (JSON-LD Schema)
Schema markup tells AI systems what your content is, who wrote it, and what entity it represents. Without schema, AI crawlers must infer this information. With schema, you declare it explicitly.
Tier 1: Essential Schema (Implement First)
These three schema types create the minimum viable structured data layer for GEO.
Organization Schema
Place this in the <head> of every page on your site (typically in your site-wide header template).
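A minimal template of the kind described; every name, URL, and date is a placeholder to replace with your own:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://yourdomain.com/#organization",
  "name": "Your Company",
  "url": "https://yourdomain.com/",
  "logo": "https://yourdomain.com/logo.png",
  "foundingDate": "2021-06-01",
  "sameAs": [
    "https://www.linkedin.com/company/yourcompany",
    "https://x.com/yourcompany",
    "https://www.crunchbase.com/organization/yourcompany",
    "https://github.com/yourcompany",
    "https://www.youtube.com/@yourcompany"
  ],
  "knowsAbout": [
    "Generative Engine Optimization",
    "Content Marketing for B2B SaaS Startups",
    "Technical SEO"
  ]
}
</script>
```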
Customization guide:
@id: Use your domain + /#organization. This creates a persistent entity identifier.
sameAs: List every official profile URL. Each one strengthens entity recognition. Include LinkedIn, Twitter/X, Crunchbase, GitHub, YouTube, and any industry directories.
knowsAbout: List 5–8 topics your company has expertise in. These directly inform AI systems about your authority domain. Be specific: "Content Marketing for B2B SaaS Startups" is better than "Marketing."
foundingDate: Establishes entity age. Older entities have stronger recognition signals.
Article Schema
Place this in the <head> of every blog post or article page.
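A sketch with the critical fields discussed below; the title, dates, and URLs are placeholders. The `@id` references resolve because the Organization schema is already on every page.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your Article Title",
  "image": "https://yourdomain.com/images/article-hero.png",
  "datePublished": "2026-01-15",
  "dateModified": "2026-02-01",
  "author": {
    "@type": "Person",
    "name": "Author Name",
    "url": "https://yourdomain.com/about/author-name",
    "worksFor": {
      "@id": "https://yourdomain.com/#organization"
    }
  },
  "publisher": {
    "@id": "https://yourdomain.com/#organization"
  }
}
</script>
```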
Critical fields:
datePublished and dateModified: Must reflect actual dates. Content freshness is a primary GEO signal. When you update an article, update dateModified. Don't update dateModified without making real content changes; Google's John Mueller has warned against this.
author with url: Links the article to a real person page with credentials. AI systems evaluate author authority as part of citation decisions. The author page should exist and include the person's bio, expertise, and other published work.
worksFor connecting to your Organization @id: This tells AI systems that the article author is part of the entity, strengthening the connection between the article, the author, and the organization.
FAQPage Schema
Place this in the <head> of any page with an FAQ section. This is the highest-impact GEO schema.
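A minimal two-question template; the question and answer strings are placeholders that must match your visible FAQ content word for word. Add one Question object per on-page FAQ item.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "First question, exactly as it appears on the page?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The same self-contained answer shown in the visible FAQ section."
      }
    },
    {
      "@type": "Question",
      "name": "Second question, exactly as it appears on the page?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Its matching on-page answer, also self-contained."
      }
    }
  ]
}
</script>
```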
Implementation notes:
Include every FAQ question from your on-page FAQ section. The schema should mirror the visible content exactly.
Each text field should contain the same self-contained answer that appears on the page. Don't put different content in the schema versus the visible page; Google treats this as cloaking.
Add as many Question objects as you have FAQ items. 5–7 is the standard for long-form content.
Tier 2: Enhanced Schema (Implement When Ready)
These additional schema types strengthen GEO signals but aren't essential for getting started.
Person Schema (Author Page)
Create a dedicated author page (e.g., yourdomain.com/about/author-name) with Person schema:
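A sketch along these lines, with placeholder names, titles, and URLs:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "@id": "https://yourdomain.com/about/author-name/#person",
  "name": "Author Name",
  "jobTitle": "Head of Marketing",
  "url": "https://yourdomain.com/about/author-name",
  "worksFor": {
    "@id": "https://yourdomain.com/#organization"
  },
  "sameAs": [
    "https://www.linkedin.com/in/author-name",
    "https://x.com/authorname"
  ],
  "knowsAbout": [
    "Generative Engine Optimization",
    "B2B Content Marketing"
  ]
}
</script>
```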
This schema connects the author entity to external profiles and expertise areas. AI systems use this to evaluate whether the article author is a credible source on the topic.
HowTo Schema
For step-by-step guides and tutorials:
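A minimal sketch with placeholder steps; each HowToStep should mirror a numbered step visible on the page:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "How to Configure Robots.txt for AI Crawlers",
  "step": [
    {
      "@type": "HowToStep",
      "name": "Open your robots.txt",
      "text": "Edit the robots.txt file at your site root via your CMS or FTP."
    },
    {
      "@type": "HowToStep",
      "name": "Add Allow rules",
      "text": "Add User-agent and Allow directives for each AI crawler."
    },
    {
      "@type": "HowToStep",
      "name": "Verify",
      "text": "Load yourdomain.com/robots.txt and check crawl stats in Search Console."
    }
  ]
}
</script>
```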
How to Implement Schema on Each Platform
WordPress:
Install a schema plugin (Schema Pro, Rank Math, or Yoast SEO Premium)
Most plugins auto-generate Article schema from post metadata
Add Organization schema as a custom code snippet in your theme's <head> (via Appearance → Theme Editor → header.php, or use a plugin like WPCode)
For FAQPage schema: use Rank Math's FAQ block or add manually with WPCode
Webflow:
Add JSON-LD as custom code in Project Settings → Custom Code → Head Code (for site-wide Organization schema)
For page-specific Article and FAQ schema: add custom code in each page's settings → Custom Code → Inside <head> tag
Framer:
Add JSON-LD scripts in your site settings under Custom Code → Head
For page-specific schema: use the page-level custom code injection
How to Verify Schema
Google Rich Results Test: Go to search.google.com/test/rich-results. Enter your URL. It shows which schema types are detected and flags any errors.
Schema.org Validator: Go to validator.schema.org. Paste your JSON-LD code directly. It validates syntax and structure.
Google Search Console → Enhancements: After implementation, GSC shows FAQ, Article, and other rich result eligibility across your site. Check for errors weekly for the first month after implementation.
Common schema errors to avoid:
Missing required fields (headline, author, datePublished for Article)
Mismatched content between schema text and visible page content
Invalid date formats (use ISO 8601: YYYY-MM-DD)
Broken URLs in sameAs or image fields
Nested schema referencing an @id that doesn't exist on the page

Layer 3: Site Performance for AI Citation
AI crawlers evaluate page speed when selecting sources. Slow pages get skipped even when content quality is high.
The Speed Benchmarks That Matter for GEO
Pages with First Contentful Paint under 0.4 seconds are 3x more likely to be cited by ChatGPT than pages above 1.13 seconds. Pages with INP scores of 0.4–0.5 seconds have 1.6x higher citation chances than those above 1 second.
Target benchmarks:
Metric | Good | Needs Work | Poor |
|---|---|---|---|
First Contentful Paint (FCP) | Under 0.4s | 0.4–1.0s | Over 1.0s |
Largest Contentful Paint (LCP) | Under 2.5s | 2.5–4.0s | Over 4.0s |
Interaction to Next Paint (INP) | Under 200ms | 200–500ms | Over 500ms |
Cumulative Layout Shift (CLS) | Under 0.1 | 0.1–0.25 | Over 0.25 |
Quick Wins for Speed Improvement
These fixes address 80% of speed issues for most startup websites:
Image optimization. Compress all images. Use WebP format. Lazy-load images below the fold. A single uncompressed hero image can add 2+ seconds to LCP.
Remove unused JavaScript. Audit your third-party scripts. Every analytics tag, chat widget, and tracking pixel adds load time. Remove anything you're not actively using. Defer non-critical scripts.
Enable CDN. If your hosting doesn't include a CDN, add Cloudflare (free tier works). CDN caching reduces server response time for users (and crawlers) worldwide.
Minimize render-blocking CSS. Inline critical CSS. Defer non-critical stylesheets. This directly improves FCP.
How to Measure
Google PageSpeed Insights: pagespeed.web.dev. Enter your URL for FCP, LCP, CLS, and INP scores with specific recommendations.
Google Search Console → Core Web Vitals: Shows site-wide performance with pages grouped by status (Good, Needs Improvement, Poor).
WebPageTest.org: Advanced waterfall analysis showing exactly which resources delay loading.
The llms.txt File (Emerging Standard)
A newer convention for communicating directly with AI systems. Place a Markdown file at yourdomain.com/llms.txt that describes your site and its most important content.
Template
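A minimal template following the llms.txt convention (H1 name, blockquote summary, H2 sections of linked pages); every name, page title, and URL below is a placeholder:

```markdown
# Your Company

> One-sentence description of what your company does and who it serves.

## About

Two or three sentences on your expertise areas and why your content is a
credible source on those topics.

## Key Pages

- [Definitive Guide to GEO](https://yourdomain.com/blog/geo-guide): The full GEO framework
- [Technical GEO Setup](https://yourdomain.com/blog/technical-geo-setup): Schema, robots.txt, and crawler configuration
- [Product Overview](https://yourdomain.com/product): What the product does and for whom

## Contact

- Email: hello@yourdomain.com
- Site: https://yourdomain.com
```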
Current Status
The llms.txt standard is not universally adopted. Perplexity and some Common Crawl-based systems have shown early support. It's not confirmed that ChatGPT or Google AI read it. Implementation takes 15 minutes and has no downside, so it's worth adding even if the impact is uncertain. Think of it as a free option on future AI crawler behavior.
The Complete Technical GEO Checklist
Run this checklist on your site. Each item takes minutes to hours, not days.
Robots.txt (15 minutes)
☐ robots.txt exists at site root
☐ GPTBot allowed
☐ OAI-SearchBot allowed
☐ ChatGPT-User allowed
☐ PerplexityBot allowed
☐ ClaudeBot allowed
☐ Google-Extended allowed
☐ Sitemap URL included in robots.txt
☐ No blanket Disallow: / rules blocking content directories
Schema Markup (2–4 hours initial setup)
☐ Organization JSON-LD on every page (site-wide header)
☐ @id set for Organization
☐ sameAs includes all brand profile URLs (5+ platforms)
☐ knowsAbout includes 5–8 expertise topics
☐ Article JSON-LD on every blog post
☐ author linked to real person with URL
☐ datePublished and dateModified populated with real dates
☐ publisher references Organization @id
☐ FAQPage JSON-LD on every page with FAQ section
☐ FAQ schema text matches visible page content exactly
☐ Schema validated with Google Rich Results Test (zero errors)
Site Performance (1–4 hours depending on current state)
☐ FCP under 1 second (ideally under 0.4s)
☐ LCP under 2.5 seconds
☐ INP under 200ms
☐ CLS under 0.1
☐ Images compressed and in WebP format
☐ Unused JavaScript removed or deferred
☐ CDN active
☐ HTTPS active (no mixed content warnings)
☐ Mobile responsive
Additional Technical (30 minutes)
☐ Bing Webmaster Tools connected (ChatGPT sources from Bing)
☐ XML sitemap submitted to both Google and Bing
☐ Author page exists with Person schema
☐ llms.txt file placed at site root
☐ No login walls or paywalls on content you want cited
Maintenance Schedule
Technical GEO isn't a one-time setup. It requires periodic maintenance.
Weekly (5 minutes):
Check Google Search Console for new crawl errors or schema validation issues
Monthly (15 minutes):
Verify robots.txt hasn't been overwritten by CMS updates or plugin changes
Check that new blog posts have Article and FAQ schema (some CMS themes drop schema on new templates)
Review Core Web Vitals in GSC for any performance regressions
Quarterly (30 minutes):
Audit sameAs links in Organization schema and add any new brand profiles created during the quarter
Update knowsAbout if your expertise areas have expanded
Check for new AI crawlers that should be allowed (new crawlers appear regularly)
Re-validate all schema with the Rich Results Test
How Averi Handles Technical GEO
For startups that want the technical GEO layer handled automatically, Averi's content engine builds these elements into the publishing workflow:
Schema generation: Organization schema guidance provided during onboarding.
FAQ structure: Every piece includes a 5–7 question FAQ section with self-contained answers formatted for both human reading and schema extraction.
Content scoring: The 55% SEO / 45% GEO scoring system evaluates structural elements (answer capsules, extractable blocks, factual density) before publishing.
CMS publishing: Direct publishing to WordPress, Webflow, and Framer preserves schema and formatting without manual code insertion.
The robots.txt configuration, site performance optimization, and Organization schema are still site-level tasks that need to be done once on your end.
Averi handles the per-page technical GEO: the Article schema, FAQ schema, and content structure that make each piece citation-ready.
Start a free 14-day trial. No credit card. The technical GEO content layer applies to every piece you publish through the engine.
Related Resources
The Definitive Guide to Generative Engine Optimization (GEO)
Schema Markup for AI Citations: The Technical Implementation Guide
Google AI Overviews Optimization: How to Get Featured in 2026
Beyond Google: How to Get Your Startup Cited by ChatGPT, Perplexity, and AI Search
SEO for Startups: How to Rank Higher Without a Big Budget in 2026
The Reddit-AI Search Connection: How User-Generated Mentions Become LLM Citations
FAQs
What schema markup do I need for GEO?
Three essential types. Organization JSON-LD on every page (establishes your entity with @id, sameAs profile links, and knowsAbout expertise topics). Article JSON-LD on every blog post (with author, dates, and publisher reference). FAQPage JSON-LD on every page with an FAQ section (question-answer pairs matching visible content). Sites with complete Tier 1 schema see approximately 40% more AI Overview appearances. Validate with Google's Rich Results Test after implementation.
Which AI crawlers should I allow in robots.txt?
All of them, if you want AI citations. The critical ones: GPTBot and OAI-SearchBot (ChatGPT), PerplexityBot (Perplexity), ClaudeBot (Claude), and Google-Extended (Google AI/Gemini). ChatGPT drives 87.4% of all AI referral traffic. Blocking GPTBot eliminates your content from the dominant AI discovery channel. The full robots.txt configuration with all AI crawlers is included in this guide. Copy and paste it directly.
Does page speed actually affect AI citations?
Yes. Pages with FCP under 0.4 seconds are 3x more likely to be cited by ChatGPT than pages above 1.13 seconds. AI retrieval systems operate under time constraints. When evaluating multiple candidate pages for citation, slow-loading pages risk being skipped regardless of content quality. Target FCP under 1 second (ideally under 0.4s), LCP under 2.5 seconds, and INP under 200ms. Quick wins: compress images to WebP, remove unused JavaScript, and enable a CDN.
How do I implement schema on WordPress?
Install a schema plugin (Rank Math, Yoast SEO Premium, or Schema Pro). Most auto-generate Article schema from your post metadata. Add Organization JSON-LD as a custom code snippet in your site-wide header using a plugin like WPCode or through Appearance → Theme Editor → header.php. For FAQPage schema, use Rank Math's built-in FAQ block or add JSON-LD manually via WPCode. Validate each page with Google's Rich Results Test after implementation.
What is llms.txt and should I implement it?
llms.txt is an emerging standard for communicating directly with AI systems, similar to how robots.txt communicates with web crawlers. It's a Markdown file placed at your site root that describes your company, expertise, and most important content pages. Perplexity and some Common Crawl-based systems show early support. Implementation takes 15 minutes and has no downside. It's not confirmed that ChatGPT or Google AI read it yet, so treat it as a low-cost option on future AI behavior rather than a required element.
How often do I need to maintain technical GEO setup?
Weekly: 5-minute check of Google Search Console for crawl errors and schema issues. Monthly: 15-minute verification that robots.txt hasn't been overwritten, new posts have proper schema, and Core Web Vitals haven't regressed. Quarterly: 30-minute audit updating sameAs links for new brand profiles, expanding knowsAbout topics, checking for new AI crawlers, and re-validating schema. The initial setup takes 2–4 hours total. Maintenance is minimal after that.
Do I need Bing Webmaster Tools for GEO?
Yes. 73% of ChatGPT's results align with Bing's search results. ChatGPT's retrieval system pulls from Bing's index. If your content isn't indexed on Bing, ChatGPT can't retrieve or cite it. Connect Bing Webmaster Tools (free), submit your sitemap, and verify your content appears in Bing's index. Many site owners focus only on Google and are invisible to the largest AI citation platform because they neglected Bing.






