# robots-txt-format — Crawler Directive Format Specification
# KNO Schema Version: 0.1.0
#
# Describes the `robots.txt` format — the de facto standard (RFC 9309)
# for declaring crawler access policy at the site root.
#
# INDUSTRY CONTEXT:
# Originally proposed by Martijn Koster in 1994 ("Robots Exclusion
# Protocol"). Standardized as RFC 9309 in September 2022. Recognized
# by every major search crawler and by AI agent fetchers from OpenAI
# (GPTBot, ChatGPT-User, OAI-SearchBot), Anthropic (ClaudeBot,
# Claude-User, Claude-SearchBot), Google (Google-Extended), and others.
#
# POSSIBILITY POSITIONING:
# Produced by thin transformation from `agent-policy-schema`-conforming
# entities (see specs/agent-policy-schema.kno). The production rule IS
# the schema-graph edge `produces: [robots-txt-format]`. No conversion
# spec exists; the renderer reads the policy entity and emits the file.
#
# This is a sibling of `llms-txt-format` (specs/llms-txt-format.kno):
# both are "policy / discovery" artifacts at well-known paths consumed
# by automated agents. Both are ingest-only (not round-trippable).

# =============================================================================
# SCHEMA DECLARATION
# =============================================================================

$schema: kno@0.0.9

# =============================================================================
# BASIC TIER
# =============================================================================

id: 01KPY2N1SXNM9X0NV4N6EP3E87
slug: robots-txt-format
type: spec
version: 0.1.0

# =============================================================================
# STANDARD TIER
# =============================================================================

title: "robots.txt — Crawler Directive Format"

purpose: |
  Define `robots-txt-format` as a policy-projection output format for
  .kno systems.

  **What is robots.txt?** A plain-text file served at `/robots.txt` (the
  site root) declaring which crawlers may access which paths. Standardized
  as RFC 9309 (September 2022). Every major crawler — search engines, AI
  agents, social media unfurlers — checks this file before fetching.

  **Why a distinct format?** robots.txt is a **policy projection**, not a
  content artifact. It serves a different consumer (crawlers) than llms.txt
  (LLM aggregators) or sitemap.xml (URL index). Modeling it as a first-class
  format makes the producer relationship explicit: `agent-policy-schema`
  declares `produces: [robots-txt-format]` and the renderer is a thin
  transformation.

  **Ingest-only:** Like llms.txt, robots.txt is NOT round-trippable. It is
  a one-way projection from a structured policy entity into a flat
  directive file. Parsing robots.txt back into a policy entity is out of
  scope (and would discard the rich `vendor` / `notes` / `role` metadata
  that exists only in the source).

  **Possibility's implementation:** `content/agent-policy.kno` is the
  single canonical instance. The Astro route at
  `services/pspace-site/src/pages/robots.txt.ts` reads it and emits the
  RFC 9309 text. There is no static `public/robots.txt`.
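
  In rough outline, the route is a thin wrapper around the renderer
  described under `spec.production`; a fuller sketch of that renderer
  appears under `spec.examples`. The code below is illustrative only: the
  `loadAgentPolicy` / `renderRobotsTxt` helpers and their import paths are
  assumed names, not the actual code in the repository.

  ```ts
  import type { APIRoute } from "astro";
  import { createHash } from "node:crypto";
  // Assumed helpers: load the policy entity, render the RFC 9309 text.
  import { loadAgentPolicy } from "../lib/agent-policy";
  import { renderRobotsTxt } from "../lib/robots-render";

  export const GET: APIRoute = async () => {
    const policy = await loadAgentPolicy("content/agent-policy.kno");
    const body = renderRobotsTxt(policy);
    return new Response(body, {
      headers: {
        "Content-Type": "text/plain; charset=utf-8",
        // Content-hash ETag keeps the deterministic projection
        // cache-friendly (see spec.serving).
        ETag: `"${createHash("sha256").update(body).digest("hex")}"`,
        "Cache-Control": "public, max-age=86400",
      },
    });
  };
  ```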
# =============================================================================
# RICH TIER
# =============================================================================

provenance:
  origin:
    id: 01KPY2N1SXNM9X0NV4N6EP3E87
    timestamp: "2026-04-23T00:00:00Z"
    tool: manual-authoring
    issue: "https://github.com/PossibilityTruthy/possibility-space/issues/1807"

taxonomy:
  topics:
    - file-formats
    - crawler-directives
    - site-policy
    - well-known-paths
    - ai-agents
  keywords:
    - robots.txt
    - rfc-9309
    - user-agent
    - disallow
    - allow
    - sitemap
    - crawler

relationships:
  depends_on:
    - xri: "kno://specs/kno-spec"
      reason: "Conforms to KNO format specification"
  related_to:
    - xri: "kno://specs/llms-txt-format"
      reason: "Sibling well-known artifact for AI agent discovery"
    - xri: "kno://specs/agent-policy-schema"
      reason: "Primary producer of robots.txt artifacts in Possibility"
  enables:
    - xri: "kno://capabilities/crawler-policy-enforcement"
      reason: "Standard mechanism for declaring crawler access policy"

quality:
  completeness: 0.85
  last_reviewed: "2026-04-23"
  review_status: draft
  reviewed_by: "claude"

# =============================================================================
# HISTORY
# =============================================================================

_history:
  retention: full
  format: changelog
  changelog:
    - version: "0.1.0"
      date: "2026-04-23"
      author: "claude"
      summary: "Initial robots-txt-format spec (#1807 Phase 1)"
      changes:
        - "Defined as policy-projection output format, sibling to llms-txt-format"
        - "Documented RFC 9309 grammar, well-known path, and ingest-only semantics"
        - "Captured production rule via produces: [robots-txt-format] edge on agent-policy-schema"

# =============================================================================
# SPECIFICATION CONTENT
# =============================================================================

spec:
  status: Draft

  # ---------------------------------------------------------------------------
  # Industry References
  # ---------------------------------------------------------------------------
  standards:
    - name: "RFC 9309 — Robots Exclusion Protocol"
      url: "https://www.rfc-editor.org/rfc/rfc9309.html"
      author: "M. Koster, G. Illyes, H. Zeller, L. Sassman"
      year: 2022
      description: |
        IETF standardization of the de facto robots.txt protocol. Defines
        the formal grammar, the matching rules, and the mandatory
        User-agent / Allow / Disallow directives.
    - name: "Robots Exclusion Protocol (original)"
      url: "https://www.robotstxt.org/orig.html"
      author: "Martijn Koster"
      year: 1994
      description: "Original 1994 specification, widely implemented for decades before formal IETF standardization."

  # ---------------------------------------------------------------------------
  # Format Definition
  # ---------------------------------------------------------------------------
  format:
    name: "robots.txt"
    mime_type: "text/plain; charset=utf-8"
    extensions:
      - ".txt"
    encoding: "utf-8"
    round_trippable: false
    category: policy-projection  # Sibling of aggregate-corpus (llms-txt)

  # ---------------------------------------------------------------------------
  # File Structure (RFC 9309)
  # ---------------------------------------------------------------------------
  file_structure:
    description: |
      A robots.txt file is a sequence of records, each starting with one or
      more `User-agent:` lines followed by one or more `Allow:` / `Disallow:`
      directives. Records are separated by blank lines. `Sitemap:` directives
      are global and may appear anywhere.
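
      As a minimal illustration of the record grammar (the agent name and
      URL below are placeholders, not part of this spec; the full rendered
      file appears under `spec.examples`):

          User-agent: ExampleBot
          Disallow: /private/

          User-agent: *
          Allow: /

          Sitemap: https://example.com/sitemap.xml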
    record:
      required:
        - "One or more `User-agent:` lines"
        - "One or more `Allow:` or `Disallow:` directives"
      optional:
        - "Comment lines starting with `#`"
    global_directives:
      - "Sitemap:"
    matching_rules: |
      - User-agent matching is case-INsensitive (RFC 9309 § 2.2.1)
      - Path matching is case-SENSITIVE (RFC 9309 § 2.2.2)
      - The most specific User-agent token wins (longest match)
      - Within a record, the longest matching directive wins
      - The `*` user-agent matches any agent not otherwise listed

  # ---------------------------------------------------------------------------
  # Production Rules
  # ---------------------------------------------------------------------------
  production:
    rule: |
      A schema declares:

        produces:
          - xri: kno://specs/robots-txt-format
            reason: "Crawler directives materialized via thin transformation"

      The renderer reads the policy entity and emits:

      1. A header comment naming the source entity (XRI + ULID)
      2. One record per `agents[]` entry (User-agent + Allow/Disallow)
      3. A default record for `*` driven by `default_policy`
      4. An optional `Sitemap:` directive from `sitemap_url`
      5. An LLM-discovery comment pointing at `llms_txt_url` (if set)
    determinism: |
      Output MUST be deterministic given the same input policy. Agent order
      in the rendered file MUST match the order in `agents[]`. This makes
      the file content-addressable for caching.
    well_known_path: "/robots.txt"

  # ---------------------------------------------------------------------------
  # Serving
  # ---------------------------------------------------------------------------
  serving:
    well_known_path: "/robots.txt"
    content_type: "text/plain; charset=utf-8"
    cache_control: |
      robots.txt SHOULD be cacheable. RFC 9309 § 2.4 allows crawlers to
      cache it for up to 24 hours. Producers SHOULD set a Cache-Control
      header and a content-hash ETag.

  # ---------------------------------------------------------------------------
  # Example Output
  # ---------------------------------------------------------------------------
  examples:
    - title: "robots.txt for Possibility (excerpt)"
      description: "Illustrative output from `content/agent-policy.kno`"
      code: |
        # robots.txt — Possibility
        # Source: kno://content/agent-policy • Identity: 01KPY2...
        # Generated from agent-policy-schema via thin transformation
        # See https://possibility.space/docs/agent-discovery for context

        # On-demand agents (user-initiated fetches): ALLOWED
        User-agent: ChatGPT-User
        Allow: /

        User-agent: Claude-User
        Allow: /

        User-agent: PerplexityBot
        Allow: /

        # Training crawlers: BLOCKED (conservative default)
        User-agent: GPTBot
        Disallow: /

        User-agent: ClaudeBot
        Disallow: /

        User-agent: CCBot
        Disallow: /

        User-agent: Google-Extended
        Disallow: /

        # Search engines: BLOCKED until SEO ready
        User-agent: Googlebot
        Disallow: /

        User-agent: Bingbot
        Disallow: /

        # Default: deny everything else
        User-agent: *
        Disallow: /

        # LLM corpus index (well-known: llmstxt.org)
        # https://possibility.space/llms.txt
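    - title: "Renderer sketch (TypeScript, illustrative)"
      description: |
        A minimal sketch of the thin transformation described under
        `spec.production`, assuming a simplified policy shape. The
        `AgentPolicy` / `AgentRule` field names are placeholders, not the
        canonical agent-policy-schema; only the emission order (header
        comment, per-agent records, `*` default, Sitemap, llms.txt pointer)
        and the determinism rule come from this spec.
      code: |
        // Field names below are illustrative assumptions.
        interface AgentRule {
          user_agent: string;                 // e.g. "GPTBot"
          policy: "allow" | "disallow";
        }

        interface AgentPolicy {
          source_xri: string;                 // e.g. "kno://content/agent-policy"
          agents: AgentRule[];
          default_policy: "allow" | "disallow";
          sitemap_url?: string;
          llms_txt_url?: string;
        }

        export function renderRobotsTxt(policy: AgentPolicy): string {
          const lines: string[] = [];
          const record = (ua: string, p: "allow" | "disallow") => {
            lines.push(`User-agent: ${ua}`);
            lines.push(p === "allow" ? "Allow: /" : "Disallow: /", "");
          };

          // 1. Header comment naming the source entity
          lines.push(`# Source: ${policy.source_xri}`, "");
          // 2. One record per agents[] entry, in declared order (determinism)
          for (const agent of policy.agents) record(agent.user_agent, agent.policy);
          // 3. Default record for *
          record("*", policy.default_policy);
          // 4. Optional global Sitemap directive
          if (policy.sitemap_url) lines.push(`Sitemap: ${policy.sitemap_url}`, "");
          // 5. LLM-discovery comment
          if (policy.llms_txt_url) lines.push("# LLM corpus index", `# ${policy.llms_txt_url}`);
          return lines.join("\n").trimEnd() + "\n";
        }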

  # ---------------------------------------------------------------------------
  # Notes
  # ---------------------------------------------------------------------------
  notes: |
    ## Why model robots.txt as a .kno-driven artifact?

    Most sites maintain a static `public/robots.txt`. That works, but:

    - The policy is invisible to other systems (X-Ray, MCP, validation)
    - Adding a new crawler requires an ad-hoc file edit with no schema check
    - Per-tenant overrides require a per-tenant deploy artifact
    - There's no audit trail beyond git blame

    Modeling agent policy as a `.kno` entity makes the policy first-class:

    - Schema validation catches typos in UA tokens or invalid policy values
    - The `produces:` edge makes regeneration automatic
    - X-Ray can show the source policy behind the rendered robots.txt
    - Future per-tenant policies inherit the same schema and renderer

    ## Relationship to llms.txt

    | Aspect | robots.txt | llms.txt |
    |---|---|---|
    | Audience | Crawlers (policy enforcement) | LLM aggregators (content ingest) |
    | Consumer reads it | Before any fetch | When building a corpus |
    | Format | RFC 9309 directives | Markdown corpus |
    | Producer in Possibility | agent-policy-schema | guide-schema (and other collections) |
    | Round-trippable | No | No |

    Both belong at well-known paths and both are content-addressable
    projections of richer .kno sources.

contains:
  - xri: "#identity"
    role: section
    title: "Schema Metadata"
    keywords: [ id, type, version ]
  - xri: "#spec/file_structure"
    role: section
    title: "File Structure"
    keywords: [ user-agent, allow, disallow, sitemap ]
  - xri: "#spec/production"
    role: section
    title: "Production Rules"
    keywords: [ produces, thin-transformation, determinism ]
  - xri: "#spec/serving"
    role: section
    title: "Serving"
    keywords: [ well-known, robots.txt, content-type ]

_index:
  - path: "identity"
    line: 28
    keywords: [ id, robots-txt-format, policy-projection ]
  - path: "spec/file_structure"
    line: 165
    keywords: [ record, user-agent, allow, disallow ]
  - path: "spec/production"
    line: 195
    keywords: [ produces, thin-transformation ]
  - path: "spec/serving"
    line: 225
    keywords: [ well-known-path, content-type, cache-control ]