# robots-txt-format — Crawler Directive Format Specification
# KNO Schema Version: 0.1.0
#
# Describes the `robots.txt` format — the de facto standard (RFC 9309)
# for declaring crawler access policy at the site root.
#
# INDUSTRY CONTEXT:
# Originally proposed by Martijn Koster in 1994 ("Robots Exclusion
# Protocol"). Standardized as RFC 9309 in September 2022. Recognized
# by every major search crawler and by AI agent fetchers from OpenAI
# (GPTBot, ChatGPT-User, OAI-SearchBot), Anthropic (ClaudeBot,
# Claude-User, Claude-SearchBot), Google (Google-Extended), and others.
#
# POSSIBILITY POSITIONING:
# Produced by thin transformation from `agent-policy-schema`-conforming
# entities (see specs/agent-policy-schema.kno). The production rule IS
# the schema-graph edge `produces: [robots-txt-format]`. No conversion
# spec exists; the renderer reads the policy entity and emits the file.
#
# This is a sibling of `llms-txt-format` (specs/llms-txt-format.kno):
# both are "policy / discovery" artifacts at well-known paths consumed
# by automated agents. Both are ingest-only (not round-trippable).

# =============================================================================
# SCHEMA DECLARATION
# =============================================================================

$schema: kno@0.0.9

# =============================================================================
# BASIC TIER
# =============================================================================

id: 01KPY2N1SXNM9X0NV4N6EP3E87
slug: robots-txt-format
type: spec
version: 0.1.0

# =============================================================================
# STANDARD TIER
# =============================================================================

title: "robots.txt — Crawler Directive Format"

purpose: |
  Define `robots-txt-format` as a policy-projection output format for
  .kno systems.

  **What is robots.txt?** A plain-text file served at `/robots.txt` (the
  site root) declaring which crawlers may access which paths. Standardized
  as RFC 9309 (September 2022). Every major crawler — search engines, AI
  agents, social media unfurlers — checks this file before fetching.

  **Why a distinct format?** robots.txt is a **policy projection**, not a
  content artifact. It serves a different consumer (crawlers) than llms.txt
  (LLM aggregators) or sitemap.xml (URL index). Modeling it as a first-class
  format makes the producer relationship explicit: `agent-policy-schema`
  declares `produces: [robots-txt-format]` and the renderer is a thin
  transformation.

  **Ingest-only:** Like llms.txt, robots.txt is NOT round-trippable. It is
  a one-way projection from a structured policy entity into a flat
  directive file. Parsing robots.txt back into a policy entity is out of
  scope (and would discard the rich `vendor` / `notes` / `role` metadata
  that exists only in the source).

  **Possibility's implementation:** `content/agent-policy.kno` is the
  single canonical instance. The Astro route at
  `services/pspace-site/src/pages/robots.txt.ts` reads it and emits the
  RFC 9309 text. There is no static `public/robots.txt`.
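
  In rough outline, the route is a thin wrapper around the renderer
  described under `spec.production`; a fuller sketch of that renderer
  appears under `spec.examples`. The code below is illustrative only: the
  `loadAgentPolicy` / `renderRobotsTxt` helpers and their import paths are
  assumed names, not the actual code in the repository.

  ```ts
  import type { APIRoute } from "astro";
  import { createHash } from "node:crypto";
  // Assumed helpers: load the policy entity, render the RFC 9309 text.
  import { loadAgentPolicy } from "../lib/agent-policy";
  import { renderRobotsTxt } from "../lib/robots-render";

  export const GET: APIRoute = async () => {
    const policy = await loadAgentPolicy("content/agent-policy.kno");
    const body = renderRobotsTxt(policy);
    return new Response(body, {
      headers: {
        "Content-Type": "text/plain; charset=utf-8",
        // Content-hash ETag keeps the deterministic projection
        // cache-friendly (see spec.serving).
        ETag: `"${createHash("sha256").update(body).digest("hex")}"`,
        "Cache-Control": "public, max-age=86400",
      },
    });
  };
  ```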
# =============================================================================
# RICH TIER
# =============================================================================

provenance:
  origin:
    id: 01KPY2N1SXNM9X0NV4N6EP3E87
    timestamp: "2026-04-23T00:00:00Z"
    tool: manual-authoring
    issue: "https://github.com/PossibilityTruthy/possibility-space/issues/1807"

taxonomy:
  topics:
    - file-formats
    - crawler-directives
    - site-policy
    - well-known-paths
    - ai-agents
  keywords:
    - robots.txt
    - rfc-9309
    - user-agent
    - disallow
    - allow
    - sitemap
    - crawler

relationships:
  depends_on:
    - xri: "kno://specs/kno-spec"
      reason: "Conforms to KNO format specification"
  related_to:
    - xri: "kno://specs/llms-txt-format"
      reason: "Sibling well-known artifact for AI agent discovery"
    - xri: "kno://specs/agent-policy-schema"
      reason: "Primary producer of robots.txt artifacts in Possibility"
  enables:
    - xri: "kno://capabilities/crawler-policy-enforcement"
      reason: "Standard mechanism for declaring crawler access policy"

quality:
  completeness: 0.85
  last_reviewed: "2026-04-23"
  review_status: draft
  reviewed_by: "claude"

# =============================================================================
# HISTORY
# =============================================================================

_history:
  retention: full
  format: changelog
  changelog:
    - version: "0.1.0"
      date: "2026-04-23"
      author: "claude"
      summary: "Initial robots-txt-format spec (#1807 Phase 1)"
      changes:
        - "Defined as policy-projection output format, sibling to llms-txt-format"
        - "Documented RFC 9309 grammar, well-known path, and ingest-only semantics"
        - "Captured production rule via produces: [robots-txt-format] edge on agent-policy-schema"

# =============================================================================
# SPECIFICATION CONTENT
# =============================================================================

spec:
  status: Draft

  # ---------------------------------------------------------------------------
  # Industry References
  # ---------------------------------------------------------------------------
  standards:
    - name: "RFC 9309 — Robots Exclusion Protocol"
      url: "https://www.rfc-editor.org/rfc/rfc9309.html"
      author: "M. Koster, G. Illyes, H. Zeller, L. Sassman"
      year: 2022
      description: |
        IETF standardization of the de facto robots.txt protocol. Defines
        the formal grammar, the matching rules, and the mandatory
        User-agent / Allow / Disallow directives.
    - name: "Robots Exclusion Protocol (original)"
      url: "https://www.robotstxt.org/orig.html"
      author: "Martijn Koster"
      year: 1994
      description: "Original 1994 specification, widely implemented for decades before formal IETF standardization."

  # ---------------------------------------------------------------------------
  # Format Definition
  # ---------------------------------------------------------------------------
  format:
    name: "robots.txt"
    mime_type: "text/plain; charset=utf-8"
    extensions:
      - ".txt"
    encoding: "utf-8"
    round_trippable: false
    category: policy-projection  # Sibling of aggregate-corpus (llms-txt)

  # ---------------------------------------------------------------------------
  # File Structure (RFC 9309)
  # ---------------------------------------------------------------------------
  file_structure:
    description: |
      A robots.txt file is a sequence of records, each starting with one or
      more `User-agent:` lines followed by one or more `Allow:` / `Disallow:`
      directives. Records are separated by blank lines. `Sitemap:` directives
      are global and may appear anywhere.
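
      As a minimal illustration of the record grammar (the agent name and
      URL below are placeholders, not part of this spec; the full rendered
      file appears under `spec.examples`):

          User-agent: ExampleBot
          Disallow: /private/

          User-agent: *
          Allow: /

          Sitemap: https://example.com/sitemap.xml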
    record:
      required:
        - "One or more `User-agent:` lines"
        - "One or more `Allow:` or `Disallow:` directives"
      optional:
        - "Comment lines starting with `#`"
    global_directives:
      - "Sitemap:"
    matching_rules: |
      - User-agent matching is case-INsensitive (RFC 9309 § 2.2.1)
      - Path matching is case-SENSITIVE (RFC 9309 § 2.2.2)
      - The most specific User-agent token wins (longest match)
      - Within a record, the longest matching directive wins
      - The `*` user-agent matches any agent not otherwise listed

  # ---------------------------------------------------------------------------
  # Production Rules
  # ---------------------------------------------------------------------------
  production:
    rule: |
      A schema declares:

        produces:
          - xri: kno://specs/robots-txt-format
            reason: "Crawler directives materialized via thin transformation"

      The renderer reads the policy entity and emits:

      1. A header comment naming the source entity (XRI + ULID)
      2. One record per `agents[]` entry (User-agent + Allow/Disallow)
      3. A default record for `*` driven by `default_policy`
      4. An optional `Sitemap:` directive from `sitemap_url`
      5. An LLM-discovery comment pointing at `llms_txt_url` (if set)
    determinism: |
      Output MUST be deterministic given the same input policy. Agent order
      in the rendered file MUST match the order in `agents[]`. This makes
      the file content-addressable for caching.
    well_known_path: "/robots.txt"

  # ---------------------------------------------------------------------------
  # Serving
  # ---------------------------------------------------------------------------
  serving:
    well_known_path: "/robots.txt"
    content_type: "text/plain; charset=utf-8"
    cache_control: |
      robots.txt SHOULD be cacheable. RFC 9309 § 2.4 allows crawlers to
      cache it for up to 24 hours. Producers SHOULD set a Cache-Control
      header and a content-hash ETag.

  # ---------------------------------------------------------------------------
  # Example Output
  # ---------------------------------------------------------------------------
  examples:
    - title: "robots.txt for Possibility (excerpt)"
      description: "Illustrative output from `content/agent-policy.kno`"
      code: |
        # robots.txt — Possibility
        # Source: kno://content/agent-policy • Identity: 01KPY2...
        # Generated from agent-policy-schema via thin transformation
        # See https://possibility.space/docs/agent-discovery for context

        # On-demand agents (user-initiated fetches): ALLOWED
        User-agent: ChatGPT-User
        Allow: /

        User-agent: Claude-User
        Allow: /

        User-agent: PerplexityBot
        Allow: /

        # Training crawlers: BLOCKED (conservative default)
        User-agent: GPTBot
        Disallow: /

        User-agent: ClaudeBot
        Disallow: /

        User-agent: CCBot
        Disallow: /

        User-agent: Google-Extended
        Disallow: /

        # Search engines: BLOCKED until SEO ready
        User-agent: Googlebot
        Disallow: /

        User-agent: Bingbot
        Disallow: /

        # Default: deny everything else
        User-agent: *
        Disallow: /

        # LLM corpus index (well-known: llmstxt.org)
        # https://possibility.space/llms.txt
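    - title: "Renderer sketch (TypeScript, illustrative)"
      description: |
        A minimal sketch of the thin transformation described under
        `spec.production`, assuming a simplified policy shape. The
        `AgentPolicy` / `AgentRule` field names are placeholders, not the
        canonical agent-policy-schema; only the emission order (header
        comment, per-agent records, `*` default, Sitemap, llms.txt pointer)
        and the determinism rule come from this spec.
      code: |
        // Field names below are illustrative assumptions.
        interface AgentRule {
          user_agent: string;                 // e.g. "GPTBot"
          policy: "allow" | "disallow";
        }

        interface AgentPolicy {
          source_xri: string;                 // e.g. "kno://content/agent-policy"
          agents: AgentRule[];
          default_policy: "allow" | "disallow";
          sitemap_url?: string;
          llms_txt_url?: string;
        }

        export function renderRobotsTxt(policy: AgentPolicy): string {
          const lines: string[] = [];
          const record = (ua: string, p: "allow" | "disallow") => {
            lines.push(`User-agent: ${ua}`);
            lines.push(p === "allow" ? "Allow: /" : "Disallow: /", "");
          };

          // 1. Header comment naming the source entity
          lines.push(`# Source: ${policy.source_xri}`, "");
          // 2. One record per agents[] entry, in declared order (determinism)
          for (const agent of policy.agents) record(agent.user_agent, agent.policy);
          // 3. Default record for *
          record("*", policy.default_policy);
          // 4. Optional global Sitemap directive
          if (policy.sitemap_url) lines.push(`Sitemap: ${policy.sitemap_url}`, "");
          // 5. LLM-discovery comment
          if (policy.llms_txt_url) lines.push("# LLM corpus index", `# ${policy.llms_txt_url}`);
          return lines.join("\n").trimEnd() + "\n";
        }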

  # ---------------------------------------------------------------------------
  # Notes
  # ---------------------------------------------------------------------------
  notes: |
    ## Why model robots.txt as a .kno-driven artifact?

    Most sites maintain a static `public/robots.txt`. That works, but:

    - The policy is invisible to other systems (X-Ray, MCP, validation)
    - Adding a new crawler requires an ad-hoc file edit with no schema check
    - Per-tenant overrides require a per-tenant deploy artifact
    - There's no audit trail beyond git blame

    Modeling agent policy as a `.kno` entity makes the policy first-class:

    - Schema validation catches typos in UA tokens or invalid policy values
    - The `produces:` edge makes regeneration automatic
    - X-Ray can show the source policy behind the rendered robots.txt
    - Future per-tenant policies inherit the same schema and renderer

    ## Relationship to llms.txt

    | Aspect | robots.txt | llms.txt |
    |---|---|---|
    | Audience | Crawlers (policy enforcement) | LLM aggregators (content ingest) |
    | Consumer reads it | Before any fetch | When building a corpus |
    | Format | RFC 9309 directives | Markdown corpus |
    | Producer in Possibility | agent-policy-schema | guide-schema (and other collections) |
    | Round-trippable | No | No |

    Both belong at well-known paths and both are content-addressable
    projections of richer .kno sources.

contains:
  - xri: "#identity"
    role: section
    title: "Schema Metadata"
    keywords: [ id, type, version ]
  - xri: "#spec/file_structure"
    role: section
    title: "File Structure"
    keywords: [ user-agent, allow, disallow, sitemap ]
  - xri: "#spec/production"
    role: section
    title: "Production Rules"
    keywords: [ produces, thin-transformation, determinism ]
  - xri: "#spec/serving"
    role: section
    title: "Serving"
    keywords: [ well-known, robots.txt, content-type ]

_index:
  - path: "identity"
    line: 28
    keywords: [ id, robots-txt-format, policy-projection ]
  - path: "spec/file_structure"
    line: 165
    keywords: [ record, user-agent, allow, disallow ]
  - path: "spec/production"
    line: 195
    keywords: [ produces, thin-transformation ]
  - path: "spec/serving"
    line: 225
    keywords: [ well-known-path, content-type, cache-control ]