LLM-Ready Bridge Schema

Purpose

patterns.yml is a mined evidence artifact. It is useful, but still too raw to act as a stable contract for:

  • template systems
  • generation instructions
  • writer guidance
  • structural analysis
  • output validation

This bridge schema defines a second artifact that sits between mined patterns and downstream LLM tasks. The miner stays deterministic, and the interpretation that follows it becomes an explicit, reviewable stage.

The implemented compiler produces a first-pass version of this schema. It fills structural and evidentiary fields automatically and adds review blocks wherever human or LLM interpretation is still needed.

Recommended filename:

compiled-pattern-spec.yml

Recommended flow:

source corpus
  -> patterns.yml
  -> compiled-pattern-spec.yml
  -> template / guide / prompt / validator outputs

Design Goals

  1. Preserve traceability back to mined evidence.
  2. Give the LLM named, typed, and role-labeled structures instead of only signatures.
  3. Separate observed facts from interpretive guidance.
  4. Support both human-authored and machine-generated outputs.
  5. Make validation possible after generation.

Top-Level Shape

schema_version: "0.1"
spec_id: "spec-<corpus-or-work-id>"
title: "<human title>"
intent: "<what this spec is for>"

source:
  pattern_library_path: "patterns.yml"
  corpus_label: "<name>"
  source_kind: "single_document | corpus"
  mined_at: "ISO-8601 timestamp"
  min_support: 2
  notes: []

corpus_profile: {}
pattern_catalog: []
template_systems: []
writer_guides: []
generation_programs: []
analysis_observations: []
validation_contracts: []
prompt_assets: []

Wherever an entry still needs human or LLM interpretation, the compiler attaches a review block in this form:

review:
  status: "needs_review"
  fields: ["label", "summary", "intent"]
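
As a minimal sketch of how a curation tool might consume this convention, the function below walks an already-loaded spec (plain Python dicts and lists) and reports every entry whose review block is still pending. The function name, the path format, and the "approved" status value are assumptions made for illustration; the schema above only defines "needs_review".

```python
from typing import Any, Iterator, List, Tuple


def iter_needs_review(node: Any, path: str = "") -> Iterator[Tuple[str, List[str]]]:
    """Yield (path, fields) for every mapping that carries a pending review block."""
    if isinstance(node, dict):
        review = node.get("review")
        if isinstance(review, dict) and review.get("status") == "needs_review":
            yield path or "<root>", list(review.get("fields", []))
        for key, value in node.items():
            if key != "review":
                child = f"{path}.{key}" if path else key
                yield from iter_needs_review(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_needs_review(item, f"{path}[{index}]")


# Hypothetical spec fragment; "approved" is an assumed status, not part of the schema.
spec = {
    "pattern_catalog": [
        {"id": "pat-para-2sent",
         "review": {"status": "needs_review", "fields": ["label", "summary", "intent"]}},
        {"id": "pat-para-1sent",
         "review": {"status": "approved", "fields": []}},
    ]
}

pending = list(iter_needs_review(spec))
for location, fields in pending:
    print(f"{location}: review {', '.join(fields)}")
```

A report like this gives curators a worklist without touching the evidentiary fields the compiler filled in.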

Section Definitions

1. source

Links the bridge spec to the mined artifact.

source:
  pattern_library_path: "patterns.yml"
  corpus_label: "Azure Stack corpus"
  source_kind: "corpus"
  mined_at: "2026-02-28T15:30:00Z"
  min_support: 2
  notes:
    - "Compiled from mined structural patterns plus manual interpretation."

Fields:

  • pattern_library_path: path to mined YAML.
  • corpus_label: human-readable corpus name.
  • source_kind: single_document or corpus.
  • mined_at: timestamp of the mined artifact used.
  • min_support: copied from patterns.yml.
  • notes: provenance notes.
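
The field list above can be checked mechanically before a spec is accepted. The sketch below is one possible check, not the compiler's actual behavior: it assumes notes is optional, that min_support must be a positive integer, and that mined_at is an ISO-8601 string. The function and constant names are hypothetical.

```python
from datetime import datetime

ALLOWED_SOURCE_KINDS = {"single_document", "corpus"}
REQUIRED_SOURCE_FIELDS = (
    "pattern_library_path", "corpus_label", "source_kind", "mined_at", "min_support",
)


def check_source(source: dict) -> list:
    """Return a list of human-readable problems; an empty list means the block looks sound."""
    problems = [f"missing field: {name}" for name in REQUIRED_SOURCE_FIELDS if name not in source]

    kind = source.get("source_kind")
    if kind is not None and kind not in ALLOWED_SOURCE_KINDS:
        problems.append(f"source_kind must be one of {sorted(ALLOWED_SOURCE_KINDS)}, got {kind!r}")

    mined_at = source.get("mined_at")
    if mined_at is not None:
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
            datetime.fromisoformat(str(mined_at).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"mined_at is not an ISO-8601 timestamp: {mined_at!r}")

    min_support = source.get("min_support")
    if min_support is not None:
        is_int = isinstance(min_support, int) and not isinstance(min_support, bool)
        if not is_int or min_support < 1:
            problems.append("min_support must be a positive integer")

    return problems


good = {
    "pattern_library_path": "patterns.yml",
    "corpus_label": "Azure Stack corpus",
    "source_kind": "corpus",
    "mined_at": "2026-02-28T15:30:00Z",
    "min_support": 2,
}
bad = {"source_kind": "folder", "mined_at": "yesterday", "min_support": 0}

print(check_source(good))
print(check_source(bad))
```

Note that the placeholder values in the top-level shape (for example "single_document | corpus") would fail this check, which is the intended effect for an unfilled template.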

2. corpus_profile

Summarizes the dominant structural tendencies of the corpus.

corpus_profile:
  dominant_levels:
    - level: 3
      label: "paragraph"
      rationale: "Most reusable variation occurs at paragraph level."
  dominant_patterns:
    - pattern_ref: "pat-para-2sent"
      reason: "High support and strong reuse across documents."
  preferred_section_shapes:
    - "tmpl-three-part-section"
  rhythm_summary:
    short_description: "Short paragraphs, mostly one to two sentences, with clauses chained into compact procedural units."
    observations:
      - "Sections are usually assembled from two to four paragraph blocks."
      - "Sentences tend to resolve in two or three phrase units."
  lexical_style:
    tone: ["instructional", "compressed", "domain-specific"]
    common_terms: ["vm", "cluster", "node"]

Use this for:

  • structural observations
  • corpus summary
  • system-level prompt setup

3. pattern_catalog

This is the core enrichment layer. Each entry maps one or more mined patterns to an interpretable compositional unit.

pattern_catalog:
  - id: "pat-para-2sent"
    source_patterns:
      - pattern_id: "P-3-1263"
        level: 3
        signature: "S S"
        support: 42
    label: "Two-Sentence Paragraph"
    kind: "structural_unit"
    level: 3
    confidence: 0.87
    status: "observed"
    summary: "A compact paragraph that introduces a fact and immediately extends or qualifies it."
    intent:
      functional: "Delivers one bounded idea without digression."
      rhetorical: "Maintains density while preserving readability."
    structure:
      canonical_signature: "S S"
      child_refs:
        - token: "S"
          child_pattern_refs: ["pat-sent-2clause", "pat-sent-3clause"]
          expected_count: 2
      ordering_rules:
        - "Sentence 1 introduces the topic or condition."
        - "Sentence 2 clarifies, qualifies, or extends."
      optionality:
        min_children: 2
        max_children: 2
    semantics:
      role_candidates: ["statement", "qualification", "procedure-step", "observation"]
      discourse_moves:
        - "introduce -> specify"
        - "state -> elaborate"
    evidence:
      support: 42
      support_rank: 3
      representative_examples:
        - source: "docs/a.md"
          lines: [14, 15]
          excerpt: "..."
      lexical_signals:
        top_terms: ["cluster", "vm", "node"]
      distribution:
        appears_in_templates: ["tmpl-three-part-section"]
    slots:
      - id: "slot_subject"
        description: "Main entity being described or acted on."
        required: true
      - id: "slot_action"
        description: "Primary action, state, or condition."
        required: true
      - id: "slot_qualifier"
        description: "Constraint, condition, or elaboration."
        required: false
    guidance:
      do:
        - "Keep the paragraph focused on one topic."
        - "Use the second sentence to sharpen, not redirect."
      avoid:
        - "Adding a third sentence unless the template explicitly allows expansion."
        - "Changing subject matter between sentence 1 and sentence 2."

Required fields:

  • id: stable bridge ID, independent of mined pattern ID format.
  • source_patterns: one or more supporting mined patterns.
  • label: human-readable unit name.
  • kind: structural_unit, rhetorical_unit, or lexical_unit.
  • level: structural level.
  • summary: concise explanation.
  • structure.canonical_signature: normalized signature used downstream.
  • evidence.support: mined support count.

Important distinction:

  • status: observed means directly supported by mining.
  • status: interpreted means a curator- or LLM-derived label has been added on top of the mined evidence.
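
The required-field list and the observed/interpreted distinction lend themselves to a small structural check. The following is a sketch under assumptions: entries are already loaded as Python dicts, nested fields are addressed with dotted paths, and the helper names are invented for this example.

```python
REQUIRED_FIELDS = [
    "id", "source_patterns", "label", "kind", "level", "summary",
    "structure.canonical_signature", "evidence.support",
]
ALLOWED_KINDS = {"structural_unit", "rhetorical_unit", "lexical_unit"}
ALLOWED_STATUSES = {"observed", "interpreted"}

_MISSING = object()  # sentinel, so that explicit null values still count as present


def lookup(entry: dict, dotted_path: str):
    """Follow a dotted path such as 'evidence.support' through nested dicts."""
    node = entry
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def check_catalog_entry(entry: dict) -> list:
    """Return problems found in one pattern_catalog entry."""
    problems = [
        f"missing required field: {path}"
        for path in REQUIRED_FIELDS
        if lookup(entry, path) is _MISSING
    ]
    if "source_patterns" in entry and not entry["source_patterns"]:
        problems.append("source_patterns must list at least one mined pattern")
    if "kind" in entry and entry["kind"] not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {entry['kind']!r}")
    if "status" in entry and entry["status"] not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {entry['status']!r}")
    return problems


entry = {
    "id": "pat-para-2sent",
    "source_patterns": [{"pattern_id": "P-3-1263", "level": 3, "signature": "S S", "support": 42}],
    "label": "Two-Sentence Paragraph",
    "kind": "structural_unit",
    "level": 3,
    "status": "observed",
    "summary": "A compact paragraph that introduces a fact and immediately extends or qualifies it.",
    "structure": {"canonical_signature": "S S"},
    "evidence": {"support": 42},
}
print(check_catalog_entry(entry))
```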

4. template_systems

Defines reusable document or section templates assembled from catalog patterns.

template_systems:
  - id: "tmpl-three-part-section"
    label: "Three-Part Expository Section"
    scope: "section"
    purpose: "Reusable section shape for explanatory technical prose."
    status: "compiled"
    based_on:
      pattern_refs: ["pat-para-3part"]
    composition:
      source_pattern_ref: "pat-para-3part"
      ordered_slots:
        - position: 1
          token: "PARA"
          child_level: 3
          candidate_pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
          required: true
        - position: 2
          token: "PARA"
          child_level: 3
          candidate_pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
          required: true
        - position: 3
          token: "PARA"
          child_level: 3
          candidate_pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
          required: true
      optional_pattern_refs:
        - "pat-list-procedure"
      repetition_rules:
        - "Allow the middle paragraph pattern to repeat up to 2 additional times."
    semantic_roles:
      - role: "opening"
        pattern_ref: "pat-para-1sent"
      - role: "development"
        pattern_ref: "pat-para-2sent"
      - role: "closure"
        pattern_ref: "pat-para-2sent"
    slots:
      - id: "topic"
        fill_with: "domain entity, process, or decision"
      - id: "constraint"
        fill_with: "limitation, requirement, or condition"
      - id: "outcome"
        fill_with: "result, implication, or next action"
    generation_rules:
      required_roles: ["opening", "development", "closure"]
      ordering_fixed: true
      allow_variation:
        lexical: true
        sentence_count: false
        paragraph_count: true
    output_targets:
      - "template"
      - "writer_guide"
      - "llm_prompt"
      - "validator"

Use this for:

  • document templates
  • section templates
  • compositional recipes

The compiler uses ordered_slots instead of a single hard-coded ordered_pattern_refs list so it can preserve uncertainty where the miner only supports a set of likely child candidates.
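
To make the effect of ordered_slots concrete, the sketch below enumerates every pattern sequence a template admits. It is an illustration rather than the compiler's own logic: it assumes a template loaded as a Python dict, treats a slot with required: false as skippable (represented by None), and ignores repetition_rules. The function name is hypothetical.

```python
from itertools import product


def candidate_sequences(template: dict) -> list:
    """List every combination of candidate patterns, one choice per ordered slot."""
    slots = sorted(template["composition"]["ordered_slots"], key=lambda slot: slot["position"])
    options = []
    for slot in slots:
        choices = list(slot["candidate_pattern_refs"])
        if not slot.get("required", True):
            choices.append(None)  # assumption: an optional slot may be left empty
        options.append(choices)
    return list(product(*options))


template = {
    "id": "tmpl-three-part-section",
    "composition": {
        "ordered_slots": [
            {"position": position, "token": "PARA", "child_level": 3,
             "candidate_pattern_refs": ["pat-para-1sent", "pat-para-2sent"],
             "required": True}
            for position in (1, 2, 3)
        ]
    },
}

sequences = candidate_sequences(template)
print(len(sequences), "admissible sequences")
print(sequences[0])
```

With three required slots and two candidates each, the template admits eight sequences; semantic_roles then narrows that set to the preferred reading.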


5. writer_guides

Translates pattern/template structure into direct instructions for people.

writer_guides:
  - id: "guide-expository-section"
    label: "How to Write a Standard Expository Section"
    audience: "human_writer"
    based_on:
      template_refs: ["tmpl-three-part-section"]
      pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
    instructions:
      - step: 1
        instruction: "Open with a single-sentence paragraph naming the topic."
      - step: 2
        instruction: "Follow with a two-sentence paragraph that adds detail or constraint."
      - step: 3
        instruction: "Close with a final paragraph that resolves the explanation or signals action."
    checks:
      required:
        - "Each paragraph should serve one role only."
      avoid:
        - "Do not merge opening and development into one long block."

Use this for:

  • manual writing guides
  • editorial onboarding
  • house-style instructions
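
A guide entry can be rendered into a plain-text checklist with very little code. This is a minimal sketch, assuming the guide is already loaded as a Python dict; the output layout and the function name are choices made for this example, not something the schema prescribes.

```python
def render_guide(guide: dict) -> str:
    """Turn one writer_guides entry into a plain-text checklist."""
    lines = [guide["label"], ""]
    for item in sorted(guide["instructions"], key=lambda entry: entry["step"]):
        lines.append(f"{item['step']}. {item['instruction']}")
    checks = guide.get("checks", {})
    for rule in checks.get("required", []):
        lines.append(f"Required: {rule}")
    for rule in checks.get("avoid", []):
        lines.append(f"Avoid: {rule}")
    return "\n".join(lines)


guide = {
    "id": "guide-expository-section",
    "label": "How to Write a Standard Expository Section",
    "instructions": [
        {"step": 2, "instruction": "Follow with a two-sentence paragraph that adds detail or constraint."},
        {"step": 1, "instruction": "Open with a single-sentence paragraph naming the topic."},
        {"step": 3, "instruction": "Close with a final paragraph that resolves the explanation or signals action."},
    ],
    "checks": {
        "required": ["Each paragraph should serve one role only."],
        "avoid": ["Do not merge opening and development into one long block."],
    },
}

text = render_guide(guide)
print(text)
```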

6. generation_programs

Defines how an LLM should generate new text from the compiled structure.

generation_programs:
  - id: "gen-standard-expository-section"
    label: "Generate a Section in Corpus Style"
    objective: "Produce new prose that follows the mined section logic without copying examples."
    inputs:
      required:
        - "topic"
        - "audience"
      optional:
        - "constraints"
        - "target_length"
    template_refs:
      - "tmpl-three-part-section"
    style_constraints:
      preserve_structure: true
      preserve_lexical_domain: true
      forbid_copying_examples: true
      allow_new_content: true
    instructions:
      system: |
        Write using the supplied template and pattern constraints.
        Preserve structural shape and rhetorical order.
        Do not quote or copy evidence examples.
      user: |
        Topic: {{topic}}
        Audience: {{audience}}
        Constraints: {{constraints}}
    output_contract:
      format: "markdown"
      sections:
        - role: "opening"
          expected_pattern_ref: "pat-para-1sent"
        - role: "development"
          expected_pattern_ref: "pat-para-2sent"
        - role: "closure"
          expected_pattern_ref: "pat-para-2sent"

Use this for:

  • corpus expansion
  • style-consistent drafting
  • controlled generation
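
The {{placeholder}} syntax in instructions.user implies a rendering step before the prompt reaches the model. The sketch below shows one way to do that with the standard library only. It assumes that a missing required input is an error and that a missing optional input renders as an empty string; both behaviors, and the function name, are assumptions.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_user_prompt(program: dict, values: dict) -> str:
    """Fill the {{name}} placeholders of a generation program's user instruction."""
    missing = [name for name in program["inputs"]["required"] if name not in values]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")
    template = program["instructions"]["user"]
    # Unknown or omitted optional placeholders become empty strings (an assumption).
    return PLACEHOLDER.sub(lambda match: str(values.get(match.group(1), "")), template)


program = {
    "id": "gen-standard-expository-section",
    "inputs": {"required": ["topic", "audience"], "optional": ["constraints", "target_length"]},
    "instructions": {
        "user": "Topic: {{topic}}\nAudience: {{audience}}\nConstraints: {{constraints}}\n",
    },
}

rendered = render_user_prompt(program, {"topic": "Node drain", "audience": "operators"})
print(rendered)
```

Failing fast on required inputs keeps an incomplete prompt from silently producing off-template text.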

7. analysis_observations

Stores higher-order claims grounded in mined evidence.

analysis_observations:
  - id: "obs-paragraph-dominance"
    claim: "Paragraph-level structure is more reusable than sentence-level wording."
    type: "structural_observation"
    evidence:
      pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
      support_summary: "The highest-value repeated units occur at level 3."
    implication:
      - "Template work should start at paragraph level."
      - "Writers should vary wording within a stable paragraph scaffold."

Use this for:

  • reports
  • strategic analysis
  • client-facing conclusions

8. validation_contracts

Defines how to check whether generated or manually written output conforms to the mined system.

validation_contracts:
  - id: "val-standard-expository-section"
    applies_to: ["tmpl-three-part-section", "gen-standard-expository-section"]
    template_scope: "section"
    structural_checks:
      - id: "paragraph_count"
        kind: "unit_count"
        rule: "3 <= paragraph_count <= 5"
        expected: 3
        severity: "error"
      - id: "opening_shape"
        kind: "slot_match"
        rule: "paragraph 1 matches pat-para-1sent"
        position: 1
        child_level: 3
        candidate_pattern_refs: ["pat-para-1sent"]
        severity: "error"
      - id: "development_shape"
        kind: "slot_match"
        rule: "at least 1 paragraph matches pat-para-2sent"
        position: 2
        child_level: 3
        candidate_pattern_refs: ["pat-para-2sent"]
        severity: "error"
    lexical_checks:
      - id: "domain_lexicon"
        kind: "contains_terms"
        rule: "contains at least 2 domain terms from corpus_profile.lexical_style.common_terms"
        terms: ["cluster", "node", "vm"]
        min_terms: 2
        severity: "warning"
    policy_checks:
      - id: "example_copying"
        kind: "max_example_overlap"
        rule: "generated text must not overlap any representative example by > 12 consecutive words"
        max_consecutive_words: 12
        severity: "error"
    scoring:
      pass_threshold: 0.85
      weights:
        structural: 0.6
        lexical: 0.2
        policy: 0.2

Use this for:

  • draft scoring
  • conformance checks
  • prompt iteration

The validator reads the machine-oriented fields such as kind, expected, position, candidate_pattern_refs, terms, and max_consecutive_words. The rule string remains for human readability.
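
The sketch below illustrates how three of the machine-oriented pieces could be evaluated: contains_terms, max_example_overlap, and the weighted score. It is not the validator's implementation. The tokenizer is deliberately naive (exact lowercase word matches, so plurals do not count), severity is ignored, and the scoring formula (weight times the fraction of passed checks per category) is an assumption, since the spec does not define how weights combine.

```python
import re


def words(text: str) -> list:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_terms(text: str, terms: list, min_terms: int) -> bool:
    """kind: contains_terms. True when at least min_terms distinct terms occur."""
    seen = set(words(text))
    return sum(1 for term in set(terms) if term.lower() in seen) >= min_terms


def longest_shared_run(text: str, example: str) -> int:
    """Length, in words, of the longest word sequence shared by both texts."""
    a, b = words(text), words(example)
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def within_overlap_limit(text: str, examples: list, max_consecutive_words: int) -> bool:
    """kind: max_example_overlap. True when no example shares a longer run."""
    return all(longest_shared_run(text, example) <= max_consecutive_words for example in examples)


def weighted_score(outcomes: dict, weights: dict) -> float:
    """Assumed formula: sum over categories of weight * fraction of checks passed."""
    total = 0.0
    for category, weight in weights.items():
        results = outcomes.get(category, [])
        passed = sum(results) / len(results) if results else 1.0
        total += weight * passed
    return total


draft = "Each node in the cluster hosts one vm. Drain the node before patching."
example = "Drain the node before patching the host."  # hypothetical evidence excerpt
outcomes = {
    "structural": [True, True, False],  # stand-in results for the three structural checks
    "lexical": [contains_terms(draft, ["cluster", "node", "vm"], 2)],
    "policy": [within_overlap_limit(draft, [example], 12)],
}
score = weighted_score(outcomes, {"structural": 0.6, "lexical": 0.2, "policy": 0.2})
passed = score >= 0.85
print(round(score, 2), passed)
```

Under this assumed formula, a single failed structural check out of three is enough to drop the draft below the 0.85 threshold, which matches the heavy structural weighting in the contract.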


9. prompt_assets

Optional prompt fragments built from the structured spec.

prompt_assets:
  - id: "prompt-pattern-summary"
    kind: "system_fragment"
    derived_from:
      pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
    text: |
      Use short, tightly bounded paragraphs.
      Prefer one- and two-sentence paragraph units.
      Let sentence two qualify or extend sentence one.

This keeps prompt engineering tied to explicit structural evidence and versioned in the repo, rather than maintained somewhere outside it.
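
One way to use these fragments is to assemble a system prompt from every asset of a given kind, refusing any fragment whose derived_from references no longer resolve. The sketch below assumes a spec loaded as Python dicts; the function name and the choice to raise on unknown references are assumptions.

```python
def build_system_prompt(spec: dict, kind: str = "system_fragment") -> str:
    """Join all prompt_assets of one kind, checking that their pattern refs resolve."""
    catalog_ids = {entry["id"] for entry in spec.get("pattern_catalog", [])}
    parts = []
    for asset in spec.get("prompt_assets", []):
        if asset.get("kind") != kind:
            continue
        refs = asset.get("derived_from", {}).get("pattern_refs", [])
        unknown = [ref for ref in refs if ref not in catalog_ids]
        if unknown:
            raise ValueError(f"{asset['id']}: unknown pattern refs {unknown}")
        parts.append(asset["text"].strip())
    return "\n\n".join(parts)


spec = {
    "pattern_catalog": [{"id": "pat-para-1sent"}, {"id": "pat-para-2sent"}],
    "prompt_assets": [
        {
            "id": "prompt-pattern-summary",
            "kind": "system_fragment",
            "derived_from": {"pattern_refs": ["pat-para-1sent", "pat-para-2sent"]},
            "text": "Use short, tightly bounded paragraphs.\n"
                    "Prefer one- and two-sentence paragraph units.\n"
                    "Let sentence two qualify or extend sentence one.\n",
        }
    ],
}

system_prompt = build_system_prompt(spec)
print(system_prompt)
```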


Authoring Rules

To keep this bridge artifact reliable:

  1. Every entry in pattern_catalog must reference at least one mined pattern.
  2. Every template must reference pattern catalog IDs, not raw mined signatures alone.
  3. Every observation must cite pattern_refs or template_refs.
  4. Every generation program should be covered by at least one validation contract, that is, listed in that contract's applies_to.
  5. Interpretive labels should be additive; do not overwrite mined evidence.
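
Rules 1 through 4 are referential and can be linted automatically; rule 5 concerns editing discipline and is not checked here. The sketch below is one possible lint pass over a spec loaded as Python dicts. It reads rule 4 as "the program's id appears in some contract's applies_to", which matches the example contract above but is still an interpretation, and the function name is hypothetical.

```python
def lint_spec(spec: dict) -> list:
    """Return authoring-rule violations for rules 1-4 as readable strings."""
    problems = []
    catalog_ids = {entry["id"] for entry in spec.get("pattern_catalog", [])}

    # Rule 1: every catalog entry references at least one mined pattern.
    for entry in spec.get("pattern_catalog", []):
        if not entry.get("source_patterns"):
            problems.append(f"{entry['id']}: no mined source_patterns")

    # Rule 2: every template references pattern catalog IDs.
    for template in spec.get("template_systems", []):
        refs = template.get("based_on", {}).get("pattern_refs", [])
        if not refs:
            problems.append(f"{template['id']}: no pattern_refs")
        for ref in refs:
            if ref not in catalog_ids:
                problems.append(f"{template['id']}: unknown pattern ref {ref}")

    # Rule 3: every observation cites pattern_refs or template_refs.
    for observation in spec.get("analysis_observations", []):
        evidence = observation.get("evidence", {})
        if not (evidence.get("pattern_refs") or evidence.get("template_refs")):
            problems.append(f"{observation['id']}: no evidence refs")

    # Rule 4: every generation program is covered by a validation contract.
    covered = {
        target
        for contract in spec.get("validation_contracts", [])
        for target in contract.get("applies_to", [])
    }
    for program in spec.get("generation_programs", []):
        if program["id"] not in covered:
            problems.append(f"{program['id']}: no validation contract applies")

    return problems


spec = {
    "pattern_catalog": [
        {"id": "pat-para-2sent", "source_patterns": [{"pattern_id": "P-3-1263"}]},
        {"id": "pat-para-1sent", "source_patterns": []},
    ],
    "template_systems": [
        {"id": "tmpl-three-part-section", "based_on": {"pattern_refs": ["pat-para-3part"]}},
    ],
    "analysis_observations": [
        {"id": "obs-paragraph-dominance", "evidence": {"pattern_refs": ["pat-para-2sent"]}},
    ],
    "generation_programs": [{"id": "gen-standard-expository-section"}],
    "validation_contracts": [],
}

for problem in lint_spec(spec):
    print(problem)
```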

Minimal Viable Spec

If you want to start small, the first useful version only needs:

schema_version: "0.1"
source: {}
corpus_profile: {}
pattern_catalog: []
template_systems: []
generation_programs: []
validation_contracts: []

That is enough to support:

  • named patterns
  • reusable templates
  • generation instructions
  • post-generation checks

The compiler module in this repo performs this step:

patterns.yml -> compiled-pattern-spec.yml

That compiler does two things:

  1. deterministically lift mined patterns into a richer graph of references
  2. leave labeled fields open for human or LLM curation where interpretation is needed
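
The two responsibilities can be sketched for a single mined pattern as follows. This is an illustration under assumptions, not the repo's compiler: the mined record is assumed to carry pattern_id, level, signature, and support (the fields shown under source_patterns), the ID scheme is invented, and interpretive fields are left as None with a review block attached.

```python
import re


def lift_pattern(mined: dict) -> dict:
    """Deterministically lift one mined pattern into a first-pass catalog entry."""
    # Hypothetical ID scheme: level plus a slug of the signature, e.g. "pat-l3-s-s".
    slug = re.sub(r"[^a-z0-9]+", "-", mined["signature"].lower()).strip("-")
    return {
        "id": f"pat-l{mined['level']}-{slug}",
        "source_patterns": [dict(mined)],  # copied unchanged to preserve traceability
        "label": None,    # interpretive, left open for curation
        "kind": "structural_unit",
        "level": mined["level"],
        "status": "observed",
        "summary": None,  # interpretive, left open for curation
        "intent": None,   # interpretive, left open for curation
        "structure": {"canonical_signature": mined["signature"]},
        "evidence": {"support": mined["support"]},
        "review": {"status": "needs_review", "fields": ["label", "summary", "intent"]},
    }


mined = {"pattern_id": "P-3-1263", "level": 3, "signature": "S S", "support": 42}
entry = lift_pattern(mined)
print(entry["id"], entry["review"]["fields"])
```

Because the function uses only its input, running it twice on the same mined pattern yields the same entry, which keeps the compile step reproducible while the review block marks exactly what still needs a human or LLM pass.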

This keeps the system hybrid:

  • miner = evidence extraction
  • bridge schema = structured interpretation
  • LLM = controlled synthesis