LLM-Ready Bridge Schema
Purpose
patterns.yml is a mined evidence artifact. It is useful, but still too raw to act as a stable contract for:
- template systems
- generation instructions
- writer guidance
- structural analysis
- output validation
This bridge schema defines a second artifact that sits between mined patterns and downstream LLM tasks. It keeps the miner deterministic while making the next stage explicit.
The implemented compiler produces a first-pass version of this schema. It fills structural and evidentiary fields automatically and adds review blocks wherever human or LLM interpretation is still needed.
Recommended filename:
compiled-pattern-spec.yml
Recommended flow:
source corpus
-> patterns.yml
-> compiled-pattern-spec.yml
-> template / guide / prompt / validator outputs
Design Goals
- Preserve traceability back to mined evidence.
- Give the LLM named, typed, and role-labeled structures instead of only signatures.
- Separate observed facts from interpretive guidance.
- Support both human-authored and machine-generated outputs.
- Make validation possible after generation.
Top-Level Shape
schema_version: "0.1"
spec_id: "spec-<corpus-or-work-id>"
title: "<human title>"
intent: "<what this spec is for>"
source:
  pattern_library_path: "patterns.yml"
  corpus_label: "<name>"
  source_kind: "single_document | corpus"
  mined_at: "ISO-8601 timestamp"
  min_support: 2
  notes: []
corpus_profile: {}
pattern_catalog: []
template_systems: []
writer_guides: []
generation_programs: []
analysis_observations: []
validation_contracts: []
prompt_assets: []
Common compiler-added convention:
review:
  status: "needs_review"
  fields: ["label", "summary", "intent"]
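A minimal sketch of how a downstream tool might collect the entries that still carry an open review block. It assumes the spec has already been loaded into plain Python dicts and lists, and that a review block sits directly on the entry it describes. The helper name `pending_reviews` and the "approved" status value are hypothetical; only "needs_review" appears in this document.

```python
from typing import Any, Iterator


def pending_reviews(node: Any, path: str = "") -> Iterator[tuple[str, list[str]]]:
    """Yield (path, fields) for every mapping whose review.status is 'needs_review'."""
    if isinstance(node, dict):
        review = node.get("review")
        if isinstance(review, dict) and review.get("status") == "needs_review":
            yield path or "<root>", list(review.get("fields", []))
        for key, value in node.items():
            if key != "review":  # the review block itself is not searched
                yield from pending_reviews(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from pending_reviews(item, f"{path}[{index}]")


# Illustrative data only; "approved" is an assumed status value.
spec = {
    "pattern_catalog": [
        {"id": "pat-para-2sent",
         "review": {"status": "needs_review", "fields": ["label", "summary", "intent"]}},
        {"id": "pat-para-1sent", "review": {"status": "approved", "fields": []}},
    ]
}
found = list(pending_reviews(spec))
```

Walking the whole tree rather than a fixed list of sections means the same helper keeps working if the compiler starts adding review blocks to new sections.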
Section Definitions
1. source
Links the bridge spec to the mined artifact.
source:
  pattern_library_path: "patterns.yml"
  corpus_label: "Azure Stack corpus"
  source_kind: "corpus"
  mined_at: "2026-02-28T15:30:00Z"
  min_support: 2
  notes:
    - "Compiled from mined structural patterns plus manual interpretation."
Fields:
- pattern_library_path: path to mined YAML.
- corpus_label: human-readable corpus name.
- source_kind: single_document or corpus.
- mined_at: timestamp of the mined artifact used.
- min_support: copied from patterns.yml.
- notes: provenance notes.
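The field list above can be turned into a small sanity check. This is a sketch under stated assumptions: the document does not say which source fields are mandatory, so treating every field except notes as required is a guess, and `check_source` is a hypothetical helper name.

```python
from datetime import datetime

ALLOWED_SOURCE_KINDS = {"single_document", "corpus"}
# Assumption: everything except `notes` is treated as required.
REQUIRED_SOURCE_FIELDS = (
    "pattern_library_path", "corpus_label", "source_kind", "mined_at", "min_support",
)


def check_source(source: dict) -> list[str]:
    """Return a list of human-readable problems found in a `source` block."""
    problems = [f"missing field: {name}" for name in REQUIRED_SOURCE_FIELDS
                if name not in source]
    if source.get("source_kind") not in ALLOWED_SOURCE_KINDS:
        problems.append("source_kind must be single_document or corpus")
    try:
        # Older Python versions reject a trailing "Z", so normalize it first.
        datetime.fromisoformat(str(source.get("mined_at", "")).replace("Z", "+00:00"))
    except ValueError:
        problems.append("mined_at is not an ISO-8601 timestamp")
    if not isinstance(source.get("min_support"), int):
        problems.append("min_support must be an integer")
    return problems


example_source = {
    "pattern_library_path": "patterns.yml",
    "corpus_label": "Azure Stack corpus",
    "source_kind": "corpus",
    "mined_at": "2026-02-28T15:30:00Z",
    "min_support": 2,
    "notes": [],
}
source_problems = check_source(example_source)
```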
2. corpus_profile
Summarizes the dominant structural tendencies of the corpus.
corpus_profile:
  dominant_levels:
    - level: 3
      label: "paragraph"
      rationale: "Most reusable variation occurs at paragraph level."
  dominant_patterns:
    - pattern_ref: "pat-para-2sent"
      reason: "High support and strong reuse across documents."
  preferred_section_shapes:
    - "tmpl-three-part-section"
  rhythm_summary:
    short_description: "Short paragraphs, mostly one to two sentences, with clauses chained into compact procedural units."
    observations:
      - "Sections are usually assembled from two to four paragraph blocks."
      - "Sentences tend to resolve in two or three phrase units."
  lexical_style:
    tone: ["instructional", "compressed", "domain-specific"]
    common_terms: ["vm", "cluster", "node"]
Use this for:
- structural observations
- corpus summary
- system-level prompt setup
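As one illustration of the system-level prompt setup use, the sketch below flattens a corpus_profile into a short preamble. The function name `profile_preamble` and the "Rhythm / Tone / Domain terms" wording are assumptions made for the example, not part of the schema.

```python
def profile_preamble(profile: dict) -> str:
    """Build a short prompt preamble from rhythm_summary and lexical_style."""
    rhythm = profile.get("rhythm_summary", {})
    style = profile.get("lexical_style", {})
    lines = []
    if rhythm.get("short_description"):
        lines.append(f"Rhythm: {rhythm['short_description']}")
    if style.get("tone"):
        lines.append("Tone: " + ", ".join(style["tone"]))
    if style.get("common_terms"):
        lines.append("Domain terms: " + ", ".join(style["common_terms"]))
    return "\n".join(lines)


preamble = profile_preamble({
    "rhythm_summary": {"short_description": "Short paragraphs."},
    "lexical_style": {
        "tone": ["instructional", "compressed"],
        "common_terms": ["vm", "cluster", "node"],
    },
})
```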
3. pattern_catalog
This is the core enrichment layer. Each entry maps one or more mined patterns to an interpretable compositional unit.
pattern_catalog:
  - id: "pat-para-2sent"
    source_patterns:
      - pattern_id: "P-3-1263"
        level: 3
        signature: "S S"
        support: 42
    label: "Two-Sentence Paragraph"
    kind: "structural_unit"
    level: 3
    confidence: 0.87
    status: "observed"
    summary: "A compact paragraph that introduces a fact and immediately extends or qualifies it."
    intent:
      functional: "Delivers one bounded idea without digression."
      rhetorical: "Maintains density while preserving readability."
    structure:
      canonical_signature: "S S"
      child_refs:
        - token: "S"
          child_pattern_refs: ["pat-sent-2clause", "pat-sent-3clause"]
          expected_count: 2
      ordering_rules:
        - "Sentence 1 introduces the topic or condition."
        - "Sentence 2 clarifies, qualifies, or extends."
      optionality:
        min_children: 2
        max_children: 2
    semantics:
      role_candidates: ["statement", "qualification", "procedure-step", "observation"]
      discourse_moves:
        - "introduce -> specify"
        - "state -> elaborate"
    evidence:
      support: 42
      support_rank: 3
      representative_examples:
        - source: "docs/a.md"
          lines: [14, 15]
          excerpt: "..."
      lexical_signals:
        top_terms: ["cluster", "vm", "node"]
      distribution:
        appears_in_templates: ["tmpl-three-part-section"]
    slots:
      - id: "slot_subject"
        description: "Main entity being described or acted on."
        required: true
      - id: "slot_action"
        description: "Primary action, state, or condition."
        required: true
      - id: "slot_qualifier"
        description: "Constraint, condition, or elaboration."
        required: false
    guidance:
      do:
        - "Keep the paragraph focused on one topic."
        - "Use the second sentence to sharpen, not redirect."
      avoid:
        - "Adding a third sentence unless the template explicitly allows expansion."
        - "Changing subject matter between sentence 1 and sentence 2."
Required fields:
- id: stable bridge ID, independent of mined pattern ID format.
- source_patterns: one or more supporting mined patterns.
- label: human-readable unit name.
- kind: structural_unit, rhetorical_unit, or lexical_unit.
- level: structural level.
- summary: concise explanation.
- structure.canonical_signature: normalized signature used downstream.
- evidence.support: mined support count.
Important distinction:
- status: observed means directly supported by mining.
- status: interpreted means a curator- or LLM-derived label added on top of evidence.
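The required fields and the two status values can be checked mechanically. The sketch below assumes a catalog entry has been loaded as a plain dict; `lookup` and `check_catalog_entry` are hypothetical helper names, and treating an empty string or empty list as "missing" is a choice made for this example.

```python
REQUIRED_PATHS = (
    "id", "source_patterns", "label", "kind", "level", "summary",
    "structure.canonical_signature", "evidence.support",
)
ALLOWED_KINDS = {"structural_unit", "rhetorical_unit", "lexical_unit"}
ALLOWED_STATUSES = {"observed", "interpreted"}


def lookup(entry: dict, dotted: str):
    """Follow a dotted path such as 'evidence.support'; return None if absent."""
    node = entry
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def check_catalog_entry(entry: dict) -> list[str]:
    """Report missing required fields and unknown kind or status values."""
    problems = [f"missing {path}" for path in REQUIRED_PATHS
                if lookup(entry, path) in (None, "", [])]
    kind = entry.get("kind")
    if kind and kind not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {kind}")
    status = entry.get("status")
    if status and status not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {status}")
    return problems


good_entry = {
    "id": "pat-para-2sent",
    "source_patterns": [{"pattern_id": "P-3-1263", "level": 3,
                         "signature": "S S", "support": 42}],
    "label": "Two-Sentence Paragraph",
    "kind": "structural_unit",
    "level": 3,
    "status": "observed",
    "summary": "A compact paragraph.",
    "structure": {"canonical_signature": "S S"},
    "evidence": {"support": 42},
}
entry_problems = check_catalog_entry(good_entry)
```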
4. template_systems
Defines reusable document or section templates assembled from catalog patterns.
template_systems:
  - id: "tmpl-three-part-section"
    label: "Three-Part Expository Section"
    scope: "section"
    purpose: "Reusable section shape for explanatory technical prose."
    status: "compiled"
    based_on:
      pattern_refs: ["pat-para-3part"]
    composition:
      source_pattern_ref: "pat-para-3part"
      ordered_slots:
        - position: 1
          token: "PARA"
          child_level: 3
          candidate_pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
          required: true
        - position: 2
          token: "PARA"
          child_level: 3
          candidate_pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
          required: true
        - position: 3
          token: "PARA"
          child_level: 3
          candidate_pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
          required: true
      optional_pattern_refs:
        - "pat-list-procedure"
      repetition_rules:
        - "Allow the middle paragraph pattern to repeat up to 2 additional times."
    semantic_roles:
      - role: "opening"
        pattern_ref: "pat-para-1sent"
      - role: "development"
        pattern_ref: "pat-para-2sent"
      - role: "closure"
        pattern_ref: "pat-para-2sent"
    slots:
      - id: "topic"
        fill_with: "domain entity, process, or decision"
      - id: "constraint"
        fill_with: "limitation, requirement, or condition"
      - id: "outcome"
        fill_with: "result, implication, or next action"
    generation_rules:
      required_roles: ["opening", "development", "closure"]
      ordering_fixed: true
      allow_variation:
        lexical: true
        sentence_count: false
        paragraph_count: true
    output_targets:
      - "template"
      - "writer_guide"
      - "llm_prompt"
      - "validator"
Use this for:
- document templates
- section templates
- compositional recipes
The compiler uses ordered_slots instead of a single hard-coded ordered_pattern_refs list so it can preserve uncertainty where the miner only supports a set of likely child candidates.
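To make the difference concrete, the sketch below treats each slot as a set of allowed candidates. It enumerates the concrete orderings a template permits and tests whether an observed sequence fits. Both function names are hypothetical, and the example ignores repetition_rules and optional_pattern_refs for brevity.

```python
from itertools import product


def candidate_sequences(ordered_slots: list[dict]) -> list[list[str]]:
    """Enumerate every concrete pattern sequence the slots allow."""
    slots = sorted(ordered_slots, key=lambda slot: slot["position"])
    return [list(combo)
            for combo in product(*(slot["candidate_pattern_refs"] for slot in slots))]


def sequence_allowed(ordered_slots: list[dict], observed: list[str]) -> bool:
    """True when each observed pattern is a candidate for its slot position."""
    slots = sorted(ordered_slots, key=lambda slot: slot["position"])
    if len(observed) != len(slots):
        return False
    return all(ref in slot["candidate_pattern_refs"]
               for slot, ref in zip(slots, observed))


example_slots = [
    {"position": position, "token": "PARA", "child_level": 3,
     "candidate_pattern_refs": ["pat-para-1sent", "pat-para-2sent"], "required": True}
    for position in (1, 2, 3)
]
all_sequences = candidate_sequences(example_slots)
```

With three slots of two candidates each, the template stands for eight possible orderings. A single hard-coded list would have to pick one of them and discard the rest of the evidence.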
5. writer_guides
Translates pattern/template structure into direct instructions for people.
writer_guides:
  - id: "guide-expository-section"
    label: "How to Write a Standard Expository Section"
    audience: "human_writer"
    based_on:
      template_refs: ["tmpl-three-part-section"]
      pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
    instructions:
      - step: 1
        instruction: "Open with a single-sentence paragraph naming the topic."
      - step: 2
        instruction: "Follow with a two-sentence paragraph that adds detail or constraint."
      - step: 3
        instruction: "Close with a final paragraph that resolves the explanation or signals action."
    checks:
      required:
        - "Each paragraph should serve one role only."
      avoid:
        - "Do not merge opening and development into one long block."
Use this for:
- manual writing guides
- editorial onboarding
- house-style instructions
6. generation_programs
Defines how an LLM should generate new text from the compiled structure.
generation_programs:
  - id: "gen-standard-expository-section"
    label: "Generate a Section in Corpus Style"
    objective: "Produce new prose that follows the mined section logic without copying examples."
    inputs:
      required:
        - "topic"
        - "audience"
      optional:
        - "constraints"
        - "target_length"
    template_refs:
      - "tmpl-three-part-section"
    style_constraints:
      preserve_structure: true
      preserve_lexical_domain: true
      forbid_copying_examples: true
      allow_new_content: true
    instructions:
      system: |
        Write using the supplied template and pattern constraints.
        Preserve structural shape and rhetorical order.
        Do not quote or copy evidence examples.
      user: |
        Topic: {{topic}}
        Audience: {{audience}}
        Constraints: {{constraints}}
    output_contract:
      format: "markdown"
      sections:
        - role: "opening"
          expected_pattern_ref: "pat-para-1sent"
        - role: "development"
          expected_pattern_ref: "pat-para-2sent"
        - role: "closure"
          expected_pattern_ref: "pat-para-2sent"
Use this for:
- corpus expansion
- style-consistent drafting
- controlled generation
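A generation program can be rendered into concrete messages by filling its `{{name}}` placeholders. The sketch below is one possible approach, not the repo's implementation: `render_prompt` is a hypothetical helper, and replacing a missing optional input with an empty string is an assumption.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(program: dict, values: dict) -> dict:
    """Fill {{name}} placeholders in each instruction; required inputs must be present."""
    missing = [name for name in program["inputs"]["required"] if name not in values]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")

    def fill(match: re.Match) -> str:
        # Assumption: optional inputs that were not supplied render as empty text.
        return str(values.get(match.group(1), ""))

    return {role: PLACEHOLDER.sub(fill, text)
            for role, text in program["instructions"].items()}


example_program = {
    "inputs": {"required": ["topic", "audience"],
               "optional": ["constraints", "target_length"]},
    "instructions": {
        "system": "Write using the supplied template and pattern constraints.\n",
        "user": "Topic: {{topic}}\nAudience: {{audience}}\nConstraints: {{constraints}}\n",
    },
}
messages = render_prompt(example_program, {"topic": "Failover", "audience": "operators"})
```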
7. analysis_observations
Stores higher-order claims grounded in mined evidence.
analysis_observations:
  - id: "obs-paragraph-dominance"
    claim: "Paragraph-level structure is more reusable than sentence-level wording."
    type: "structural_observation"
    evidence:
      pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
      support_summary: "The highest-value repeated units occur at level 3."
    implication:
      - "Template work should start at paragraph level."
      - "Writers should vary wording within a stable paragraph scaffold."
Use this for:
- reports
- strategic analysis
- client-facing conclusions
8. validation_contracts
Defines how to check whether generated or manually written output conforms to the mined system.
validation_contracts:
  - id: "val-standard-expository-section"
    applies_to: ["tmpl-three-part-section", "gen-standard-expository-section"]
    template_scope: "section"
    structural_checks:
      - id: "paragraph_count"
        kind: "unit_count"
        rule: "3 <= paragraph_count <= 5"
        expected: 3
        severity: "error"
      - id: "opening_shape"
        kind: "slot_match"
        rule: "paragraph[0] matches pat-para-1sent"
        position: 1
        child_level: 3
        candidate_pattern_refs: ["pat-para-1sent"]
        severity: "error"
      - id: "development_shape"
        kind: "slot_match"
        rule: "at least 1 paragraph matches pat-para-2sent"
        position: 2
        child_level: 3
        candidate_pattern_refs: ["pat-para-2sent"]
        severity: "error"
    lexical_checks:
      - id: "domain_lexicon"
        kind: "contains_terms"
        rule: "contains at least 2 domain terms from corpus_profile.lexical_style.common_terms"
        terms: ["cluster", "node", "vm"]
        min_terms: 2
        severity: "warning"
    policy_checks:
      - id: "example_copying"
        kind: "max_example_overlap"
        rule: "generated text must not overlap any representative example by > 12 consecutive words"
        max_consecutive_words: 12
        severity: "error"
    scoring:
      pass_threshold: 0.85
      weights:
        structural: 0.6
        lexical: 0.2
        policy: 0.2
Use this for:
- draft scoring
- conformance checks
- prompt iteration
The validator reads the machine-oriented fields such as kind, expected,
position, candidate_pattern_refs, terms, and max_consecutive_words.
The rule string remains for human readability.
9. prompt_assets
Optional prompt fragments built from the structured spec.
prompt_assets:
  - id: "prompt-pattern-summary"
    kind: "system_fragment"
    derived_from:
      pattern_refs: ["pat-para-1sent", "pat-para-2sent"]
    text: |
      Use short, tightly bounded paragraphs.
      Prefer one- and two-sentence paragraph units.
      Let sentence two qualify or extend sentence one.
This keeps prompt engineering attached to explicit structural evidence instead of living outside the repo.
Authoring Rules
To keep this bridge artifact reliable:
- Every entry in pattern_catalog must reference at least one mined pattern.
- Every template must reference pattern catalog IDs, not raw mined signatures alone.
- Every observation must cite pattern_refs or template_refs.
- Every generation program should point to at least one validation contract.
- Interpretive labels should be additive; do not overwrite mined evidence.
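Most of these rules are reference checks and can be linted. The sketch below assumes the spec is loaded as plain dicts, that templates list their catalog IDs under based_on.pattern_refs, and that a validation contract covers a generation program when the program ID appears in applies_to, as in the examples in this document. The function name `check_authoring_rules` is hypothetical, and the last rule (additive labels) is not checked.

```python
def check_authoring_rules(spec: dict) -> list[str]:
    """Return one message per authoring-rule violation found in the spec."""
    problems = []
    catalog = spec.get("pattern_catalog", [])
    catalog_ids = {entry["id"] for entry in catalog}

    for entry in catalog:
        if not entry.get("source_patterns"):
            problems.append(f"{entry['id']}: no mined pattern referenced")

    for template in spec.get("template_systems", []):
        refs = template.get("based_on", {}).get("pattern_refs", [])
        unknown = [ref for ref in refs if ref not in catalog_ids]
        if not refs:
            problems.append(f"{template['id']}: no pattern catalog IDs referenced")
        elif unknown:
            problems.append(f"{template['id']}: unknown pattern refs {unknown}")

    for observation in spec.get("analysis_observations", []):
        evidence = observation.get("evidence", {})
        if not (evidence.get("pattern_refs") or evidence.get("template_refs")):
            problems.append(f"{observation['id']}: cites no pattern_refs or template_refs")

    covered = {ref for contract in spec.get("validation_contracts", [])
               for ref in contract.get("applies_to", [])}
    for program in spec.get("generation_programs", []):
        if program["id"] not in covered:
            problems.append(f"{program['id']}: no validation contract applies")
    return problems


example_spec = {
    "pattern_catalog": [{"id": "pat-para-3part",
                         "source_patterns": [{"pattern_id": "P-3-1263"}]}],
    "template_systems": [{"id": "tmpl-three-part-section",
                          "based_on": {"pattern_refs": ["pat-para-3part"]}}],
    "analysis_observations": [{"id": "obs-paragraph-dominance",
                               "evidence": {"pattern_refs": ["pat-para-3part"]}}],
    "generation_programs": [{"id": "gen-standard-expository-section"}],
    "validation_contracts": [{"id": "val-standard-expository-section",
                              "applies_to": ["gen-standard-expository-section"]}],
}
rule_problems = check_authoring_rules(example_spec)
```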
Minimal Viable Spec
If you want to start small, the first useful version only needs:
schema_version: "0.1"
source: {}
corpus_profile: {}
pattern_catalog: []
template_systems: []
generation_programs: []
validation_contracts: []
That is enough to support:
- named patterns
- reusable templates
- generation instructions
- post-generation checks
Recommended Next Compiler Step
The natural next module for this repo is:
patterns.yml -> compiled-pattern-spec.yml
That compiler should do two things:
- deterministically lift mined patterns into a richer graph of references
- leave labeled fields open for human or LLM curation where interpretation is needed
This keeps the system hybrid:
- miner = evidence extraction
- bridge schema = structured interpretation
- LLM = controlled synthesis
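A minimal sketch of the deterministic lifting step, under stated assumptions. The mined record shape (pattern_id, level, signature, support) is inferred from the source_patterns example in this document, not from the actual patterns.yml format. The bridge ID scheme, the default kind, the sort order, and the function names are all hypothetical.

```python
REVIEW_FIELDS = ["label", "summary", "intent"]


def lift_pattern(mined: dict) -> dict:
    """Copy evidence fields verbatim and leave interpretive fields open for review."""
    return {
        "id": f"pat-{mined['pattern_id'].lower()}",  # hypothetical ID scheme
        "source_patterns": [{
            "pattern_id": mined["pattern_id"],
            "level": mined["level"],
            "signature": mined["signature"],
            "support": mined["support"],
        }],
        "label": "",
        "kind": "structural_unit",  # assumed default for mined structure
        "level": mined["level"],
        "status": "observed",
        "summary": "",
        "intent": {"functional": "", "rhetorical": ""},
        "structure": {"canonical_signature": mined["signature"]},
        "evidence": {"support": mined["support"]},
        "review": {"status": "needs_review", "fields": list(REVIEW_FIELDS)},
    }


def compile_catalog(mined_patterns: list[dict]) -> list[dict]:
    """Lift every mined pattern in a stable order so reruns give identical output."""
    ordered = sorted(mined_patterns,
                     key=lambda p: (p["level"], -p["support"], p["pattern_id"]))
    return [lift_pattern(pattern) for pattern in ordered]


# Illustrative input; "P-4-7" is a made-up pattern ID.
mined_example = [
    {"pattern_id": "P-4-7", "level": 4, "signature": "PARA PARA", "support": 5},
    {"pattern_id": "P-3-1263", "level": 3, "signature": "S S", "support": 42},
]
compiled_catalog = compile_catalog(mined_example)
```

Sorting before lifting makes the output independent of input order, so the compiler stays as deterministic as the miner. The empty label, summary, and intent fields are the ones handed to human or LLM curation.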