Pipeline Integration Specification

This document defines how the Discovery subsystem integrates with the DITA Package Processor pipeline.

Discovery does not transform content. The pipeline does not re-discover structure.

They communicate through explicit artifacts, not vibes.

Design Goal

The pipeline must be able to:

Adapt to real-world DITA package variation
Make structural assumptions explicit
Refuse unsafe transformations deterministically
Avoid hardcoding corpus-specific heuristics

Discovery enables this by producing a machine-readable structural record that the pipeline can consult.

High-Level Flow

Filesystem
   ↓
Discovery Scanner
   ↓
Artifact Metadata
   ↓
Pattern Evaluation
   ↓
Evidence
   ↓
Discovery Report (JSON)
   ↓
Pipeline Planning
   ↓
Transformation Execution

Discovery always runs before the pipeline.

Discovery Output Contract

Discovery emits a JSON report containing:

Discovered artifacts
Structural metadata
Pattern evidence
Summary counts
Invariant violations (if any)

This report is:

Read-only
Serializable
Stable across runs (identical input yields an identical report)
Suitable for version control

The pipeline treats the report as input, not suggestion.
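A minimal sketch of how a pipeline might load the report as read-only input and re-serialize it in a stable form. The key names (`artifacts`, `evidence`, `summary`, `violations`) and the helpers `load_report` and `canonical_json` are assumptions for illustration; this specification lists what the report contains but does not name its fields.

```python
import json
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping

# Hypothetical top-level keys; the real report schema may differ.
REQUIRED_KEYS = ("artifacts", "evidence", "summary")


@dataclass(frozen=True)
class DiscoveryReport:
    """Immutable, in-memory view of a discovery report."""
    data: Mapping[str, Any]


def load_report(text: str) -> DiscoveryReport:
    raw = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"discovery report is missing keys: {missing}")
    # Violations are only present "if any", so default to an empty list.
    raw.setdefault("violations", [])
    # MappingProxyType rejects top-level writes; nested values are not frozen.
    return DiscoveryReport(MappingProxyType(raw))


def canonical_json(report: DiscoveryReport) -> str:
    # Sorted keys give the same bytes for the same content, which keeps
    # diffs meaningful when the report is committed to version control.
    return json.dumps(dict(report.data), sort_keys=True, indent=2)
```

Wrapping the parsed data in a read-only mapping is one way to make accidental mutation by the pipeline fail loudly rather than silently alter the record.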

Pipeline Consumption Model

The pipeline consumes discovery output in three distinct phases.

Phase 1: Preflight Validation

Before any transformation occurs, the pipeline evaluates:

Structural invariants
Required artifact presence
Cardinality constraints

Examples:

Exactly one MAIN map required
At least one executable map present
No orphaned topics (optional)

Behavior

Condition	Pipeline Action
All invariants pass	Continue
Non-fatal violations	Warn
Fatal violations	Abort

Discovery never aborts the pipeline. The pipeline decides.
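The behavior table above can be sketched as a small pure function. The `Violation` type, its `fatal` flag, and the invariant identifiers are hypothetical; the point is only that the severity decision lives in the pipeline, not in discovery.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Action(Enum):
    CONTINUE = "continue"
    WARN = "warn"
    ABORT = "abort"


@dataclass(frozen=True)
class Violation:
    invariant: str  # hypothetical identifier, e.g. "exactly_one_main_map"
    fatal: bool     # severity as classified by the pipeline


def preflight(violations: Iterable[Violation]) -> Action:
    found = list(violations)
    if any(v.fatal for v in found):
        return Action.ABORT      # at least one fatal violation
    if found:
        return Action.WARN       # only non-fatal violations
    return Action.CONTINUE       # all invariants pass
```

For example, a missing MAIN map classified as fatal yields `Action.ABORT`, while an orphaned topic classified as non-fatal yields `Action.WARN`.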

Phase 2: Planning

The pipeline uses discovery evidence to plan transformations.

Planning answers questions like:

Which map is the entry point?
Which maps are containers and which are executable?
Which topics are glossary material?
Which artifacts should be ignored?

Planning produces a plan, not side effects.

Important Rule

Planning is deterministic and explainable.

If the pipeline cannot explain why it is transforming something, it must refuse to proceed.
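One way to enforce this rule is to make the explanation a required part of every plan step. The sketch below is illustrative only: `PlanStep`, `build_plan`, the role names, and the `process_as_*` action labels are assumptions, not part of the pipeline's actual interface.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class PlanStep:
    artifact: str
    action: str
    reason: str  # justification taken from discovery evidence


class UnexplainedTransformation(Exception):
    """Raised when a step has no evidence-backed explanation."""


def build_plan(roles: Mapping[str, str],
               reasons: Mapping[str, str]) -> List[PlanStep]:
    """roles: artifact -> resolved role; reasons: artifact -> evidence summary."""
    steps = []
    for artifact in sorted(roles):  # sorted iteration keeps the plan deterministic
        role = roles[artifact]
        if role == "IGNORED":       # hypothetical role for skipped artifacts
            continue
        reason = reasons.get(artifact, "").strip()
        if not reason:
            raise UnexplainedTransformation(
                f"no explanation for treating {artifact!r} as {role}")
        steps.append(PlanStep(artifact, f"process_as_{role.lower()}", reason))
    return steps
```

The function returns data and performs no I/O, so the same inputs always produce the same plan and the plan can be inspected or logged before execution starts.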

Phase 3: Execution

Only after planning is complete does execution begin.

Execution may:

Normalize structures
Reparent topicrefs
Generate derived maps
Rewrite content safely

Execution must never:

Re-run discovery
Re-evaluate patterns
Guess intent

Discovery results are treated as authoritative for the run.

Evidence Resolution Strategy

Discovery may emit multiple, possibly conflicting evidence records for a single artifact.

Example:

index.ditamap asserted as MAIN (confidence 0.9)
single_root_map asserted as MAIN (confidence 0.7)

The pipeline resolves conflicts using a resolver policy:

Highest confidence wins
Ties require explicit configuration
Fallback evidence is always lowest priority

Resolvers are explicit, testable, and versioned.
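A minimal sketch of a resolver that follows the three policy rules above. The `Evidence` fields, the `fallback` flag, and the `tie_breaker` parameter are assumptions; the sketch also treats the two labels from the example (`index.ditamap`, `single_root_map`) as pattern identifiers, which the specification does not confirm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Evidence:
    pattern_id: str
    role: str
    confidence: float
    fallback: bool = False  # hypothetical marker for fallback evidence


class AmbiguousEvidence(Exception):
    """Raised when a tie cannot be resolved from configuration."""


def resolve(records: Sequence[Evidence],
            tie_breaker: Optional[str] = None) -> Evidence:
    if not records:
        raise ValueError("no evidence to resolve")
    # Fallback evidence competes only when nothing else is available.
    primary = [r for r in records if not r.fallback]
    pool = primary or list(records)
    best = max(r.confidence for r in pool)
    winners = [r for r in pool if r.confidence == best]
    if len(winners) == 1:
        return winners[0]          # highest confidence wins
    # Ties are never broken implicitly; configuration must name a pattern.
    for record in winners:
        if record.pattern_id == tie_breaker:
            return record
    raise AmbiguousEvidence(
        f"tie at confidence {best} between "
        f"{sorted(r.pattern_id for r in winners)}")
```

With the example records, the 0.9 record is selected over the 0.7 record. Two records at equal confidence raise `AmbiguousEvidence` unless `tie_breaker` names one of them.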

Configuration Hooks

The pipeline may be configured to:

Override pattern confidence thresholds
Disable specific pattern IDs
Require specific roles to be present
Treat certain roles as fatal if missing

All overrides are declarative and logged.

There is no hidden behavior.
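The four hooks above could be expressed as a declarative, immutable configuration object whose effects are logged as they are applied. Every name here (`PipelineConfig`, its fields, `filter_evidence`, `missing_roles`, and the record keys) is hypothetical and shown only to make the idea concrete.

```python
import logging
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Mapping

log = logging.getLogger("pipeline.config")


@dataclass(frozen=True)
class PipelineConfig:
    min_confidence: float = 0.0                    # confidence threshold override
    disabled_patterns: FrozenSet[str] = frozenset()
    required_roles: FrozenSet[str] = frozenset()
    fatal_if_missing: FrozenSet[str] = frozenset()


def filter_evidence(evidence: List[Mapping],
                    config: PipelineConfig) -> List[Mapping]:
    kept = []
    for record in evidence:
        pattern = record["pattern_id"]
        if pattern in config.disabled_patterns:
            log.info("dropped %s: pattern disabled by configuration", pattern)
        elif record["confidence"] < config.min_confidence:
            log.info("dropped %s: confidence %.2f below threshold %.2f",
                     pattern, record["confidence"], config.min_confidence)
        else:
            kept.append(record)
    return kept


def missing_roles(evidence: List[Mapping],
                  config: PipelineConfig) -> Dict[str, bool]:
    """Map each missing required role to whether its absence is fatal."""
    present = {record["role"] for record in evidence}
    return {role: role in config.fatal_if_missing
            for role in sorted(config.required_roles - present)}
```

Because the configuration is a frozen value and every dropped record is logged, a reader of the log can reconstruct why a given piece of evidence did or did not reach the resolver.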

Failure Philosophy

Discovery failures are data. Pipeline failures are decisions.

The pipeline must never silently:

Guess a main map
Invent a glossary
Ignore structural anomalies

If the pipeline proceeds, it does so knowingly.

Non-Goals

This integration explicitly does not:

Use discovery to auto-fix packages
Rewrite discovery output
Call LLMs at runtime
Introduce nondeterminism

Discovery is analysis. The pipeline is action.

Summary

Discovery answers:

“What is actually here?”

The pipeline answers:

“Given that reality, what can we safely do?”

They are separate on purpose. Confusing them is how tools become untrustworthy.

This integration makes that confusion impossible.