Skip to content

DITA ETL Pipeline

Configuration

Configuration¶

`config/config.yaml` — main pipeline config¶

# External tool paths
tooling:
  pandoc_path: /usr/local/bin/pandoc       # required
  oxygen_scripts_dir: null                 # optional (Oxygen extractor)
  java_path: /usr/bin/java                 # optional (XSLT via Saxon)
  saxon_jar: /opt/saxon/saxon9he.jar       # optional

# Source file extensions to discover
source_formats:
  treat_as_markdown: [".md"]
  treat_as_html:     [".html", ".htm"]
  treat_as_docx:     [".docx"]

# DITA output settings
dita_output:
  output_folder: build/out
  map_title: "My Documentation Set"
  dita_version: "1.3"

# Override which extractor handles an extension
extract:
  handler_overrides:
    ".docx": "oxygen-docx"       # optional

# Topic-type classification rules (evaluated in order)
classification_rules:
  by_filename:
    - match: "guide"
      type: "task"
    - match: "index"
      type: "concept"
    - match: "*reference*"
      type: "reference"
  by_content:
    - match: "procedure"
      type: "task"
    - match: "parameters"
      type: "reference"

`config/assess.yaml` — assessment config¶

# Near-duplicate detection (MinHash)
shingling:
  ngram: 7                  # token n-gram window
  minhash_num_perm: 64      # MinHash permutations
  threshold: 0.88           # Jaccard similarity threshold

# Topicization-readiness scoring weights (sum → 0-100)
scoring:
  topicization_weights:
    heading_ladder_valid: 10
    avg_section_len_target: 15
    tables_simple: 10
    lists_depth_ok: 10
    images_with_alt: 5
  # Conversion-risk scoring weights (sum → 0-100)
  risk_weights:
    deep_nesting: 20
    complex_tables: 25
    unresolved_anchors: 15
    mixed_inline_blocks: 10

# Classification keyword lists
classification:
  task_keywords: ["click", "run", "open", "select", "type", "press"]
  task_landmarks: ["prerequisites", "steps", "results", "troubleshooting"]
  reference_markers: ["parameters", "options", "syntax", "defaults"]

# Near-duplicate handling
duplication:
  prefer_paths: []            # path prefixes to prefer when resolving
  action: "propose"           # "propose" | future: "remove" | "merge"

# Ideal section token range for readiness scoring
limits:
  target_section_tokens: [50, 500]

Environment variables¶

No environment variables are required. All configuration is file-based. Paths in YAML files may be absolute or relative to the working directory.