corpus_mine.py — User Guide

Aggregate repeated structural patterns across an entire Markdown corpus into a single pattern library (YAML or JSON). This script runs the pattern miner on every .md file, then merges results so support counts and examples reflect the whole corpus.


What it does

flowchart LR
    A["📁 corpus/\n*.md files"] -->|"rglob_files()"| B["File list"]
    B --> C["mine_one(file)\nper file, min_support=1"]
    C --> D["PatternLibrary\nper file"]
    D --> E["aggregate()\nmerge by level + signature"]
    E --> F["PatternLibrary\nwith corpus counts"]
    F -->|"min_support filter"| G["📊 corpus.yml"]
    C -.->|"--individual-out"| H["📊 per-file/*.yml"]

  • Recursively finds all *.md files under a corpus root.
  • Mines each file using the Pattern Miner pipeline.
  • Merges per-file patterns by (level, signature) so you get corpus-level counts.
  • Writes one aggregated output file. Optionally also writes per-file outputs.
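The merge step can be sketched as follows — a minimal illustration assuming each per-file result is a list of dicts with level, signature, support, and examples (the real PatternLibrary API may differ):

```python
from collections import defaultdict

def aggregate(per_file_patterns, min_support=2, max_examples=5):
    """Merge per-file patterns by (level, signature): sum support, pool examples."""
    merged = defaultdict(lambda: {"support": 0, "examples": []})
    for patterns in per_file_patterns:
        for p in patterns:
            entry = merged[(p["level"], p["signature"])]
            entry["support"] += p["support"]
            entry["examples"].extend(p["examples"])
    return [
        {"level": lvl, "signature": sig,
         "support": e["support"], "examples": e["examples"][:max_examples]}
        for (lvl, sig), e in merged.items()
        if e["support"] >= min_support
    ]
```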

Patterns follow the six-level hierarchy:

  1. Phrase
  2. Line or sentence
  3. Block: paragraph, list item, table row, title
  4. Chunk
  5. Section
  6. Document


Prerequisites

  • Python 3.9+
  • Project layout with src/mplm/... (or installable package)
  • Dependencies installed (from repo root):

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Making imports work

Choose one:

  • Editable install (recommended):

    pip install -e .

    Then run normally: python3 corpus_mine.py ...

  • No install, src layout: Run from repo root so the script can add src/ to sys.path.


Usage

python3 corpus_mine.py <corpus_root> <output_file> [--min-support N] [--individual-out DIR]

Arguments

  • corpus_root
    Directory to search recursively for *.md.

  • output_file
    Aggregated pattern library path. Use .yml/.yaml or .json.

Options

  • --min-support N
    Minimum corpus-level support to keep a pattern (default 2).
    Internally, each file is mined with min_support=1 to avoid dropping low-frequency patterns before merge.

  • --individual-out DIR
    Also write per-file pattern libraries into DIR, using the format implied by output_file's extension.
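Why mine each file with min_support=1? A pattern seen only once per file can still clear the corpus-level threshold after merging; a toy illustration:

```python
# A pattern that appears once in each of three files has per-file support 1.
per_file_supports = [1, 1, 1]

# Merge first, then filter: corpus support is 3, so it survives --min-support 3.
corpus_support = sum(per_file_supports)
assert corpus_support == 3

# Filtered per file first (min_support=2), it would never reach the merge.
survivors = [s for s in per_file_supports if s >= 2]
assert sum(survivors) == 0
```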


Examples

Aggregate the entire corpus to a single YAML:

python3 corpus_mine.py ./corpus ./out/patterns.yml

Raise the threshold and keep per-file JSON outputs too:

python3 corpus_mine.py ./corpus ./out/corpus.json --min-support 3 --individual-out ./out_per_file

Point to an absolute path corpus and YAML output:

python3 corpus_mine.py "/data/docs" "/data/out/patterns.yml"

Output format

The aggregated file is a PatternLibrary:

meta:
  min_support: 2
  scope: corpus
patterns:
  - id: P-3-1263
    level: 3
    selector:
      engine: dsl
      query: "level==3 and sig=='S S'"
    structure:
      signature: "S S"
    support: 47         # total across all documents
    examples:
      - source: "/abs/path/doc1.md"
        lines: [12, 15] # start_line, end_line (inclusive) if known
        excerpt: "First sentence. Second sentence..."
      - source: "/abs/path/doc8.md"
        lines: [44, 44]
        excerpt: "Another example..."
    centroid: ["S", "S"]

Field notes:

  • level: 1–6 as defined above.
  • structure.signature: structural “shape” of the node by child units. Common tokens: P phrase, S sentence, PARA paragraph block, CHUNK, SEC.
  • support: total count across the entire corpus after merging.
  • examples: up to 5 examples pooled from multiple files; lines are block-level ranges in the lightweight splitter.
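If you chose a .json output, the aggregated library can be inspected with the standard library alone. A sketch over a trimmed-down version of the schema above (the real file carries more keys, e.g. selector, examples, centroid):

```python
import json

doc = """{
  "meta": {"min_support": 2, "scope": "corpus"},
  "patterns": [
    {"level": 3, "structure": {"signature": "S S"}, "support": 47},
    {"level": 5, "structure": {"signature": "PARA PARA PARA"}, "support": 12}
  ]
}"""
lib = json.loads(doc)

# Rank patterns by corpus-wide support, highest first
top = sorted(lib["patterns"], key=lambda p: p["support"], reverse=True)
for p in top:
    print(p["level"], p["structure"]["signature"], p["support"])
```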


Interpreting signatures

  • Paragraph with two sentences → level: 3, signature: "S S"
  • Section made of three paragraphs → level: 5, signature: "PARA PARA PARA"
  • Sentence with three phrases (comma/semicolon/colon splits) → level: 2, signature: "P P P"

Use high-support signatures to define templates, lint rules, or editorial guidance.
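The mappings above can be reproduced with a rough splitter — a heuristic sketch, not the toolkit's actual splitter, which may tokenize differently:

```python
import re

def sentence_signature(paragraph: str) -> str:
    """One 'S' token per sentence (split after ., !, or ?)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    return " ".join("S" for _ in sentences)

def phrase_signature(sentence: str) -> str:
    """One 'P' token per phrase (comma/semicolon/colon splits)."""
    phrases = [p for p in re.split(r"[,;:]", sentence) if p.strip()]
    return " ".join("P" for _ in phrases)

assert sentence_signature("First sentence. Second sentence.") == "S S"
assert phrase_signature("one, two; three") == "P P P"
```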


Relationship to batch_mine.py

  • corpus_mine.py produces one aggregated library (and optionally per-file).
  • batch_mine.py produces one output file per document only.

If you need a single, corpus-wide view, use corpus_mine.py. If you also need per-file diagnostics, pass --individual-out.


Performance notes

  • Files are processed serially. If you need parallelism for large corpora, it’s safe to wrap the per-file mining step in a thread or process pool (the work is largely I/O-bound). Keep the merge in the main process.
  • The default sentence/phrase splitter is heuristic and fast. For higher fidelity, integrate spaCy or another NLP library, then re-run. Expect slower throughput.
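For example, the per-file step could be fanned out with a thread pool from the standard library — a sketch assuming a mine_one callable like the one in the pipeline diagram (the real function's signature may differ):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def mine_corpus_parallel(root, mine_one, workers=4):
    """Mine each *.md concurrently; keep the merge serial in the caller."""
    files = sorted(Path(root).rglob("*.md"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = list(pool.map(mine_one, files))
    return per_file  # feed this into the aggregate/merge step as before
```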

Preparing a corpus (quick checklist)

  • Text should be Markdown or plain UTF-8.
  • Use headings (#, ##, …) to define sections.
  • Separate paragraphs with blank lines.
  • Lists must start with -, *, or 1. at line start.
  • Tables: pipe-delimited rows with a header separator.
  • Strip boilerplate footers, nav, and volatile tokens to reduce noise.
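The first item on the checklist can be verified before mining — a small pre-flight sketch that flags files the miner would fail to read as UTF-8:

```python
from pathlib import Path

def check_corpus(root):
    """Return the *.md files under root that are not valid UTF-8."""
    bad = []
    for path in Path(root).rglob("*.md"):
        try:
            path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```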

Downstream uses

  • Style linting: treat dominant signatures as preferred shapes; flag deviations.
  • Templates: convert frequent patterns into authoring components.
  • Drift monitoring: mine regularly and track support changes for top signatures.
  • Graph analytics: load patterns into a graph to study co-occurrence across sections.

Troubleshooting

  • “No module named mplm”
    Activate your venv and either pip install -e . or run from repo root so src/ is importable.

  • “No .md files found”
    Confirm corpus_root is correct and contains Markdown files.

  • Examples show empty source or lines
    Ensure you’re using the upgraded Normalizer and BoundaryDetector that attach source and line ranges.

  • Too many tiny patterns
    Increase --min-support and/or clean boilerplate text.


Change log (script)

  • v0.1
    Initial release: per-file mining, corpus aggregation with global support, pooled examples, optional per-file outputs.

License

MIT (same as the toolkit).