# corpus_mine.py — User Guide
Aggregate repeated structural patterns across an entire Markdown corpus into a single pattern library (YAML or JSON). This script runs the pattern miner on every .md file, then merges results so support counts and examples reflect the whole corpus.
## What it does

```mermaid
flowchart LR
  A["📁 corpus/\n*.md files"] -->|"rglob_files()"| B["File list"]
  B --> C["mine_one(file)\nper file, min_support=1"]
  C --> D["PatternLibrary\nper file"]
  D --> E["aggregate()\nmerge by level + signature"]
  E --> F["PatternLibrary\nwith corpus counts"]
  F -->|"min_support filter"| G["📊 corpus.yml"]
  C -.->|"--individual-out"| H["📊 per-file/*.yml"]
```
- Recursively finds all `*.md` files under a corpus root.
- Mines each file using the Pattern Miner pipeline.
- Merges per-file patterns by `(level, signature)` so you get corpus-level counts.
- Writes one aggregated output file. Optionally also writes per-file outputs.

Patterns follow the six-level hierarchy:

1. Phrase
2. Line or sentence
3. Block: paragraph, list item, table row, title
4. Chunk
5. Section
6. Document
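The merge-by-`(level, signature)` step can be sketched as follows. This is a minimal illustration, not the actual `PatternLibrary` implementation; it assumes per-file patterns are plain dicts whose field names mirror the output schema documented later in this guide.

```python
def aggregate(per_file_patterns, min_support=2):
    """Merge per-file pattern lists keyed by (level, signature).

    Each pattern is assumed to be a dict like:
      {"level": 3, "signature": "S S", "support": 4, "examples": [...]}
    Support counts are summed; examples are pooled and capped at 5.
    """
    merged = {}
    for patterns in per_file_patterns:
        for p in patterns:
            key = (p["level"], p["signature"])
            if key not in merged:
                merged[key] = {"level": p["level"],
                               "signature": p["signature"],
                               "support": 0,
                               "examples": []}
            m = merged[key]
            m["support"] += p["support"]
            m["examples"] = (m["examples"] + p["examples"])[:5]
    # The corpus-level support threshold is applied only after merging,
    # which is why each file is mined with min_support=1.
    return [p for p in merged.values() if p["support"] >= min_support]
```

The key design point is that filtering happens after the merge: a pattern that appears once in each of five files still reaches corpus support 5.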
## Prerequisites

- Python 3.9+
- Project layout with `src/mplm/...` (or an installable package)
- Dependencies installed (from the repo root):

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Making imports work

Choose one:

- **Editable install (recommended):**

  ```bash
  pip install -e .
  ```

  Then run normally: `python3 corpus_mine.py ...`

- **No install, src layout:** run from the repo root so the script can add `src/` to `sys.path`.
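For the no-install path, the bootstrap amounts to a few lines at the top of the script. A sketch of the idea; `add_src_to_path` is an illustrative helper, not necessarily the script's actual code:

```python
import sys
from pathlib import Path

def add_src_to_path(repo_root):
    """Prepend <repo_root>/src to sys.path so `import mplm` resolves
    without installing the package."""
    src = str(Path(repo_root).resolve() / "src")
    if src not in sys.path:
        sys.path.insert(0, src)
    return src

# In the script itself this would typically be invoked as:
#   add_src_to_path(Path(__file__).parent)
# i.e. relative to the script's own location, not the current directory.
```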
## Usage

```bash
python3 corpus_mine.py <corpus_root> <output_file> [--min-support N] [--individual-out DIR]
```
### Arguments

- `corpus_root`: directory to search recursively for `*.md`.
- `output_file`: aggregated pattern library path. Use a `.yml`/`.yaml` or `.json` extension.
### Options

- `--min-support N`: minimum corpus-level support to keep a pattern (default `2`). Internally, each file is mined with `min_support=1` so that low-frequency patterns are not dropped before the merge.
- `--individual-out DIR`: also write per-file pattern libraries into `DIR` (same format as the `output_file` extension).
## Examples

Aggregate the entire corpus into a single YAML file:

```bash
python3 corpus_mine.py ./corpus ./out/patterns.yml
```

Raise the threshold and keep per-file JSON outputs too:

```bash
python3 corpus_mine.py ./corpus ./out/corpus.json --min-support 3 --individual-out ./out_per_file
```

Point to an absolute-path corpus and YAML output:

```bash
python3 corpus_mine.py "/data/docs" "/data/out/patterns.yml"
```
## Output format

The aggregated file is a serialized `PatternLibrary`:

```yaml
meta:
  min_support: 2
  scope: corpus
patterns:
  - id: P-3-1263
    level: 3
    selector:
      engine: dsl
      query: "level==3 and sig=='S S'"
    structure:
      signature: "S S"
    support: 47  # total across all documents
    examples:
      - source: "/abs/path/doc1.md"
        lines: [12, 15]  # start_line, end_line (inclusive) if known
        excerpt: "First sentence. Second sentence..."
      - source: "/abs/path/doc8.md"
        lines: [44, 44]
        excerpt: "Another example..."
    centroid: ["S", "S"]
```
Field notes:

- `level`: 1–6 as defined above.
- `structure.signature`: the structural "shape" of the node in terms of its child units. Common tokens: `P` phrase, `S` sentence, `PARA` paragraph block, `CHUNK`, `SEC`.
- `support`: total count across the entire corpus after merging.
- `examples`: up to 5 examples pooled from multiple files. `lines` are block-level ranges from the lightweight splitter.
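Once written, the library is easy to inspect programmatically. A minimal sketch for the JSON variant; `top_patterns` is a hypothetical helper, and the schema shown above is assumed:

```python
import json

def top_patterns(path, n=10):
    """Return the n highest-support patterns from an aggregated library
    written as JSON (use a YAML loader for .yml/.yaml outputs)."""
    with open(path, encoding="utf-8") as f:
        lib = json.load(f)
    return sorted(lib["patterns"],
                  key=lambda p: p["support"],
                  reverse=True)[:n]
```

Sorting by `support` surfaces the dominant shapes first, which is usually where template or lint candidates come from.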
## Interpreting signatures

- Paragraph with two sentences → `level: 3`, `signature: "S S"`
- Section made of three paragraphs → `level: 5`, `signature: "PARA PARA PARA"`
- Sentence with three phrases (comma/semicolon/colon splits) → `level: 2`, `signature: "P P P"`
Use high-support signatures to define templates, lint rules, or editorial guidance.
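The splitting rules behind these signatures can be approximated with simple heuristics. A sketch of the idea; the real splitter in the pipeline is assumed to handle abbreviations and other edge cases more carefully:

```python
import re

def sentence_signature(paragraph):
    """Level-3 signature: one 'S' token per sentence in a paragraph,
    splitting on whitespace that follows ., !, or ?."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    return " ".join("S" for _ in sentences)

def phrase_signature(sentence):
    """Level-2 signature: one 'P' token per comma/semicolon/colon-separated
    phrase within a sentence."""
    phrases = [p for p in re.split(r"[,;:]", sentence) if p.strip()]
    return " ".join("P" for _ in phrases)
```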
## Relationship to batch_mine.py

`corpus_mine.py` produces one aggregated library (and, optionally, per-file libraries). `batch_mine.py` produces one output file per document only.
If you need a single, corpus-wide view, use corpus_mine.py. If you also need per-file diagnostics, pass --individual-out.
## Performance notes
- Files are processed serially. If you need parallelism for large corpora, it’s safe to wrap the per-file mining step with a thread or process pool (I/O bound). Keep merge in the main process.
- The default sentence/phrase splitter is heuristic and fast. For higher fidelity, integrate spaCy/other NLP, then re-run. Expect slower throughput.
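A minimal sketch of such a wrapper using a thread pool; `mine_one_stub` stands in for the real per-file miner, and merging stays in the main process:

```python
from concurrent.futures import ThreadPoolExecutor

def mine_one_stub(path):
    """Stand-in for the real per-file miner. Real code would parse the
    file and mine patterns with min_support=1."""
    return [{"source": path, "level": 3, "signature": "S S", "support": 1}]

def mine_parallel(paths, workers=4):
    """Mine files concurrently; aggregation must stay serial afterwards.

    Executor.map preserves input order, so per_file[i] corresponds to
    paths[i]. Swap in ProcessPoolExecutor for CPU-heavy splitters.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = list(pool.map(mine_one_stub, paths))
    return per_file  # feed this list to the merge step in the main process
```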
## Preparing a corpus (quick checklist)

- Text should be Markdown or plain UTF-8.
- Use headings (`#`, `##`, …) to define sections.
- Separate paragraphs with blank lines.
- Lists must start with `-`, `*`, or `1.` at the start of a line.
- Tables: pipe-delimited rows with a header separator.
- Strip boilerplate footers, nav, and volatile tokens to reduce noise.
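The boilerplate-stripping step can be a simple regex pre-pass. A sketch with made-up footer/nav markers; adapt the patterns to whatever boilerplate your corpus actually contains:

```python
import re

# Hypothetical markers for illustration only; tune these per corpus.
BOILERPLATE = [
    re.compile(r"^Copyright .*$", re.MULTILINE),          # footer lines
    re.compile(r"^\[Home\]\(.*\) \| \[Next\]\(.*\)$",     # nav breadcrumbs
               re.MULTILINE),
]

def strip_boilerplate(text):
    """Remove matching boilerplate lines before mining."""
    for pat in BOILERPLATE:
        text = pat.sub("", text)
    return text
```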
## Downstream uses
- Style linting: treat dominant signatures as preferred shapes; flag deviations.
- Templates: convert frequent patterns into authoring components.
- Drift monitoring: mine regularly and track support changes for top signatures.
- Graph analytics: load patterns into a graph to study co-occurrence across sections.
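Drift monitoring can start as a simple support diff between two mining runs. A sketch assuming the pattern schema shown above; `support_drift` is an illustrative helper:

```python
def support_drift(old_patterns, new_patterns):
    """Map (level, signature) -> (old_support, new_support) for every
    pattern whose corpus support changed between two runs."""
    def index(patterns):
        return {(p["level"], p["signature"]): p["support"] for p in patterns}
    old, new = index(old_patterns), index(new_patterns)
    drift = {}
    # Union of keys catches patterns that appeared or disappeared entirely.
    for key in old.keys() | new.keys():
        o, n = old.get(key, 0), new.get(key, 0)
        if o != n:
            drift[key] = (o, n)
    return drift
```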
## Troubleshooting

- **"No module named `mplm`"**: activate your venv and either `pip install -e .` or run from the repo root so `src/` is importable.
- **"No .md files found"**: confirm `corpus_root` is correct and contains Markdown files.
- **Examples show empty `source` or `lines`**: ensure you're using the upgraded `Normalizer` and `BoundaryDetector` that attach `source` and line ranges.
- **Too many tiny patterns**: increase `--min-support` and/or clean boilerplate text.
## Change log (script)

- **v0.1**: initial release with per-file mining, corpus aggregation with global support, pooled examples, and optional per-file outputs.
## License
MIT (same as the toolkit).