MultiLevel Pattern Miner — How‑To Guide
A practical guide for running the tool, preparing sample corpora, and doing something intelligent with the output.
1) What this tool does
MultiLevel Pattern Miner discovers repeated structural patterns across six levels:
flowchart TD
D["DOCUMENT (6)\nEntire file"]
S["SECTION (5)\nHeading-delimited block"]
C["CHUNK (4)\nGroup of blocks"]
P["PARAGRAPH (3)\nBlock: paragraph · list item · table row · title"]
L["LINE / Sentence (2)"]
PH["PHRASE (1)\nComma/semicolon/colon clause"]
D --> S --> C --> P --> L --> PH
A pattern is a recurrent shape at a given level, defined by how that node is composed of children at the lower level. Results are saved as YAML or JSON for downstream use.
2) Install and quick smoke test
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pytest -q # all green = you're good
If the repository has a pyproject.toml, install the package in editable mode. Run this from the repository root:
pip install -e .
Try a tiny run on provided examples:
python -m mplm.cli preview examples/example1.md
python -m mplm.cli mine examples/example1.md --out patterns.yml --min-support 1
- preview prints the parsed hierarchy so you can sanity check boundaries.
- mine creates patterns.yml with discovered pattern families.
3) Preparing a sample corpus
Good input makes good patterns. Here's how to prepare corpora that won't waste your time later.
3.1 Scope and goals
- Decide the target level(s). Are you after paragraph shapes, list styles, or full‑section layouts? This choice affects how you clean and sample.
- Define "pattern" up front. In this tool, a pattern is a shape (e.g., paragraph with three sentences). If you need semantic patterns, plan to augment with NLP later.
3.2 File format and encoding
- Prefer UTF‑8 Markdown or plain text. One document per file.
- If you have HTML, convert to Markdown first with a reliable converter to preserve headings, lists, and tables.
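Before mining, it can help to confirm that every file really decodes as UTF‑8. A minimal sketch, assuming the corpus is a directory tree of .md files; the helper names is_utf8 and find_non_utf8 are illustrative and not part of the tool:

```python
from pathlib import Path


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(root: str, pattern: str = "*.md") -> list[Path]:
    """List files under root (searched recursively) that are not valid UTF-8."""
    return [
        path
        for path in sorted(Path(root).rglob(pattern))
        if not is_utf8(path.read_bytes())
    ]
```

Calling find_non_utf8("corpus") would return the files to re-encode before running the miner.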
3.3 Structural cues (help the parser help you)
- Headings: Use # H1, ## H2, etc. Sections are derived from headings.
- Paragraphs: Separate with blank lines. Lines jammed together become one paragraph.
- Lists: Use -, *, or 1. prefixes.
- Tables: Use pipe‑delimited rows; header separators intact.
- Titles: A top‑level # line is treated as section metadata; if you want it mined as a block, include it as a paragraph too.
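To spot files whose structural cues are missing, you can label each line by the cue it carries and look at the distribution. The sketch below is a rough approximation using regular expressions; it is not the tool's parser, and the name classify_line and the label set are assumptions made for illustration:

```python
import re

# Illustrative patterns only; the tool's real parser may differ.
HEADING = re.compile(r"^#{1,6}\s+\S")
LIST_ITEM = re.compile(r"^\s*(?:[-*]|\d+\.)\s+\S")
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")


def classify_line(line: str) -> str:
    """Label a line as 'blank', 'heading', 'table', 'list', or 'text'."""
    if not line.strip():
        return "blank"
    if HEADING.match(line):
        return "heading"
    if TABLE_ROW.match(line):
        return "table"
    if LIST_ITEM.match(line):
        return "list"
    return "text"
```

A document that yields only "text" labels probably lost its headings and lists during conversion and will produce flat, uninformative patterns.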
3.4 Normalization
- Strip boilerplate banners, nav, disclaimers that repeat everywhere but mean nothing.
- Remove tracking tokens, build tags, and volatile timestamps.
- Unwrap hard line breaks inside paragraphs unless they're meaningful verse/poetry.
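Two of these steps are easy to script. The following is a minimal sketch, assuming that volatile timestamps look like ISO 8601 values and that headings, list items, and table rows should never be merged into a neighboring paragraph. The function names are hypothetical, and the sketch does not handle code fences or list continuation lines:

```python
import re

# Assumption: timestamps look like 2024-05-01T12:30:00Z or 2024-05-01 12:30.
TIMESTAMP = re.compile(
    r"[ \t]?\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?(?:Z|[+-]\d{2}:\d{2})?"
)
# Lines that start a structural block and must stay on their own line.
STRUCTURAL = re.compile(r"^\s*(?:#{1,6}\s|[-*]\s|\d+\.\s|\|)")


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines within a paragraph; keep blank-line boundaries."""
    out: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            out.append(" ".join(buffer))
            buffer.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()
            out.append("")
        elif STRUCTURAL.match(line):
            flush()
            out.append(line.rstrip())
        else:
            buffer.append(line.strip())
    flush()
    return "\n".join(out)


def normalize(text: str) -> str:
    """Strip timestamps, then unwrap hard line breaks."""
    return unwrap_paragraphs(TIMESTAMP.sub("", text))
```

Run it on a copy of the corpus and diff a few files by hand before trusting it on everything.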
3.5 Sampling strategy
- Stratify by content type: include examples of each list style, common paragraph shapes, and section archetypes.
- Aim for 30-50 documents per content type for stable patterns. For early exploration, 5-10 documents can still reveal low‑hanging fruit.
- Keep a hold‑out set to validate whether patterns generalize.
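A stratified split with a hold‑out set can be sketched in a few lines. This assumes you have already labeled each document with a content type; the function name stratified_split, the 20% hold‑out fraction, and the fixed seed are illustrative choices, not recommendations from the tool:

```python
import random
from collections import defaultdict


def stratified_split(docs, holdout_fraction=0.2, seed=0):
    """Split (path, content_type) pairs into mining and hold-out lists per type."""
    by_type = defaultdict(list)
    for path, content_type in docs:
        by_type[content_type].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mine, holdout = [], []
    for content_type in sorted(by_type):
        paths = sorted(by_type[content_type])
        rng.shuffle(paths)
        # Hold out at least one document per type when there are two or more.
        k = max(1, round(len(paths) * holdout_fraction)) if len(paths) > 1 else 0
        holdout.extend(paths[:k])
        mine.extend(paths[k:])
    return mine, holdout
```

Mine only the first list, then check whether the discovered signatures also appear in the second.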
3.6 Rights and safety
- Ensure you have the right to process the text. Mask any sensitive content before sharing outputs.
4) Running the miner on real data
4.1 Single file
python -m mplm.cli preview path/to/doc.md
python -m mplm.cli mine path/to/doc.md --out patterns.yml --min-support 2
4.2 Batch over a directory
Use your shell. Example: mine all Markdown under corpus/ and aggregate pattern libraries later.
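mkdir -p out
# bash 4+ only: makes ** recurse into subdirectories (skip in zsh, where it already does)
shopt -s globstar
# Note: files with the same basename in different folders overwrite each other in out/
for f in corpus/**/*.md; do
  python -m mplm.cli mine "$f" --out "out/$(basename "$f" .md).yml" --min-support 2
done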
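Aggregating the per‑file libraries can be as simple as summing support per (level, signature). The sketch below assumes each library has already been loaded into a list of dicts carrying the level, structure.signature, and support fields shown in section 5; merge_libraries is a hypothetical helper, not a command the tool provides:

```python
from collections import Counter


def merge_libraries(libraries):
    """Sum support per (level, signature) across several pattern libraries.

    Each library is assumed to be a list of dicts shaped like the example
    pattern in section 5 (an assumption about the output layout).
    """
    totals = Counter()
    for library in libraries:
        for pattern in library:
            key = (pattern["level"], pattern["structure"]["signature"])
            totals[key] += pattern.get("support", 0)
    # Most common shapes first; ties broken by level, then signature.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

The result is a ranked list of ((level, signature), total_support) pairs that you can inspect or write back out.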
4.3 Tuning --min-support
- 1: exploratory, noisy, lots of micro‑patterns.
- 2-5: typical for small/medium corpora.
- 10+: large uniform corpora, better precision.
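To build intuition for the threshold, the sketch below treats --min-support as a plain occurrence count over signatures. That reading is an assumption about how the miner applies it, and frequent_signatures is an illustrative name:

```python
from collections import Counter


def frequent_signatures(signatures, min_support=2):
    """Keep signatures that occur at least min_support times."""
    counts = Counter(signatures)
    return {sig: n for sig, n in counts.items() if n >= min_support}
```

Raising min_support from 1 to 3 on a small sample usually removes most one‑off shapes, which is the noise versus precision trade‑off described above.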
5) Understanding the output
The miner writes a PatternLibrary to YAML/JSON. A pattern looks like:
- id: P-3-0123
  level: 3
  selector:
    engine: dsl
    query: "level==3 and sig=='S S S'"
  structure:
    signature: "S S S"
  support: 17
  examples:
    - source: examples/example1.md
      excerpt: "We went to sea. The sky was low..."
  centroid: ["S", "S", "S"]
5.1 Fields
- level: 1-6. 3 is block level (paragraph/list/table/title).
- structure.signature: the shape made of child types. Examples:
  - Paragraph of three sentences → S S S
  - List item with two sentences → S S
  - Chunk with two paragraphs and a list item → PARA PARA S
- support: how many instances in this file match the signature.
- examples: up to three short snippets for quick inspection.
- selector: rudimentary query descriptor for matching nodes.
5.2 Signature tokens
flowchart LR
subgraph leaf["Leaf node (no children)"]
direction TB
E["EMPTY\n0 words"]
XS["XS\n1–4 words"]
SS["S\n5–11 words"]
M["M\n12–24 words"]
LL["L\n25+ words"]
end
subgraph nonleaf["Non-leaf node (has children)"]
direction TB
P["P — phrase child"]
SN["S — line/sentence child"]
PA["PARA — paragraph child"]
CH["CHUNK — chunk child"]
SC["SEC — section child"]
end
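The token tables above translate directly into code. This is a minimal sketch, assuming words are counted by splitting on whitespace (the tool may tokenize differently); leaf_token, CHILD_TOKEN, and signature are illustrative names:

```python
def leaf_token(text: str) -> str:
    """Map a leaf node's word count to its size token."""
    n = len(text.split())  # assumption: whitespace-separated words
    if n == 0:
        return "EMPTY"
    if n <= 4:
        return "XS"
    if n <= 11:
        return "S"
    if n <= 24:
        return "M"
    return "L"


# Token emitted for a child at each level (levels as numbered in section 1).
CHILD_TOKEN = {1: "P", 2: "S", 3: "PARA", 4: "CHUNK", 5: "SEC"}


def signature(child_levels) -> str:
    """Build a non-leaf signature from the levels of a node's children."""
    return " ".join(CHILD_TOKEN[level] for level in child_levels)
```

For example, a paragraph whose three children are all level 2 sentences gets the signature "S S S". Note that S means "5 to 11 words" on a leaf but "sentence child" on a non‑leaf node, so read each token in the context of its node type.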