MultiLevel Pattern Miner — How‑To Guide

A practical guide for running the tool, preparing sample corpora, and doing something intelligent with the output.

1) What this tool does

MultiLevel Pattern Miner discovers repeated structural patterns across six levels:

flowchart TD
    D["DOCUMENT (6)\nEntire file"]
    S["SECTION (5)\nHeading-delimited block"]
    C["CHUNK (4)\nGroup of blocks"]
    P["PARAGRAPH (3)\nBlock: paragraph · list item · table row · title"]
    L["LINE / Sentence (2)"]
    PH["PHRASE (1)\nComma/semicolon/colon clause"]
    D --> S --> C --> P --> L --> PH

A pattern is a recurrent shape at a given level, defined by how that node is composed of children at the lower level. Results are saved as YAML or JSON for downstream use.

2) Install and quick smoke test

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pytest -q                    # all green = you're good

If the project ships a pyproject.toml, you can instead install the package in editable mode. Run this from the repository root:

pip install -e .

Try a tiny run on provided examples:

python -m mplm.cli preview examples/example1.md
python -m mplm.cli mine examples/example1.md --out patterns.yml --min-support 1

preview prints the parsed hierarchy so you can sanity-check the section, block, and sentence boundaries.
mine creates patterns.yml with discovered pattern families.

3) Preparing a sample corpus

Good input makes good patterns. Here's how to prepare corpora that won't waste your time later.

3.1 Scope and goals

Decide the target level(s). Are you after paragraph shapes, list styles, or full‑section layouts? This choice affects how you clean and sample.
Define "pattern" up front. In this tool, a pattern is a shape (e.g., paragraph with three sentences). If you need semantic patterns, plan to augment with NLP later.

3.2 File format and encoding

Prefer UTF‑8 Markdown or plain text. One document per file.
If you have HTML, convert to Markdown first with a reliable converter to preserve headings, lists, and tables.
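
Before mining, it can help to confirm that every file really decodes as UTF-8. The following is a minimal sketch using only the standard library; the helper name `find_non_utf8` and the `*.md` default are illustrative and not part of the tool.

```python
from pathlib import Path


def find_non_utf8(root: str, pattern: str = "*.md") -> list[Path]:
    """Return files under root (searched recursively) that fail to decode as UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Calling `find_non_utf8("corpus")` returns the offending paths, which you can then re-encode or drop before running the miner.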

3.3 Structural cues (help the parser help you)

Headings: Use # H1, ## H2, etc. Sections are derived from headings.
Paragraphs: Separate with blank lines. Lines jammed together become one paragraph.
Lists: Use -, *, or 1. prefixes.
Tables: Use pipe‑delimited rows and keep the header separator row intact.
Titles: A top‑level # line is treated as section metadata; if you want it mined as a block, include it as a paragraph too.

3.4 Normalization

Strip boilerplate: banners, nav, and disclaimers that repeat everywhere but mean nothing.
Remove tracking tokens, build tags, and volatile timestamps.
Unwrap hard line breaks inside paragraphs unless they're meaningful verse/poetry.
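
The unwrapping step can be sketched as below. This is an illustrative helper, not something the tool provides: it joins consecutive prose lines and leaves headings, list items, and table rows alone. It does not try to detect verse, so keep poetry out of its input.

```python
import re

# Lines that start a new block and must not be merged into the previous line:
# headings (#), list items (-, *, 1.), and pipe-delimited table rows.
_BLOCK_START = re.compile(r"^\s*(#{1,6}\s|[-*]\s|\d+\.\s|\|)")


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines inside a paragraph into a single line."""
    out: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        can_merge = (
            bool(out)
            and out[-1] != ""                    # previous line is not blank
            and stripped != ""                   # current line is not blank
            and not _BLOCK_START.match(line)     # current line is not a new block
            and not _BLOCK_START.match(out[-1])  # previous line is ordinary prose
        )
        if can_merge:
            out[-1] = f"{out[-1]} {stripped}"
        else:
            out.append("" if stripped == "" else line.rstrip())
    return "\n".join(out)
```

The rule is deliberately conservative: a line that directly follows a heading or list item is never merged into it, so the worst case is a paragraph left partially wrapped rather than structure being destroyed.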

3.5 Sampling strategy

Stratify by content type: include examples of each list style, common paragraph shapes, and section archetypes.
Aim for 30-50 documents per content type for stable patterns. For early exploration, 5-10 documents can still reveal low‑hanging fruit.
Keep a hold‑out set to validate whether patterns generalize.
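
One way to combine stratification with a hold-out set is sketched below. It assumes you have already labeled each document with a content type yourself; the function name, the 20% default, and the labels are illustrative choices, not tool features.

```python
import random
from collections import defaultdict


def stratified_split(docs, holdout_fraction=0.2, seed=0):
    """Split (path, content_type) pairs into mining and hold-out lists,
    holding out roughly the same fraction of each content type."""
    by_type = defaultdict(list)
    for path, content_type in docs:
        by_type[content_type].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mine, holdout = [], []
    for content_type in sorted(by_type):
        paths = sorted(by_type[content_type])
        rng.shuffle(paths)
        # Hold out at least one document per type once there are two or more.
        n_hold = max(1, round(len(paths) * holdout_fraction)) if len(paths) > 1 else 0
        holdout.extend(paths[:n_hold])
        mine.extend(paths[n_hold:])
    return mine, holdout
```

Mine only the first list, then check whether the patterns you found also show up when you mine the hold-out list.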

3.6 Rights and safety

Ensure you have the right to process the text. Mask any sensitive content before sharing outputs.

4) Running the miner on real data

4.1 Single file

python -m mplm.cli preview path/to/doc.md
python -m mplm.cli mine path/to/doc.md --out patterns.yml --min-support 2

4.2 Batch over a directory

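Loop over the files in your shell. The example below mines every Markdown file under corpus/ into its own library, to be aggregated later. In bash, the ** glob only recurses into subdirectories when the globstar option is on (zsh supports it by default). Outputs are named by basename, so same‑named files in different subdirectories will overwrite each other.

mkdir -p out
shopt -s globstar            # bash only: make ** match nested directories
for f in corpus/**/*.md; do
  python -m mplm.cli mine "$f" --out "out/$(basename "$f" .md).yml" --min-support 2
done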
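
If you would rather drive the batch from Python, the shell loop above can be sketched as follows. The `mine` invocation is the one shown in this guide; the helper names and the `__`-joined output naming are assumptions made here so that files with the same name in different subdirectories get distinct outputs.

```python
import subprocess
import sys
from pathlib import Path


def output_path(src: Path, corpus_root: Path, out_dir: Path) -> Path:
    """Map corpus/a/b/doc.md to out/a__b__doc.yml to avoid name collisions."""
    relative = src.relative_to(corpus_root).with_suffix(".yml")
    return out_dir / "__".join(relative.parts)


def build_command(src: Path, dest: Path, min_support: int = 2) -> list[str]:
    """Build the same `mine` invocation used in the shell loop."""
    return [
        sys.executable, "-m", "mplm.cli", "mine", str(src),
        "--out", str(dest), "--min-support", str(min_support),
    ]


def mine_corpus(corpus_root: str = "corpus", out_dir: str = "out") -> None:
    """Run the miner once per Markdown file found under corpus_root."""
    root, out = Path(corpus_root), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(root.rglob("*.md")):
        # check=True stops the batch on the first failing file.
        subprocess.run(build_command(src, output_path(src, root, out)), check=True)
```

Using `sys.executable` keeps the subprocess inside the same virtual environment as the calling script.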

4.3 Tuning `--min-support`

1: exploratory, noisy, lots of micro‑patterns.
2-5: typical for small/medium corpora.
10+: large uniform corpora, better precision.
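
To see what the threshold does, here is a conceptual sketch of support filtering over a handful of made-up signatures. It illustrates the idea only; how the miner counts internally is not described in this guide.

```python
from collections import Counter


def frequent_signatures(signatures, min_support):
    """Count each signature and keep those seen at least min_support times."""
    counts = Counter(signatures)
    return {sig: n for sig, n in counts.items() if n >= min_support}


# Hypothetical paragraph signatures observed in one document.
observed = ["S S S", "S S", "S S S", "M", "S S S", "S S"]
print(frequent_signatures(observed, 1))  # every shape survives, including one-offs
print(frequent_signatures(observed, 2))  # the single "M" paragraph is dropped
print(frequent_signatures(observed, 3))  # only "S S S" remains
```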

5) Understanding the output

The miner writes a PatternLibrary to YAML/JSON. A pattern looks like:

- id: P-3-0123
  level: 3
  selector:
    engine: dsl
    query: "level==3 and sig=='S S S'"
  structure:
    signature: "S S S"
  support: 17
  examples:
    - source: examples/example1.md
      excerpt: "We went to sea. The sky was low..."
  centroid: ["S", "S", "S"]

5.1 Fields

level: 1-6. 3 is block level (paragraph/list/table/title).
structure.signature: the shape made of child types. Examples:
Paragraph of three sentences → S S S
List item with two sentences → S S
Chunk with two paragraphs and a list item → PARA PARA S
support: how many instances in this file match the signature.
examples: up to three short snippets for quick inspection.
selector: rudimentary query descriptor for matching nodes.
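
Because support is counted per file, a corpus-level view needs the per-file libraries combined. The sketch below sums support by level and signature. It assumes each library has been loaded into a list of mappings shaped like the example entry (for instance with PyYAML's `yaml.safe_load`); the actual top-level layout of the file may differ, so adjust the loading step to match.

```python
from collections import Counter


def aggregate_support(libraries):
    """Sum per-file support by (level, signature) across pattern libraries."""
    totals = Counter()
    for patterns in libraries:
        for pattern in patterns:
            key = (pattern["level"], pattern["structure"]["signature"])
            totals[key] += pattern["support"]
    return totals


# Two hand-written libraries using only the fields documented above.
lib_a = [{"level": 3, "structure": {"signature": "S S S"}, "support": 17}]
lib_b = [
    {"level": 3, "structure": {"signature": "S S S"}, "support": 4},
    {"level": 3, "structure": {"signature": "S S"}, "support": 2},
]
for (level, signature), support in aggregate_support([lib_a, lib_b]).most_common():
    print(level, signature, support)
```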

5.2 Signature tokens

flowchart LR
    subgraph leaf["Leaf node (no children)"]
        direction TB
        E["EMPTY\n0 words"]
        XS["XS\n1–4 words"]
        SS["S\n5–11 words"]
        M["M\n12–24 words"]
        LL["L\n25+ words"]
    end
    subgraph nonleaf["Non-leaf node (has children)"]
        direction TB
        P["P — phrase child"]
        SN["S — line/sentence child"]
        PA["PARA — paragraph child"]
        CH["CHUNK — chunk child"]
        SC["SEC — section child"]
    end
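
The leaf-node buckets in the diagram can be sketched as a small function. It assumes words are counted by splitting on whitespace, which may not match the tool's own tokenizer; the function name is illustrative.

```python
def leaf_token(text: str) -> str:
    """Map a leaf node's text to its length token by whitespace word count."""
    n = len(text.split())
    if n == 0:
        return "EMPTY"
    if n <= 4:
        return "XS"
    if n <= 11:
        return "S"
    if n <= 24:
        return "M"
    return "L"
```

Note that S appears in both groups of the diagram: on a leaf it means 5 to 11 words, while in a non-leaf signature it marks a line/sentence child.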