# Assess API

## config

### dita_etl.assess.config

Assessment-stage configuration dataclasses.

Loaded from `config/assess.yaml` once at startup and passed immutably through the assessment pipeline.

#### AssessConfig (dataclass)

Root configuration object for the assessment stage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `intermediate` | `str` | Intermediate format name used in reports. | `'docbook5'` |
| `shingling` | `Shingling` | MinHash shingling parameters. | `Shingling()` |
| `scoring` | `ScoringWeights` | Scoring weights for readiness and risk. | `ScoringWeights()` |
| `classification` | `dict[str, list[str]]` | Keyword lists used by the topic-type predictor. | `{'task_keywords': ['click', 'run', 'open', 'select', 'type', 'press'], 'task_landmarks': ['prerequisites', 'steps', 'results', 'troubleshooting'], 'reference_markers': ['parameters', 'options', 'syntax', 'defaults']}` |
| `duplication` | `Duplication` | Near-duplicate handling settings. | `Duplication()` |
| `limits` | `Limits` | Content length thresholds. | `Limits()` |

Source code in `dita_etl/assess/config.py`
#### AssessConfig.load(path) (staticmethod)

Load an assessment configuration from a YAML file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the YAML configuration file. | *required* |

Returns:

| Type | Description |
|---|---|
| `AssessConfig` | Populated `AssessConfig` instance. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If `path` does not exist. |

Source code in `dita_etl/assess/config.py`
#### Shingling (dataclass)

MinHash shingling parameters for near-duplicate detection.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `stopwords` | `str \| None` | Optional path to a stopword list (currently unused). | `None` |
| `ngram` | `int` | Size of each shingle (token n-gram window). | `7` |
| `minhash_num_perm` | `int` | Number of permutations for the MinHash signature. | `64` |
| `threshold` | `float` | Jaccard similarity threshold above which two documents are considered near-duplicates. | `0.88` |

Source code in `dita_etl/assess/config.py`
#### ScoringWeights (dataclass)

Weights used by the topicization-readiness and conversion-risk scorers.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `topicization_weights` | `dict[str, int]` | Per-metric additive weights for the readiness score (0-100). | `{'heading_ladder_valid': 10, 'avg_section_len_target': 15, 'tables_simple': 10, 'lists_depth_ok': 10, 'images_with_alt': 5}` |
| `risk_weights` | `dict[str, int]` | Per-metric additive weights for the risk score (0-100). | `{'deep_nesting': 20, 'complex_tables': 25, 'unresolved_anchors': 15, 'mixed_inline_blocks': 10}` |

Source code in `dita_etl/assess/config.py`
#### Limits (dataclass)

Content length thresholds for scoring.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `target_section_tokens` | `list[int]` | Inclusive `[min, max]` target token range for a section. | `[50, 500]` |

Source code in `dita_etl/assess/config.py`
#### Duplication (dataclass)

Near-duplicate handling settings.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prefer_paths` | `list[str]` | Path prefixes that should be preferred when resolving duplicate clusters. | `list()` |
| `action` | `str` | Action to take for detected duplicates. | `'propose'` |

Source code in `dita_etl/assess/config.py`
## structure

### dita_etl.assess.structure

Markdown structural analysis — pure functions.

Provides sectionization and structural-validity checks for Markdown source documents. All functions are pure: they take text and return data structures with no I/O or side effects.

#### sectionize_markdown(text)

Split a Markdown document into logical sections at heading boundaries.

Each section is represented as a dictionary with keys:

- `"level"` – heading depth (1–6; 0 for the implicit preamble section).
- `"title"` – heading text, or `"Document"` for the preamble.
- `"content"` – body text between this heading and the next.

Example:

```python
secs = sectionize_markdown("# Intro\n\nHello\n## Details\n\nMore")
assert secs[0]["title"] == "Intro"
assert secs[1]["title"] == "Details"
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Raw Markdown source text. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | Ordered list of section dictionaries. |

Source code in `dita_etl/assess/structure.py`
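A minimal sketch of the sectionizer's contract; the real implementation may treat Setext headings, code fences, and other edge cases differently.

```python
import re
from typing import Any

def sectionize_markdown(text: str) -> list[dict[str, Any]]:
    # Split at ATX headings; text before the first heading becomes an
    # implicit level-0 "Document" preamble (dropped here when empty).
    sections: list[dict[str, Any]] = []
    current: dict[str, Any] = {"level": 0, "title": "Document", "content": ""}
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            if current["content"].strip() or current["level"] > 0:
                sections.append(current)
            current = {"level": len(m.group(1)),
                       "title": m.group(2).strip(),
                       "content": ""}
        else:
            current["content"] += line + "\n"
    sections.append(current)
    return sections

secs = sectionize_markdown("# Intro\n\nHello\n## Details\n\nMore")
```

Run against the example in the docstring, this yields sections titled `"Intro"` and `"Details"`.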
#### heading_ladder_valid(sections)

Check that heading levels do not skip more than one level at a time.

A document that jumps from `##` directly to `####` (skipping `###`) is considered invalid.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sections` | `list[dict[str, Any]]` | Section list as returned by `sectionize_markdown`. | *required* |

Returns:

| Type | Description |
|---|---|
| `bool` | `True` if every heading is at most one level deeper than the previous heading; `False` otherwise. |

Source code in `dita_etl/assess/structure.py`
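The check can be sketched as below; treating the level-0 preamble as exempt and the first real heading as always valid are assumptions, not documented behaviour.

```python
from typing import Any

def heading_ladder_valid(sections: list[dict[str, Any]]) -> bool:
    prev = 0
    for sec in sections:
        level = sec["level"]
        if level == 0:
            continue  # implicit preamble section carries no heading
        if prev and level - prev > 1:
            return False  # e.g. ## followed directly by ####
        prev = level
    return True
```

Note that dropping back any number of levels (e.g. `###` to `#`) is still valid; only downward skips fail.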
## features

### dita_etl.assess.features

Section feature extraction — pure functions.

Computes a feature vector for a single Markdown section. All functions are pure: identical inputs always produce identical outputs with no side effects.

#### count_tokens(text)

Count word tokens in text using a simple `\w+` pattern.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `int` | Number of token matches. |
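A direct sketch of the documented behaviour:

```python
import re

def count_tokens(text: str) -> int:
    # \w+ matches maximal runs of word characters (letters, digits, underscore)
    return len(re.findall(r"\w+", text))
```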
#### imperative_density(text, verbs)

Compute the ratio of imperative verb occurrences to total token count.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |
| `verbs` | `list[str]` | List of imperative verbs to search for (case-insensitive). | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Ratio between 0.0 and 1.0. |

Source code in `dita_etl/assess/features.py`
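A sketch of the ratio computation; returning 0.0 for empty input is an assumption to avoid division by zero.

```python
import re

def imperative_density(text: str, verbs: list[str]) -> float:
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0  # assumed behaviour for empty input
    verb_set = {v.lower() for v in verbs}  # case-insensitive matching
    return sum(1 for t in tokens if t in verb_set) / len(tokens)
```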
#### extract_features(section, landmarks)

Compute a feature dictionary for a single document section.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `section` | `dict[str, Any]` | Section dict with `"level"`, `"title"`, and `"content"` keys, as produced by `sectionize_markdown`. | *required* |
| `landmarks` | `dict[str, list[str]]` | Classification keyword lists from `AssessConfig.classification`. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary mapping feature names to scalar values. |

Source code in `dita_etl/assess/features.py`
## scoring

### dita_etl.assess.scoring

Document-readiness and conversion-risk scorers — pure functions.

All functions are pure: no I/O, no side effects.

#### score_topicization(metrics, weights, target_range)

Compute a topicization-readiness score in the range 0–100.

Higher values indicate a document that is well-structured for conversion to DITA topics.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `dict[str, Any]` | Metrics dictionary produced by the feature-extraction stage. | *required* |
| `weights` | `dict[str, int]` | Per-metric additive weights from `ScoringWeights.topicization_weights`. | *required* |
| `target_range` | `list[int]` | Two-element `[min, max]` target token range for section length. | *required* |

Returns:

| Type | Description |
|---|---|
| `int` | Integer readiness score clamped to [0, 100]. |

Source code in `dita_etl/assess/scoring.py`
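Assuming each boolean metric simply adds its weight when truthy, and the section-length metric adds its weight when a hypothetical `avg_section_tokens` value falls inside `target_range`, the scorer could look like:

```python
from typing import Any

def score_topicization(metrics: dict[str, Any], weights: dict[str, int],
                       target_range: list[int]) -> int:
    lo, hi = target_range
    score = 0
    for name, w in weights.items():
        if name == "avg_section_len_target":
            # "avg_section_tokens" is a hypothetical metric key;
            # the real name may differ
            if lo <= metrics.get("avg_section_tokens", 0) <= hi:
                score += w
        elif metrics.get(name):
            score += w
    return max(0, min(100, score))  # clamp to [0, 100]
```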
#### score_risk(metrics, weights)

Compute a conversion-risk score in the range 0–100.

Higher values indicate a document with structural patterns that are difficult to convert reliably.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `dict[str, Any]` | Metrics dictionary produced by the feature-extraction stage. | *required* |
| `weights` | `dict[str, int]` | Per-metric additive weights from `ScoringWeights.risk_weights`. | *required* |

Returns:

| Type | Description |
|---|---|
| `int` | Integer risk score clamped to [0, 100]. |

Source code in `dita_etl/assess/scoring.py`
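Assuming each weight is added when its metric is truthy (the real scorer may scale by counts instead), a minimal sketch:

```python
from typing import Any

def score_risk(metrics: dict[str, Any], weights: dict[str, int]) -> int:
    # sum the weights of every risk metric that is present and truthy
    score = sum(w for name, w in weights.items() if metrics.get(name))
    return max(0, min(100, score))  # clamp to [0, 100]
```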
## predict

### dita_etl.assess.predict

Topic-type prediction for individual sections — pure functions.

Split from `scoring.py` to give prediction its own single-responsibility module with a clear, testable interface.

#### predict_topic_type(section_feats, landmarks)

Predict the DITA topic type for a single section based on its features.

Rules are evaluated in priority order:

- Task: ordered list present and (imperative density > 0.005 or steps-style title detected).
- Reference: tables present or reference-marker keywords found.
- Concept: default fallback.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `section_feats` | `dict[str, Any]` | Feature dictionary as returned by `extract_features`. | *required* |
| `landmarks` | `dict[str, list[str]]` | Classification keyword lists (unused in the current heuristic but retained for future expansion). | *required* |

Returns:

| Type | Description |
|---|---|
| `tuple[str, float, list[str]]` | Tuple of the predicted topic type, a confidence value, and a list of supporting evidence strings. |

Source code in `dita_etl/assess/predict.py`
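The priority rules can be sketched directly; the feature key names (`ordered_list`, `imperative_density`, `steps_title`, `table_count`, `reference_markers`) and the confidence values are assumptions, not the library's actual names.

```python
from typing import Any

def predict_topic_type(section_feats: dict[str, Any],
                       landmarks: dict[str, list[str]]) -> tuple[str, float, list[str]]:
    # Rule 1: task — ordered list plus imperative style or a steps-like title
    if section_feats.get("ordered_list") and (
        section_feats.get("imperative_density", 0.0) > 0.005
        or section_feats.get("steps_title")
    ):
        return "task", 0.8, ["ordered list + imperative style"]
    # Rule 2: reference — tables or reference-marker keywords
    if section_feats.get("table_count", 0) > 0 or section_feats.get("reference_markers"):
        return "reference", 0.6, ["tables / reference markers"]
    # Rule 3: concept — default fallback
    return "concept", 0.5, ["default fallback"]
```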
## dedupe

### dita_etl.assess.dedupe

Near-duplicate detection via MinHash — pure functions.

Uses token n-gram shingling and MinHash signatures to efficiently cluster documents that are likely near-duplicates without comparing every pair of full texts.

All functions are pure: no I/O, no side effects.

#### shingle_tokens(text, n=7)

Tokenise text and return all overlapping n-gram shingles.

Example:

```python
shingles = shingle_tokens("the quick brown fox", n=2)
# ["the quick", "quick brown", "brown fox"]
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input document text. | *required* |
| `n` | `int` | Shingle size (token n-gram window). | `7` |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of n-gram strings, lower-cased and space-separated. |

Source code in `dita_etl/assess/dedupe.py`
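A sketch consistent with the docstring example; the handling of texts shorter than `n` tokens is an assumption.

```python
import re

def shingle_tokens(text: str, n: int = 7) -> list[str]:
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return []
    if len(tokens) < n:
        return [" ".join(tokens)]  # assumed: short texts yield one shingle
    # every overlapping window of n tokens, joined with single spaces
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

shingles = shingle_tokens("the quick brown fox", n=2)
# ["the quick", "quick brown", "brown fox"]
```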
#### minhash_signature(shingles, num_perm=128)

Compute a MinHash signature for a set of shingles.

Uses BLAKE2b with per-permutation personalisation bytes as a fast, independent hash family.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `shingles` | `list[str]` | List of shingle strings. | *required* |
| `num_perm` | `int` | Number of hash permutations (signature length). | `128` |

Returns:

| Type | Description |
|---|---|
| `list[int]` | List of `num_perm` integer values forming the MinHash signature. |

Source code in `dita_etl/assess/dedupe.py`
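The personalisation trick can be sketched with the standard library alone; the digest size and empty-input placeholder are assumed details.

```python
import hashlib

def minhash_signature(shingles: list[str], num_perm: int = 128) -> list[int]:
    if not shingles:
        return [0] * num_perm  # assumed placeholder for empty input
    sig: list[int] = []
    for p in range(num_perm):
        # `person` personalises BLAKE2b, giving an independent hash
        # function per "permutation" without needing num_perm seeds
        person = p.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, person=person).digest(),
                "big",
            )
            for s in shingles
        ))
    return sig
```

Because each position keeps only the minimum hash over the shingle set, the signature is independent of shingle order and duplicates.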
#### jaccard_from_signatures(sig1, sig2)

Estimate the Jaccard similarity of two sets from their MinHash signatures.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sig1` | `list[int]` | MinHash signature for the first set. | *required* |
| `sig2` | `list[int]` | MinHash signature for the second set. | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Estimated Jaccard similarity in [0.0, 1.0]; returns 0.0 if either signature is empty. |

Source code in `dita_etl/assess/dedupe.py`
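The standard MinHash estimator is the fraction of signature positions that agree; a short sketch:

```python
def jaccard_from_signatures(sig1: list[int], sig2: list[int]) -> float:
    if not sig1 or not sig2:
        return 0.0  # documented behaviour for empty signatures
    # the fraction of agreeing positions is an unbiased Jaccard estimate
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / min(len(sig1), len(sig2))
```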
#### cluster_near_duplicates(items, ngram, num_perm, threshold)

Group documents into near-duplicate clusters using MinHash.

Uses a greedy O(n²) clustering approach (sufficient for typical document-set sizes of hundreds to low thousands).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `items` | `list[tuple[str, str]]` | List of `(key, text)` pairs, one per document. | *required* |
| `ngram` | `int` | Shingle size for token n-grams. | *required* |
| `num_perm` | `int` | Number of MinHash permutations. | *required* |
| `threshold` | `float` | Jaccard similarity threshold; document pairs above this value are placed in the same cluster. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[list[str]]` | List of clusters, each cluster being a list of document keys. Every key appears in exactly one cluster. |

Source code in `dita_etl/assess/dedupe.py`
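A self-contained sketch of the greedy clustering, with a compact MinHash helper inlined; comparing each new document against one representative per cluster is an assumed reading of "greedy", and the real function likely composes `shingle_tokens`, `minhash_signature`, and `jaccard_from_signatures` instead.

```python
import hashlib
import re

def _signature(text: str, ngram: int, num_perm: int) -> list[int]:
    # Compact shingle + MinHash helper (mirrors shingle_tokens and
    # minhash_signature above in spirit; details are assumed).
    tokens = re.findall(r"\w+", text.lower())
    shingles = {" ".join(tokens[i:i + ngram])
                for i in range(max(1, len(tokens) - ngram + 1))}
    sig: list[int] = []
    for p in range(num_perm):
        person = p.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, person=person).digest(),
                "big",
            )
            for s in shingles
        ))
    return sig

def cluster_near_duplicates(items: list[tuple[str, str]], ngram: int,
                            num_perm: int, threshold: float) -> list[list[str]]:
    sigs = {key: _signature(text, ngram, num_perm) for key, text in items}
    clusters: list[list[str]] = []
    for key in sigs:
        for cluster in clusters:
            rep = sigs[cluster[0]]  # greedy: compare against cluster representative
            same = sum(1 for a, b in zip(sigs[key], rep) if a == b) / num_perm
            if same >= threshold:
                cluster.append(key)
                break
        else:
            clusters.append([key])  # no match: start a new singleton cluster
    return clusters
```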