Transforms API¶

classify¶

`dita_etl.transforms.classify` ¶

DITA topic-type classifier — pure functional core.

Classification proceeds in priority order:

1. Filename rules (glob-style pattern matching against the basename).
2. Content rules (regex search against the full document text).
3. Built-in heuristics (keyword frequency in content).
4. Default fallback → "concept".

All functions are pure: they take data and return data with no side effects.
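The priority order can be illustrated with a small self-contained sketch. `Rule`, `TASK_HINT`, `REF_HINT` and `classify_sketch` are hypothetical stand-ins: the real `ClassificationRule` type and the module's built-in keyword patterns are not shown on this page, so the heuristics here are assumptions chosen only to make the example run.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for ClassificationRule (pattern + topic_type)."""
    pattern: str
    topic_type: str


# Assumed heuristics; the module's real keyword patterns are not shown here.
TASK_HINT = re.compile(r"\b(click|install|run)\b", re.IGNORECASE)
REF_HINT = re.compile(r"\b(parameter|syntax|option)s?\b", re.IGNORECASE)


def classify_sketch(filename, content, filename_rules, content_rules):
    # 1. Filename rules are checked first and win over everything else.
    for rule in filename_rules:
        if re.fullmatch(rule.pattern.replace("*", ".*"), filename):
            return rule.topic_type
    # 2. Content rules are searched case-insensitively.
    for rule in content_rules:
        if re.search(rule.pattern, content, re.IGNORECASE):
            return rule.topic_type
    # 3. Heuristics: task hints are tested before reference hints.
    if TASK_HINT.search(content):
        return "task"
    if REF_HINT.search(content):
        return "reference"
    # 4. Default fallback.
    return "concept"


by_name = [Rule("ref_*", "reference")]
by_content = [Rule(r"overview", "concept")]
text = "Click Install to begin."

a = classify_sketch("ref_cli.md", text, by_name, by_content)   # filename rule
b = classify_sketch("intro.md", "Overview. Click Next.", by_name, by_content)  # content rule
c = classify_sketch("intro.md", text, by_name, by_content)     # heuristic
d = classify_sketch("intro.md", "Some background.", by_name, by_content)  # default
```

In this sketch the same task-like text yields three different results depending on which stage matches first, which is the point of the ordering.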

`TOPIC_TYPES = frozenset({'concept', 'task', 'reference'})` `module-attribute` ¶

`classify_topic(filename, content, rules_by_filename, rules_by_content)` ¶

Determine the DITA topic type for a document.

Example:

result = classify_topic(
    "install.md",
    "Click the button to install...",
    rules_by_filename=[],
    rules_by_content=[],
)
assert result == "task"

Parameters:

Name	Type	Description	Default
`filename`	`str`	Basename of the source file (e.g. `"guide.md"`).	required
`content`	`str`	Full text content of the (intermediate) document.	required
`rules_by_filename`	`list['ClassificationRule']`	Ordered list of filename classification rules.	required
`rules_by_content`	`list['ClassificationRule']`	Ordered list of content classification rules.	required

Returns:

Type	Description
`str`	One of `"concept"`, `"task"`, or `"reference"`.

Source code in dita_etl/transforms/classify.py

def classify_topic(
    filename: str,
    content: str,
    rules_by_filename: list["ClassificationRule"],
    rules_by_content: list["ClassificationRule"],
) -> str:
    """Determine the DITA topic type for a document.

    :param filename: Basename of the source file (e.g. ``"guide.md"``).
    :param content: Full text content of the (intermediate) document.
    :param rules_by_filename: Ordered list of filename classification rules.
    :param rules_by_content: Ordered list of content classification rules.
    :returns: One of ``"concept"``, ``"task"``, or ``"reference"``.

    :Example:

    .. code-block:: python

        result = classify_topic(
            "install.md",
            "Click the button to install...",
            rules_by_filename=[],
            rules_by_content=[],
        )
        assert result == "task"
    """
    # 1. Filename rules (convert simple glob * → .* for regex)
    for rule in rules_by_filename or []:
        pattern = rule.pattern.replace("*", ".*")
        if re.fullmatch(pattern, filename):
            return _validated(rule.topic_type)

    # 2. Content rules
    for rule in rules_by_content or []:
        if re.search(rule.pattern, content, re.IGNORECASE):
            return _validated(rule.topic_type)

    # 3. Heuristics
    if _TASK_RE.search(content):
        return "task"
    if _REF_RE.search(content):
        return "reference"

    # 4. Default
    return "concept"
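One detail of the filename step is worth noting: the listing rewrites only `*`, so other regex metacharacters in a rule pattern (such as the `.` in `*.md`) keep their regex meaning. The sketch below shows the effect; `escaped_glob` is a hypothetical alternative for comparison, not part of the module.

```python
import re


def naive_glob(pattern):
    # Same conversion as in the listing above: only "*" is rewritten.
    return pattern.replace("*", ".*")


def escaped_glob(pattern):
    # Hypothetical alternative: escape everything, then restore "*" as a wildcard.
    return re.escape(pattern).replace(r"\*", ".*")


# With the naive conversion, "." matches any character, so "_md" passes as ".md".
naive_hit = re.fullmatch(naive_glob("*.md"), "readme_md") is not None
# With escaping, only a literal dot is accepted.
strict_hit = re.fullmatch(escaped_glob("*.md"), "readme_md") is not None
strict_ok = re.fullmatch(escaped_glob("*.md"), "readme.md") is not None
```

Rule authors who rely on the behaviour in the listing may want to write `\.` explicitly in filename patterns when a literal dot is intended.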

dita¶

`dita_etl.transforms.dita` ¶

Pure DITA XML construction functions (functional core).

All functions are pure: given the same inputs they always return the same output and have no side effects. They produce well-formed DITA 1.3 XML fragments as plain strings; serialisation to disk is handled by the imperative shell.

`extract_title(docbook_text)` ¶

Extract the text of the first `<title>` element from DocBook XML text.

Parameters:

Name	Type	Description	Default
`docbook_text`	`str`	Raw DocBook XML string.	required

Returns:

Type	Description
`str`	Title text, or `"Untitled"` if no `<title>` element is found.

Source code in dita_etl/transforms/dita.py

def extract_title(docbook_text: str) -> str:
    """Extract the first ``<title>`` value from DocBook XML text.

    :param docbook_text: Raw DocBook XML string.
    :returns: Title text, or ``"Untitled"`` if no ``<title>`` element is found.
    """
    match = re.search(r"<title>(.*?)</title>", docbook_text, re.IGNORECASE)
    return match.group(1) if match else "Untitled"
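A few edge cases follow directly from the regular expression in the listing. The function body is copied here only so the example is self-contained; the inputs are made up for illustration.

```python
import re


def extract_title(docbook_text: str) -> str:
    # Copied from the listing above so the example runs on its own.
    match = re.search(r"<title>(.*?)</title>", docbook_text, re.IGNORECASE)
    return match.group(1) if match else "Untitled"


# Only the first title is returned.
simple = extract_title("<article><title>Setup</title><title>Later</title></article>")
# Matching is case-insensitive.
upper = extract_title("<TITLE>Loud</TITLE>")
# A start tag with attributes does not match the literal "<title>".
with_attr = extract_title('<title id="t">Setup</title>')
# "." does not cross newlines, so a title spanning lines is not found.
multiline = extract_title("<title>Set\nup</title>")
# Inline markup inside the title is returned verbatim.
nested = extract_title("<title>Use <emphasis>care</emphasis></title>")
```

Because the returned text is raw markup, callers that later escape the title should be aware that entities already present in the source would be escaped a second time.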

`extract_body(docbook_text)` ¶

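Extract paragraph content from DocBook XML text as DITA `<p>` elements.

The content of each `<para>` element is stripped of surrounding whitespace and wrapped in a `<p>`. If no `<para>` elements are found, all tags are removed and the first 200 characters of the remaining plain text are XML-escaped and wrapped in a single `<p>`.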

Parameters:

Name	Type	Description	Default
`docbook_text`	`str`	Raw DocBook XML string.	required

Returns:

Type	Description
`str`	String of one or more `<p>` elements suitable for embedding in a DITA topic body.

Source code in dita_etl/transforms/dita.py

def extract_body(docbook_text: str) -> str:
    """Extract paragraph content from DocBook XML text as DITA ``<p>`` elements.

    Paragraphs inside ``<para>`` elements are converted; if none are found
    the plain text is wrapped in a single ``<p>``.

    :param docbook_text: Raw DocBook XML string.
    :returns: String of one or more ``<p>`` elements suitable for embedding
        in a DITA topic body.
    """
    paras = re.findall(r"<para>(.*?)</para>", docbook_text, re.IGNORECASE | re.DOTALL)
    if paras:
        return "".join(f"<p>{p.strip()}</p>" for p in paras)
    # Fallback: strip all tags and wrap in a single paragraph.
    plain = re.sub(r"<[^>]+>", "", docbook_text)[:200]
    return f"<p>{saxutils.escape(plain)}</p>"
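The two code paths can be exercised as follows. The function body is copied from the listing so the example is self-contained; the inputs are illustrative only.

```python
import re
from xml.sax import saxutils


def extract_body(docbook_text: str) -> str:
    # Copied from the listing above so the example runs on its own.
    paras = re.findall(r"<para>(.*?)</para>", docbook_text, re.IGNORECASE | re.DOTALL)
    if paras:
        return "".join(f"<p>{p.strip()}</p>" for p in paras)
    plain = re.sub(r"<[^>]+>", "", docbook_text)[:200]
    return f"<p>{saxutils.escape(plain)}</p>"


# <para> path: whitespace is stripped, inner newlines are kept (re.DOTALL).
converted = extract_body("<section><para> One </para>\n<para>Two\nlines</para></section>")

# Fallback path: tags are removed and the text is cut to 200 characters.
fallback = extract_body("<section>" + "x" * 300 + "</section>")

# Fallback path: the plain text is XML-escaped.
escaped = extract_body("<note>R&D</note>")
```

Note that paragraph content on the `<para>` path is inserted as-is, while only the fallback path applies escaping and truncation.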

`build_topic(title, body, topic_type, topic_id='t1')` ¶

Render a minimal DITA 1.3 topic element.

Example:

xml = build_topic("Installation", "<p>Run the installer.</p>", "task")

Parameters:

Name	Type	Description	Default
`title`	`str`	Topic title text (will be XML-escaped).	required
`body`	`str`	Pre-formatted body content (inserted verbatim — caller is responsible for validity).	required
`topic_type`	`str`	One of `"concept"`, `"task"`, or `"reference"`.	required
`topic_id`	`str`	Value for the element's `id` attribute.	`'t1'`

Returns:

Type	Description
`str`	Serialised DITA topic XML string.

Raises:

Type	Description
`ValueError`	If `topic_type` is not a known type.

Source code in dita_etl/transforms/dita.py

def build_topic(title: str, body: str, topic_type: str, topic_id: str = "t1") -> str:
    """Render a minimal DITA 1.3 topic element.

    :param title: Topic title text (will be XML-escaped).
    :param body: Pre-formatted body content (inserted verbatim — caller is
        responsible for validity).
    :param topic_type: One of ``"concept"``, ``"task"``, or ``"reference"``.
    :param topic_id: Value for the element's ``id`` attribute.
    :returns: Serialised DITA topic XML string.
    :raises ValueError: If *topic_type* is not a known type.

    :Example:

    .. code-block:: python

        xml = build_topic("Installation", "<p>Run the installer.</p>", "task")
    """
    template = _TOPIC_BUILDERS.get(topic_type)
    if template is None:
        raise ValueError(
            f"Unknown topic_type '{topic_type}'. Expected one of: "
            + ", ".join(sorted(_TOPIC_BUILDERS))
        )
    return template.format(
        id=topic_id,
        title=saxutils.escape(title),
        body=body,
    )
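The template lookup and escaping behaviour can be sketched as below. The `TEMPLATES` mapping and `build_topic_sketch` are hypothetical: the real `_TOPIC_BUILDERS` templates are not shown on this page, so the element layout here is an assumption based on the standard DITA body elements (`conbody`, `taskbody`, `refbody`).

```python
from xml.sax import saxutils

# Hypothetical templates; the module's actual _TOPIC_BUILDERS may differ.
TEMPLATES = {
    "concept": '<concept id="{id}"><title>{title}</title>'
               "<conbody>{body}</conbody></concept>",
    "task": '<task id="{id}"><title>{title}</title>'
            "<taskbody><context>{body}</context></taskbody></task>",
    "reference": '<reference id="{id}"><title>{title}</title>'
                 "<refbody><section>{body}</section></refbody></reference>",
}


def build_topic_sketch(title, body, topic_type, topic_id="t1"):
    template = TEMPLATES.get(topic_type)
    if template is None:
        raise ValueError(
            f"Unknown topic_type '{topic_type}'. Expected one of: "
            + ", ".join(sorted(TEMPLATES))
        )
    # Only the title is escaped; body and id are inserted verbatim.
    return template.format(id=topic_id, title=saxutils.escape(title), body=body)


xml = build_topic_sketch("Backup & Restore", "<p>Overview.</p>", "concept")

try:
    build_topic_sketch("X", "<p/>", "glossary")
    error = None
except ValueError as exc:
    error = str(exc)
```

As in the listing, `topic_id` is not escaped or validated, so callers are presumably expected to pass a value that is already a valid XML `id`.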

`make_topicref(topic_path, base_dir)` ¶

Build a `<topicref>` element with a path relative to the map file.

Parameters:

Name	Type	Description	Default
`topic_path`	`str`	Absolute or relative path to the DITA topic file.	required
`base_dir`	`str`	Directory that the DITA map will be written to. The `href` attribute will be relative to this directory.	required

Returns:

Type	Description
`str`	A `<topicref href="..." />` XML string.

Source code in dita_etl/transforms/dita.py

def make_topicref(topic_path: str, base_dir: str) -> str:
    """Build a ``<topicref>`` element with a path relative to the map file.

    :param topic_path: Absolute or relative path to the DITA topic file.
    :param base_dir: Directory that the DITA map will be written to. The
        ``href`` attribute will be relative to this directory.
    :returns: A ``<topicref href="..." />`` XML string.
    """
    abs_path = pathlib.Path(topic_path).resolve()
    rel_path = abs_path.relative_to(pathlib.Path(base_dir).resolve(), walk_up=True)
    return f'  <topicref href="{rel_path.as_posix()}" />'
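The `walk_up` argument to `relative_to` was added in Python 3.12, so the listing requires that version or newer. A rough equivalent for older interpreters can be sketched with `os.path.relpath`; `topicref_sketch` is a hypothetical helper, and unlike `Path.resolve()` it does not follow symlinks.

```python
import os.path
import pathlib


def topicref_sketch(topic_path, base_dir):
    # os.path.relpath emits ".." segments when the topic is outside base_dir.
    rel = os.path.relpath(topic_path, start=base_dir)
    # Normalise separators so the href uses forward slashes on every platform.
    href = pathlib.PurePath(rel).as_posix()
    return f'  <topicref href="{href}" />'


# Topic in a sibling directory of the map directory.
sibling = topicref_sketch("/docs/out/topics/install.dita", "/docs/out/maps")
# Topic below the map directory.
nested = topicref_sketch("/docs/out/topics/install.dita", "/docs/out")
```

Absolute paths are used in the example so the result does not depend on the current working directory.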

`build_map(title, topic_paths, base_dir)` ¶

Build a complete DITA map XML document.

Parameters:

Name	Type	Description	Default
`title`	`str`	Human-readable map title (will be XML-escaped).	required
`topic_paths`	`list[str]`	Paths to all DITA topic files to include. The references are emitted in sorted (lexicographic) path order, regardless of input order.	required
`base_dir`	`str`	Directory where the map will be written (used to compute relative `href` values).	required

Returns:

Type	Description
`str`	Complete DITA map XML as a string.

Source code in dita_etl/transforms/dita.py

def build_map(title: str, topic_paths: list[str], base_dir: str) -> str:
    """Build a complete DITA map XML document.

    :param title: Human-readable map title (will be XML-escaped).
    :param topic_paths: Paths to all DITA topic files to include, in order.
    :param base_dir: Directory where the map will be written (used to compute
        relative ``href`` values).
    :returns: Complete DITA map XML as a string.
    """
    refs = "\n".join(make_topicref(p, base_dir) for p in sorted(topic_paths))
    return _MAP_TEMPLATE.format(
        title=saxutils.escape(title),
        refs=refs,
    )
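The ordering and escaping behaviour of the listing can be sketched as follows. `MAP_TEMPLATE` and `build_map_sketch` are hypothetical: the real `_MAP_TEMPLATE` is not shown on this page, and `os.path.relpath` stands in for `make_topicref` to keep the example self-contained.

```python
import os
import os.path
from xml.sax import saxutils

# Hypothetical template; the module's actual _MAP_TEMPLATE may differ.
MAP_TEMPLATE = "<map>\n  <title>{title}</title>\n{refs}\n</map>"


def build_map_sketch(title, topic_paths, base_dir):
    lines = []
    for path in sorted(topic_paths):  # sorted, as in the listing above
        href = os.path.relpath(path, start=base_dir).replace(os.sep, "/")
        lines.append(f'  <topicref href="{href}" />')
    return MAP_TEMPLATE.format(title=saxutils.escape(title), refs="\n".join(lines))


# Input order is b, a; the map lists a before b.
doc = build_map_sketch("Q&A", ["/out/topics/b.dita", "/out/topics/a.dita"], "/out")
```

Because the sort is a plain string sort, a path such as `10.dita` sorts before `2.dita`; callers who need a specific reading order may want zero-padded file names.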
