Transforms API

classify

dita_etl.transforms.classify

DITA topic-type classifier — pure functional core.

Classification proceeds in priority order:

  1. Filename rules (glob-style pattern matching against the basename).
  2. Content rules (regex search against the full document text).
  3. Built-in heuristics (keyword patterns searched in the content; task patterns are checked before reference patterns).
  4. Default fallback → "concept".

All functions are pure: they take data and return data with no side effects.
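The priority order above can be sketched as a small cascade. This is only an illustration under stated assumptions: `Rule` is a hypothetical stand-in for `ClassificationRule` (only the `pattern` and `topic_type` attributes are taken from the source), the two keyword patterns are invented for the example, and `fnmatch` stands in for the library's own glob handling.

```python
import fnmatch
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for ClassificationRule."""
    pattern: str
    topic_type: str


# Hypothetical keyword patterns; the library's real heuristics are not shown here.
TASK_RE = re.compile(r"\b(click|install|run)\b", re.IGNORECASE)
REF_RE = re.compile(r"\b(syntax|parameters?|options?)\b", re.IGNORECASE)


def classify(filename, content, filename_rules, content_rules):
    # 1. Filename rules win first (fnmatch is a stand-in for the glob step).
    for rule in filename_rules:
        if fnmatch.fnmatchcase(filename, rule.pattern):
            return rule.topic_type
    # 2. Content rules are case-insensitive regex searches.
    for rule in content_rules:
        if re.search(rule.pattern, content, re.IGNORECASE):
            return rule.topic_type
    # 3. Heuristics: task keywords are tried before reference keywords.
    if TASK_RE.search(content):
        return "task"
    if REF_RE.search(content):
        return "reference"
    # 4. Default fallback.
    return "concept"


text = "Click the button to install..."
by_heuristic = classify("install.md", text, [], [])
by_filename = classify("install.md", text, [Rule("*.md", "reference")], [])
by_content = classify("install.txt", text, [Rule("*.md", "reference")], [Rule("button", "concept")])
by_default = classify("notes.txt", "Background and rationale.", [], [])
```

With these assumed patterns, the same text is classified differently depending on which stage matches first, which is the point of the ordering.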

TOPIC_TYPES = frozenset({'concept', 'task', 'reference'}) module-attribute

classify_topic(filename, content, rules_by_filename, rules_by_content)

Determine the DITA topic type for a document.

Example:

result = classify_topic(
    "install.md",
    "Click the button to install...",
    rules_by_filename=[],
    rules_by_content=[],
)
assert result == "task"

Parameters:

filename (str, required): Basename of the source file (e.g. "guide.md").
content (str, required): Full text content of the (intermediate) document.
rules_by_filename (list['ClassificationRule'], required): Ordered list of filename classification rules.
rules_by_content (list['ClassificationRule'], required): Ordered list of content classification rules.

Returns:

str: One of "concept", "task", or "reference".

Source code in dita_etl/transforms/classify.py
def classify_topic(
    filename: str,
    content: str,
    rules_by_filename: list["ClassificationRule"],
    rules_by_content: list["ClassificationRule"],
) -> str:
    """Determine the DITA topic type for a document.

    :param filename: Basename of the source file (e.g. ``"guide.md"``).
    :param content: Full text content of the (intermediate) document.
    :param rules_by_filename: Ordered list of filename classification rules.
    :param rules_by_content: Ordered list of content classification rules.
    :returns: One of ``"concept"``, ``"task"``, or ``"reference"``.

    :Example:

    .. code-block:: python

        result = classify_topic(
            "install.md",
            "Click the button to install...",
            rules_by_filename=[],
            rules_by_content=[],
        )
        assert result == "task"
    """
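    # 1. Filename rules: escape regex metacharacters first, then turn each glob *
    #    into .* so that a pattern such as "*.md" only matches a literal ".md" suffix.
    for rule in rules_by_filename or []:
        pattern = re.escape(rule.pattern).replace(r"\*", ".*")
        if re.fullmatch(pattern, filename):
            return _validated(rule.topic_type)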

    # 2. Content rules
    for rule in rules_by_content or []:
        if re.search(rule.pattern, content, re.IGNORECASE):
            return _validated(rule.topic_type)

    # 3. Heuristics
    if _TASK_RE.search(content):
        return "task"
    if _REF_RE.search(content):
        return "reference"

    # 4. Default
    return "concept"
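When a glob is translated to a regular expression, a bare `*` to `.*` substitution leaves other metacharacters such as `.` active, so the match can be looser than the glob suggests. The sketch below compares that substitution with an escaped variant; both helper names are hypothetical and exist only for this illustration.

```python
import re


def glob_to_regex_plain(pattern: str) -> str:
    # Only the * is translated; "." keeps its regex meaning (any character).
    return pattern.replace("*", ".*")


def glob_to_regex_escaped(pattern: str) -> str:
    # Escape everything first, then re-enable the glob wildcard.
    return re.escape(pattern).replace(r"\*", ".*")


plain = glob_to_regex_plain("*.md")
escaped = glob_to_regex_escaped("*.md")

matches_real_file = bool(re.fullmatch(plain, "install.md"))
matches_lookalike = bool(re.fullmatch(plain, "installXmd"))
escaped_real_file = bool(re.fullmatch(escaped, "install.md"))
escaped_lookalike = bool(re.fullmatch(escaped, "installXmd"))
```

Both variants accept `install.md`, but only the plain substitution also accepts `installXmd`.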

dita

dita_etl.transforms.dita

Pure DITA XML construction functions (functional core).

All functions are pure: given the same inputs they always return the same output and have no side effects. They produce well-formed DITA 1.3 XML fragments as plain strings; serialisation to disk is handled by the imperative shell.

extract_title(docbook_text)

Extract the first <title> value from DocBook XML text.

Parameters:

docbook_text (str, required): Raw DocBook XML string.

Returns:

str: Title text, or "Untitled" if no <title> element is found.

Source code in dita_etl/transforms/dita.py
def extract_title(docbook_text: str) -> str:
    """Extract the first ``<title>`` value from DocBook XML text.

    :param docbook_text: Raw DocBook XML string.
    :returns: Title text, or ``"Untitled"`` if no ``<title>`` element is found.
    """
    match = re.search(r"<title>(.*?)</title>", docbook_text, re.IGNORECASE)
    return match.group(1) if match else "Untitled"
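A few edge cases follow directly from the regular expression in the listing above. The sketch copies that expression into a helper named `extract_title_sketch` (a hypothetical name used here so the block is self-contained) and probes it with small inputs.

```python
import re


def extract_title_sketch(docbook_text: str) -> str:
    # Same expression as in the listing: literal <title>, no DOTALL flag.
    match = re.search(r"<title>(.*?)</title>", docbook_text, re.IGNORECASE)
    return match.group(1) if match else "Untitled"


# Only the first <title> in the document is returned.
first = extract_title_sketch(
    "<article><title>Install</title><section><title>Steps</title></section></article>"
)
# A <title> tag carrying attributes does not match the literal "<title>".
with_attr = extract_title_sketch('<title xml:id="t">Install</title>')
# Without re.DOTALL, a title that spans a line break is not matched.
multiline = extract_title_sketch("<title>Install\nGuide</title>")
```

In the last two cases the function falls back to "Untitled", so inputs with attributed or multi-line titles may need normalising before they reach it.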

extract_body(docbook_text)

Extract paragraph content from DocBook XML text as DITA <p> elements.

Paragraphs inside <para> elements are converted; if none are found, all tags are stripped and the first 200 characters of the remaining plain text are XML-escaped and wrapped in a single <p>.

Parameters:

docbook_text (str, required): Raw DocBook XML string.

Returns:

str: String of one or more <p> elements suitable for embedding in a DITA topic body.

Source code in dita_etl/transforms/dita.py
def extract_body(docbook_text: str) -> str:
    """Extract paragraph content from DocBook XML text as DITA ``<p>`` elements.

    Paragraphs inside ``<para>`` elements are converted; if none are found
    the plain text is wrapped in a single ``<p>``.

    :param docbook_text: Raw DocBook XML string.
    :returns: String of one or more ``<p>`` elements suitable for embedding
        in a DITA topic body.
    """
    paras = re.findall(r"<para>(.*?)</para>", docbook_text, re.IGNORECASE | re.DOTALL)
    if paras:
        return "".join(f"<p>{p.strip()}</p>" for p in paras)
    # Fallback: strip all tags and wrap in a single paragraph.
    plain = re.sub(r"<[^>]+>", "", docbook_text)[:200]
    return f"<p>{saxutils.escape(plain)}</p>"
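The two code paths in the listing above behave quite differently, which a short run makes visible. The helper name `extract_body_sketch` is hypothetical; its body is copied from the listing so the block runs on its own.

```python
import re
from xml.sax import saxutils


def extract_body_sketch(docbook_text: str) -> str:
    paras = re.findall(r"<para>(.*?)</para>", docbook_text, re.IGNORECASE | re.DOTALL)
    if paras:
        # Paragraph content is stripped of surrounding whitespace and inserted as-is.
        return "".join(f"<p>{p.strip()}</p>" for p in paras)
    # Fallback: strip tags, keep at most 200 characters, escape, wrap once.
    plain = re.sub(r"<[^>]+>", "", docbook_text)[:200]
    return f"<p>{saxutils.escape(plain)}</p>"


converted = extract_body_sketch("<section><para> One </para><para>Two</para></section>")
fallback = extract_body_sketch("<section>" + "x" * 500 + "</section>")
```

Here `converted` holds one `<p>` per `<para>`, while `fallback` keeps only the first 200 of the 500 characters.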

build_topic(title, body, topic_type, topic_id='t1')

Render a minimal DITA 1.3 topic element.

Example:

xml = build_topic("Installation", "<p>Run the installer.</p>", "task")

Parameters:

title (str, required): Topic title text (will be XML-escaped).
body (str, required): Pre-formatted body content (inserted verbatim; the caller is responsible for validity).
topic_type (str, required): One of "concept", "task", or "reference".
topic_id (str, default 't1'): Value for the element's id attribute.

Returns:

str: Serialised DITA topic XML string.

Raises:

ValueError: If topic_type is not a known type.

Source code in dita_etl/transforms/dita.py
def build_topic(title: str, body: str, topic_type: str, topic_id: str = "t1") -> str:
    """Render a minimal DITA 1.3 topic element.

    :param title: Topic title text (will be XML-escaped).
    :param body: Pre-formatted body content (inserted verbatim — caller is
        responsible for validity).
    :param topic_type: One of ``"concept"``, ``"task"``, or ``"reference"``.
    :param topic_id: Value for the element's ``id`` attribute.
    :returns: Serialised DITA topic XML string.
    :raises ValueError: If *topic_type* is not a known type.

    :Example:

    .. code-block:: python

        xml = build_topic("Installation", "<p>Run the installer.</p>", "task")
    """
    template = _TOPIC_BUILDERS.get(topic_type)
    if template is None:
        raise ValueError(
            f"Unknown topic_type '{topic_type}'. Expected one of: "
            + ", ".join(sorted(_TOPIC_BUILDERS))
        )
    return template.format(
        id=topic_id,
        title=saxutils.escape(title),
        body=body,
    )
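The difference between the escaped title and the verbatim body is easiest to see with a concrete template. The templates below are hypothetical, since the contents of `_TOPIC_BUILDERS` are not shown on this page, and `render_topic` is a stand-in name; only the lookup, the `ValueError`, and the `format` call mirror the listing.

```python
from xml.sax import saxutils

# Hypothetical templates, keyed by topic type (assumed shapes, not the library's).
TEMPLATES = {
    "concept": '<concept id="{id}"><title>{title}</title><conbody>{body}</conbody></concept>',
    "reference": '<reference id="{id}"><title>{title}</title>'
                 '<refbody><section>{body}</section></refbody></reference>',
}


def render_topic(title: str, body: str, topic_type: str, topic_id: str = "t1") -> str:
    template = TEMPLATES.get(topic_type)
    if template is None:
        raise ValueError(f"Unknown topic_type {topic_type!r}")
    # The title is escaped; the body is trusted and inserted unchanged.
    return template.format(id=topic_id, title=saxutils.escape(title), body=body)


xml = render_topic("Install & configure", "<p>Run the installer.</p>", "concept")

try:
    render_topic("Glossary", "", "glossentry")
    rejected = False
except ValueError:
    rejected = True
```

Because the body is not escaped, a caller passing plain text with `&` or `<` in it would need to escape that text first.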

make_topicref(topic_path, base_dir)

Build a <topicref> element with a path relative to the map file.

Parameters:

topic_path (str, required): Absolute or relative path to the DITA topic file.
base_dir (str, required): Directory that the DITA map will be written to. The href attribute will be relative to this directory.

Returns:

str: A <topicref href="..." /> XML string.

Source code in dita_etl/transforms/dita.py
def make_topicref(topic_path: str, base_dir: str) -> str:
    """Build a ``<topicref>`` element with a path relative to the map file.

    :param topic_path: Absolute or relative path to the DITA topic file.
    :param base_dir: Directory that the DITA map will be written to. The
        ``href`` attribute will be relative to this directory.
    :returns: A ``<topicref href="..." />`` XML string.
    """
    abs_path = pathlib.Path(topic_path).resolve()
    # Note: the walk_up parameter of relative_to() requires Python 3.12 or later.
    rel_path = abs_path.relative_to(pathlib.Path(base_dir).resolve(), walk_up=True)
    return f'  <topicref href="{rel_path.as_posix()}" />'
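The `walk_up` argument to `relative_to()` was added in Python 3.12. On older interpreters, `os.path.relpath` should give the same result for ordinary paths; the sketch below uses it to show what the computed `href` looks like. `relative_href` is a hypothetical helper name, not part of the library.

```python
import os.path
import pathlib


def relative_href(topic_path: str, base_dir: str) -> str:
    # Resolve both sides so relative inputs are compared from the same root.
    target = pathlib.Path(topic_path).resolve()
    base = pathlib.Path(base_dir).resolve()
    # relpath may climb out of base_dir with ".." segments when needed.
    rel = os.path.relpath(target, base)
    # Normalise separators to forward slashes for use in an href.
    return pathlib.PurePath(rel).as_posix()


inside = relative_href("out/topics/a.dita", "out")
sibling = relative_href("out/topics/a.dita", "out/maps")
```

A topic below the map directory yields a plain relative path, while a topic in a sibling directory yields a path that starts with `..`.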

build_map(title, topic_paths, base_dir)

Build a complete DITA map XML document.

Parameters:

title (str, required): Human-readable map title (will be XML-escaped).
topic_paths (list[str], required): Paths to all DITA topic files to include. The paths are sorted lexicographically before the <topicref> elements are generated, so the input order is not preserved.
base_dir (str, required): Directory where the map will be written (used to compute relative href values).

Returns:

str: Complete DITA map XML as a string.

Source code in dita_etl/transforms/dita.py
def build_map(title: str, topic_paths: list[str], base_dir: str) -> str:
    """Build a complete DITA map XML document.

    :param title: Human-readable map title (will be XML-escaped).
    :param topic_paths: Paths to all DITA topic files to include (sorted before use).
    :param base_dir: Directory where the map will be written (used to compute
        relative ``href`` values).
    :returns: Complete DITA map XML as a string.
    """
    refs = "\n".join(make_topicref(p, base_dir) for p in sorted(topic_paths))
    return _MAP_TEMPLATE.format(
        title=saxutils.escape(title),
        refs=refs,
    )
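Because the listing passes `topic_paths` through `sorted()`, the order of the generated references is plain string order. The file names below are invented for illustration; they show how numeric prefixes behave with and without zero-padding.

```python
# Hypothetical topic file names with unpadded numeric prefixes.
unpadded = [
    "topics/2-install.dita",
    "topics/10-faq.dita",
    "topics/1-intro.dita",
]
# String comparison places "10-faq" before "2-install".
unpadded_order = sorted(unpadded)

# Zero-padding the prefixes makes string order agree with numeric order.
padded = [
    "topics/02-install.dita",
    "topics/10-faq.dita",
    "topics/01-intro.dita",
]
padded_order = sorted(padded)
```

If the map has to follow reading order, zero-padded prefixes are one way to get it without changing the function.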