Contracts API

dita_etl.contracts

Stage input/output contracts.

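All contracts are frozen dataclasses: their fields cannot be reassigned after construction, and the contracts that define a __post_init__ check raise ContractError when a required value is empty or out of range. They form the typed boundaries between pipeline stages, making implicit assumptions explicit so that one stage can be refactored without silently breaking the next.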

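Example usage:

from dita_etl.contracts import ExtractInput

# extract_stage is assumed to be an Extract stage instance created elsewhere.
input_ = ExtractInput(
    source_paths=("docs/guide.md", "docs/ref.html"),
    intermediate_dir="build/intermediate",
)
output = extract_stage.run(input_)
assert output.success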

ContractError

Bases: ValueError

Raised when a stage contract is violated at construction time.
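
Because ContractError derives from ValueError, callers can catch it either specifically or through an existing ValueError handler. The following is a minimal sketch of that behaviour; the class is re-declared locally so the snippet runs without dita_etl installed, and `require_non_empty` is a hypothetical helper that is not part of the library.

```python
class ContractError(ValueError):
    """Local stand-in for dita_etl.contracts.ContractError."""


def require_non_empty(name: str, value: str) -> None:
    """Hypothetical helper: raise ContractError when value is empty."""
    if not value:
        raise ContractError(f"{name} must not be empty")


caught = None
try:
    require_non_empty("output_dir", "")
except ValueError as exc:  # also catches ContractError, its subclass
    caught = exc

print(type(caught).__name__, "-", caught)
```

Catching ContractError by name keeps contract violations separate from other ValueErrors raised deeper in a stage.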

Source code in dita_etl/contracts.py
class ContractError(ValueError):
    """Raised when a stage contract is violated at construction time."""

AssessInput dataclass

Input contract for the Assess stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source_paths` | `tuple[str, ...]` | Absolute or relative paths of all source files to assess. | required |
| `output_dir` | `str` | Directory where assessment artefacts will be written. | required |
| `config_path` | `str` | Path to the assessment YAML configuration file. | required |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class AssessInput:
    """Input contract for the Assess stage.

    :param source_paths: Absolute or relative paths of all source files to
        assess.
    :param output_dir: Directory where assessment artefacts will be written.
    :param config_path: Path to the assessment YAML configuration file.
    """

    source_paths: tuple[str, ...]
    output_dir: str
    config_path: str

    def __post_init__(self) -> None:
        if not self.source_paths:
            raise ContractError("AssessInput.source_paths must not be empty")
        if not self.output_dir:
            raise ContractError("AssessInput.output_dir must not be empty")
        if not self.config_path:
            raise ContractError("AssessInput.config_path must not be empty")
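
A minimal sketch of how this contract behaves at construction time. The class is copied from the listing above so the snippet runs standalone (in real code it would be imported from dita_etl.contracts), and all paths are hypothetical.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass


class ContractError(ValueError):
    """Local stand-in for dita_etl.contracts.ContractError."""


@dataclass(frozen=True)
class AssessInput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    source_paths: tuple[str, ...]
    output_dir: str
    config_path: str

    def __post_init__(self) -> None:
        if not self.source_paths:
            raise ContractError("AssessInput.source_paths must not be empty")
        if not self.output_dir:
            raise ContractError("AssessInput.output_dir must not be empty")
        if not self.config_path:
            raise ContractError("AssessInput.config_path must not be empty")


# Hypothetical paths, used only for illustration.
valid = AssessInput(
    source_paths=("docs/guide.md",),
    output_dir="build/assess",
    config_path="config/assess.yaml",
)

# An empty tuple is rejected before the stage ever runs.
empty_error = None
try:
    AssessInput(
        source_paths=(),
        output_dir="build/assess",
        config_path="config/assess.yaml",
    )
except ContractError as exc:
    empty_error = str(exc)

# Frozen dataclasses refuse attribute assignment after construction.
try:
    valid.output_dir = "elsewhere"
    was_frozen = False
except FrozenInstanceError:
    was_frozen = True
```

Note that the checks test only for emptiness, not for type: a bare string passed as `source_paths` would be accepted.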

AssessOutput dataclass

Output contract for the Assess stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inventory_path` | `str` | Path to the written `inventory.json` file. | required |
| `dedupe_path` | `str` | Path to the written `dedupe_map.json` file. | required |
| `report_path` | `str` | Path to the written HTML report file. | required |
| `plans_dir` | `str` | Directory containing per-file conversion plan JSONs. | required |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class AssessOutput:
    """Output contract for the Assess stage.

    :param inventory_path: Path to the written ``inventory.json`` file.
    :param dedupe_path: Path to the written ``dedupe_map.json`` file.
    :param report_path: Path to the written HTML report file.
    :param plans_dir: Directory containing per-file conversion plan JSONs.
    """

    inventory_path: str
    dedupe_path: str
    report_path: str
    plans_dir: str

    def __post_init__(self) -> None:
        if not self.inventory_path:
            raise ContractError("AssessOutput.inventory_path must not be empty")
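
As listed above, only `inventory_path` is checked on construction; the other three fields may be empty without raising. The sketch below illustrates this, using a copy of the class so that it runs standalone. `empty_fields` is a hypothetical helper, not part of the library, showing one way a caller could detect the unchecked fields.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


class ContractError(ValueError):
    """Local stand-in for dita_etl.contracts.ContractError."""


@dataclass(frozen=True)
class AssessOutput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    inventory_path: str
    dedupe_path: str
    report_path: str
    plans_dir: str

    def __post_init__(self) -> None:
        if not self.inventory_path:
            raise ContractError("AssessOutput.inventory_path must not be empty")


def empty_fields(contract: object) -> list[str]:
    """Hypothetical helper: names of dataclass fields holding falsy values."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


# Constructs without error even though three paths are empty.
partial = AssessOutput(
    inventory_path="build/assess/inventory.json",  # hypothetical path
    dedupe_path="",
    report_path="",
    plans_dir="",
)
missing = empty_fields(partial)
```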

ExtractInput dataclass

Input contract for the Extract stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source_paths` | `tuple[str, ...]` | Paths to source documents to convert. | required |
| `intermediate_dir` | `str` | Directory where intermediate DocBook XML files will be written. | required |
| `handler_overrides` | `dict[str, str]` | Optional mapping of file extension to extractor name, e.g. `{".docx": "oxygen-docx"}`. | `dict()` |
| `max_workers` | `int \| None` | Thread-pool size for parallel extraction. `None` uses a sensible default based on CPU count. | `None` |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class ExtractInput:
    """Input contract for the Extract stage.

    :param source_paths: Paths to source documents to convert.
    :param intermediate_dir: Directory where intermediate DocBook XML files
        will be written.
    :param handler_overrides: Optional mapping of file extension to extractor
        name, e.g. ``{".docx": "oxygen-docx"}``.
    :param max_workers: Thread-pool size for parallel extraction. ``None``
        uses a sensible default based on CPU count.
    """

    source_paths: tuple[str, ...]
    intermediate_dir: str
    handler_overrides: dict[str, str] = field(default_factory=dict)
    max_workers: int | None = None

    def __post_init__(self) -> None:
        if not self.source_paths:
            raise ContractError("ExtractInput.source_paths must not be empty")
        if not self.intermediate_dir:
            raise ContractError("ExtractInput.intermediate_dir must not be empty")
        if self.max_workers is not None and self.max_workers < 1:
            raise ContractError("ExtractInput.max_workers must be >= 1")
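
A minimal sketch of the optional fields and the `max_workers` check. The class is copied from the listing above so the snippet runs standalone, and the file paths are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ContractError(ValueError):
    """Local stand-in for dita_etl.contracts.ContractError."""


@dataclass(frozen=True)
class ExtractInput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    source_paths: tuple[str, ...]
    intermediate_dir: str
    handler_overrides: dict[str, str] = field(default_factory=dict)
    max_workers: int | None = None

    def __post_init__(self) -> None:
        if not self.source_paths:
            raise ContractError("ExtractInput.source_paths must not be empty")
        if not self.intermediate_dir:
            raise ContractError("ExtractInput.intermediate_dir must not be empty")
        if self.max_workers is not None and self.max_workers < 1:
            raise ContractError("ExtractInput.max_workers must be >= 1")


# Defaults: no handler overrides, worker count left to the stage.
default = ExtractInput(
    source_paths=("docs/guide.md",),
    intermediate_dir="build/intermediate",
)

# Explicit override for .docx files and a fixed pool size.
tuned = ExtractInput(
    source_paths=("docs/spec.docx",),
    intermediate_dir="build/intermediate",
    handler_overrides={".docx": "oxygen-docx"},
    max_workers=4,
)

# A pool size below 1 violates the contract.
worker_error = None
try:
    ExtractInput(
        source_paths=("docs/guide.md",),
        intermediate_dir="build/intermediate",
        max_workers=0,
    )
except ContractError as exc:
    worker_error = str(exc)
```

`frozen=True` prevents reassigning `handler_overrides`, but the dict itself can still be mutated in place, so callers should treat it as read-only by convention.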

ExtractOutput dataclass

Output contract for the Extract stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `outputs` | `dict[str, str]` | Mapping of source path → intermediate XML path for every successfully extracted file. | required |
| `errors` | `dict[str, str]` | Mapping of source path → error message for every file that failed extraction. | required |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class ExtractOutput:
    """Output contract for the Extract stage.

    :param outputs: Mapping of source path → intermediate XML path for every
        successfully extracted file.
    :param errors: Mapping of source path → error message for every file that
        failed extraction.
    """

    outputs: dict[str, str]
    errors: dict[str, str]

    @property
    def success(self) -> bool:
        """``True`` when no extraction errors occurred."""
        return len(self.errors) == 0

success property

True when no extraction errors occurred.
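
A minimal sketch of inspecting a partially failed extraction. The class is copied from the listing above so the snippet runs standalone; the paths and the error message are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractOutput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    outputs: dict[str, str]
    errors: dict[str, str]

    @property
    def success(self) -> bool:
        """True when no extraction errors occurred."""
        return len(self.errors) == 0


# Hypothetical result: one file converted, one failed.
result = ExtractOutput(
    outputs={"docs/guide.md": "build/intermediate/guide.xml"},
    errors={"docs/ref.html": "unsupported encoding"},
)

failed = sorted(result.errors)  # source paths that need attention
summary = f"{len(result.outputs)} extracted, {len(result.errors)} failed"
```

Since `success` is false as soon as one file fails, a caller that tolerates partial results can still continue with `result.outputs`.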

TransformInput dataclass

Input contract for the Transform stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `intermediates` | `dict[str, str]` | Mapping of source path → intermediate XML path (output of the Extract stage). | required |
| `output_dir` | `str` | Directory where DITA topic files will be written. | required |
| `rules_by_filename` | `tuple[object, ...]` | Classification rules matched against filenames. | `tuple()` |
| `rules_by_content` | `tuple[object, ...]` | Classification rules matched against file content. | `tuple()` |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class TransformInput:
    """Input contract for the Transform stage.

    :param intermediates: Mapping of source path → intermediate XML path
        (output of the Extract stage).
    :param output_dir: Directory where DITA topic files will be written.
    :param rules_by_filename: Classification rules matched against filenames.
    :param rules_by_content: Classification rules matched against file content.
    """

    intermediates: dict[str, str]
    output_dir: str
    rules_by_filename: tuple[object, ...] = field(default_factory=tuple)
    rules_by_content: tuple[object, ...] = field(default_factory=tuple)

    def __post_init__(self) -> None:
        if not self.output_dir:
            raise ContractError("TransformInput.output_dir must not be empty")
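
A minimal sketch of feeding Extract results into this contract. The class is copied from the listing above so the snippet runs standalone, and the `extracted` mapping is a hypothetical stand-in for `ExtractOutput.outputs`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ContractError(ValueError):
    """Local stand-in for dita_etl.contracts.ContractError."""


@dataclass(frozen=True)
class TransformInput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    intermediates: dict[str, str]
    output_dir: str
    rules_by_filename: tuple[object, ...] = field(default_factory=tuple)
    rules_by_content: tuple[object, ...] = field(default_factory=tuple)

    def __post_init__(self) -> None:
        if not self.output_dir:
            raise ContractError("TransformInput.output_dir must not be empty")


# Hypothetical stand-in for ExtractOutput.outputs from the previous stage.
extracted = {"docs/guide.md": "build/intermediate/guide.xml"}

transform_input = TransformInput(intermediates=extracted, output_dir="build/dita")

# Only output_dir is checked, so an empty mapping is accepted as well.
empty_ok = TransformInput(intermediates={}, output_dir="build/dita")

# A missing output directory is rejected.
dir_error = None
try:
    TransformInput(intermediates=extracted, output_dir="")
except ContractError as exc:
    dir_error = str(exc)
```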

TransformOutput dataclass

Output contract for the Transform stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `topics` | `dict[str, list[str]]` | Mapping of source path → list of generated DITA topic paths. | required |
| `errors` | `dict[str, str]` | Mapping of source path → error message for every file that failed transformation. | required |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class TransformOutput:
    """Output contract for the Transform stage.

    :param topics: Mapping of source path → list of generated DITA topic
        paths.
    :param errors: Mapping of source path → error message for every file that
        failed transformation.
    """

    topics: dict[str, list[str]]
    errors: dict[str, str]

    @property
    def success(self) -> bool:
        """``True`` when no transform errors occurred."""
        return len(self.errors) == 0

success property

True when no transform errors occurred.
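
Because one source file can yield several topics, `topics` maps each source path to a list. The sketch below flattens that mapping to count the generated topics; the class is copied from the listing above so the snippet runs standalone, and the paths are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TransformOutput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    topics: dict[str, list[str]]
    errors: dict[str, str]

    @property
    def success(self) -> bool:
        """True when no transform errors occurred."""
        return len(self.errors) == 0


# Hypothetical result: two sources producing three topics in total.
result = TransformOutput(
    topics={
        "docs/guide.md": ["build/dita/guide_intro.dita", "build/dita/guide_setup.dita"],
        "docs/ref.html": ["build/dita/ref.dita"],
    },
    errors={},
)

# Flatten the per-source lists into a single list of topic paths.
all_topics = [path for paths in result.topics.values() for path in paths]
```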

LoadInput dataclass

Input contract for the Load stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `topics` | `dict[str, list[str]]` | Mapping of source path → list of DITA topic paths (output of the Transform stage). | required |
| `output_dir` | `str` | Directory where the DITA map and assets will be written. | required |
| `map_title` | `str` | Human-readable title for the generated DITA map. | required |
| `intermediate_dir` | `str \| None` | Optional path to the intermediate directory so that assets (images, styles) can be copied to the output. | `None` |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class LoadInput:
    """Input contract for the Load stage.

    :param topics: Mapping of source path → list of DITA topic paths (output
        of the Transform stage).
    :param output_dir: Directory where the DITA map and assets will be written.
    :param map_title: Human-readable title for the generated DITA map.
    :param intermediate_dir: Optional path to the intermediate directory so
        that assets (images, styles) can be copied to the output.
    """

    topics: dict[str, list[str]]
    output_dir: str
    map_title: str
    intermediate_dir: str | None = None

    def __post_init__(self) -> None:
        if not self.output_dir:
            raise ContractError("LoadInput.output_dir must not be empty")
        if not self.map_title:
            raise ContractError("LoadInput.map_title must not be empty")
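
A minimal sketch of building this contract from Transform results. The class is copied from the listing above so the snippet runs standalone; the topic paths and the map title are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


class ContractError(ValueError):
    """Local stand-in for dita_etl.contracts.ContractError."""


@dataclass(frozen=True)
class LoadInput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    topics: dict[str, list[str]]
    output_dir: str
    map_title: str
    intermediate_dir: str | None = None

    def __post_init__(self) -> None:
        if not self.output_dir:
            raise ContractError("LoadInput.output_dir must not be empty")
        if not self.map_title:
            raise ContractError("LoadInput.map_title must not be empty")


# Hypothetical stand-in for TransformOutput.topics from the previous stage.
topics = {"docs/guide.md": ["build/dita/guide_intro.dita"]}

# Without intermediate_dir, no asset directory is supplied to the stage.
load_input = LoadInput(topics=topics, output_dir="build/dita", map_title="User Guide")

# A blank title is rejected at construction time.
title_error = None
try:
    LoadInput(topics=topics, output_dir="build/dita", map_title="")
except ContractError as exc:
    title_error = str(exc)
```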

LoadOutput dataclass

Output contract for the Load stage.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `map_path` | `str` | Absolute path to the written DITA map file. | required |
| `topic_count` | `int` | Number of topic references included in the map. | required |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class LoadOutput:
    """Output contract for the Load stage.

    :param map_path: Absolute path to the written DITA map file.
    :param topic_count: Number of topic references included in the map.
    """

    map_path: str
    topic_count: int

    def __post_init__(self) -> None:
        if self.topic_count < 0:
            raise ContractError("LoadOutput.topic_count must be >= 0")
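
A minimal sketch of the `topic_count` check: zero is accepted (an empty map), while a negative count is rejected. The class is copied from the listing above so the snippet runs standalone, and the map path is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


class ContractError(ValueError):
    """Local stand-in for dita_etl.contracts.ContractError."""


@dataclass(frozen=True)
class LoadOutput:
    """Copy of the contract listed above, re-declared for a standalone run."""

    map_path: str
    topic_count: int

    def __post_init__(self) -> None:
        if self.topic_count < 0:
            raise ContractError("LoadOutput.topic_count must be >= 0")


# An empty map is a valid result.
empty_map = LoadOutput(map_path="/abs/build/dita/index.ditamap", topic_count=0)

# A negative count can only come from a bug in the stage, so it is rejected.
count_error = None
try:
    LoadOutput(map_path="/abs/build/dita/index.ditamap", topic_count=-1)
except ContractError as exc:
    count_error = str(exc)
```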

PipelineOutput dataclass

Aggregated result returned by the full pipeline run.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `assess` | `AssessOutput` | Output from the Assess stage. | required |
| `extract` | `ExtractOutput` | Output from the Extract stage. | required |
| `transform` | `TransformOutput` | Output from the Transform stage. | required |
| `load` | `LoadOutput` | Output from the Load stage. | required |

Source code in dita_etl/contracts.py
@dataclass(frozen=True)
class PipelineOutput:
    """Aggregated result returned by the full pipeline run.

    :param assess: Output from the Assess stage.
    :param extract: Output from the Extract stage.
    :param transform: Output from the Transform stage.
    :param load: Output from the Load stage.
    """

    assess: AssessOutput
    extract: ExtractOutput
    transform: TransformOutput
    load: LoadOutput

    @property
    def map_path(self) -> str:
        """Convenience accessor for the final DITA map path."""
        return self.load.map_path

map_path property

Convenience accessor for the final DITA map path.
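
A minimal sketch of consuming the aggregated result. The stage outputs are re-declared in trimmed form (validation omitted for brevity) so the snippet runs standalone. `pipeline_succeeded` is a hypothetical helper, not part of the library, and all paths are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AssessOutput:
    inventory_path: str
    dedupe_path: str
    report_path: str
    plans_dir: str


@dataclass(frozen=True)
class ExtractOutput:
    outputs: dict[str, str]
    errors: dict[str, str]

    @property
    def success(self) -> bool:
        return len(self.errors) == 0


@dataclass(frozen=True)
class TransformOutput:
    topics: dict[str, list[str]]
    errors: dict[str, str]

    @property
    def success(self) -> bool:
        return len(self.errors) == 0


@dataclass(frozen=True)
class LoadOutput:
    map_path: str
    topic_count: int


@dataclass(frozen=True)
class PipelineOutput:
    assess: AssessOutput
    extract: ExtractOutput
    transform: TransformOutput
    load: LoadOutput

    @property
    def map_path(self) -> str:
        """Convenience accessor for the final DITA map path."""
        return self.load.map_path


def pipeline_succeeded(result: PipelineOutput) -> bool:
    """Hypothetical helper: True when Extract and Transform report no errors."""
    return result.extract.success and result.transform.success


result = PipelineOutput(
    assess=AssessOutput(
        inventory_path="build/assess/inventory.json",
        dedupe_path="build/assess/dedupe_map.json",
        report_path="build/assess/report.html",
        plans_dir="build/assess/plans",
    ),
    extract=ExtractOutput(
        outputs={"docs/guide.md": "build/intermediate/guide.xml"},
        errors={},
    ),
    transform=TransformOutput(
        topics={"docs/guide.md": ["build/dita/guide.dita"]},
        errors={},
    ),
    load=LoadOutput(map_path="/abs/build/dita/index.ditamap", topic_count=1),
)

final_map = result.map_path  # same value as result.load.map_path
```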