Pipeline API
============

dita_etl.pipeline
-----------------
Pipeline orchestrator: the imperative shell.

:func:`run_pipeline` is the single entry point; it composes the four
pipeline stages in order. This is the only module that:

- performs filesystem setup (creating output directories),
- constructs stage instances from configuration,
- emits structured log messages at stage boundaries,
- propagates typed contracts between stages.

Pure transformation logic lives in :mod:`dita_etl.transforms`;
I/O primitives live in :mod:`dita_etl.io`.
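The shell/core split above can be sketched as follows. This is a minimal, self-contained illustration, not the real implementation: the stage functions, contract classes, and output directory shown here are hypothetical stand-ins for whatever ``dita_etl`` actually defines.

.. code-block:: python

    from dataclasses import dataclass
    from pathlib import Path
    from typing import List

    # Hypothetical typed contracts passed between stages (the real
    # classes in dita_etl will differ in detail).
    @dataclass
    class AssessReport:
        files: List[Path]

    @dataclass
    class PipelineOutput:
        map_path: Path

    # Pure-ish stage stands-ins; in dita_etl these would be built from
    # configuration and live in their own modules.
    def assess(assess_config_path: str, input_dir: str) -> AssessReport:
        return AssessReport(files=sorted(Path(input_dir).rglob("*")))

    def extract(report: AssessReport) -> List[str]:
        return [p.name for p in report.files if p.is_file()]

    def transform(raw: List[str], config_path: str) -> List[str]:
        return [name.upper() for name in raw]

    def load(topics: List[str], out_dir: Path) -> PipelineOutput:
        out_dir.mkdir(parents=True, exist_ok=True)  # filesystem setup
        map_path = out_dir / "root.ditamap"
        map_path.write_text("\n".join(topics), encoding="utf-8")
        return PipelineOutput(map_path=map_path)

    def run_pipeline(config_path: str = "config/config.yaml",
                     assess_config_path: str = "config/assess.yaml",
                     input_dir: str = "sample_data/input") -> PipelineOutput:
        """Imperative shell: compose Assess -> Extract -> Transform -> Load."""
        if not Path(input_dir).exists():
            raise FileNotFoundError(input_dir)
        report = assess(assess_config_path, input_dir)
        raw = extract(report)
        topics = transform(raw, config_path)
        return load(topics, Path(input_dir) / "_out")

The point of the sketch is the shape: each stage returns a typed value that the next stage consumes, and only the shell touches the filesystem and raises on missing inputs.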
run_pipeline(config_path='config/config.yaml', assess_config_path='config/assess.yaml', input_dir='sample_data/input')
Run the full ETL pipeline: Assess → Extract → Transform → Load.
:Example:

.. code-block:: python

    result = run_pipeline(
        config_path="config/config.yaml",
        assess_config_path="config/assess.yaml",
        input_dir="docs/",
    )
    print(f"Map written to: {result.map_path}")
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str | Path to the main pipeline YAML configuration file. | 'config/config.yaml' |
| assess_config_path | str | Path to the assessment YAML configuration file. | 'config/assess.yaml' |
| input_dir | str | Root directory containing source documents. | 'sample_data/input' |
Returns:

| Type | Description |
|---|---|
| PipelineOutput | Typed output contract of the run, exposing the paths of the generated artifacts (e.g. map_path). |
Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If config_path or input_dir does not exist. |
Source code in dita_etl/pipeline.py