Software Requirements Specification
Project: markdown-validator
Standard: IEEE 830-1998 (Software Requirements Specifications)
Version: 1.0 (covering v0.2 current and v0.3–v1.0 planned)
Date: 2026-03-20
Author: Matt Briggs
Table of Contents
- Introduction
- Overall Description
- Specific Requirements — Current (v0.2)
- Specific Requirements — Planned (v0.3–v1.0)
- Non-Functional Requirements
- Constraints and Assumptions
- Appendix — Terminology
1. Introduction
1.1 Purpose
This document specifies the requirements for markdown-validator, a rule-based linting
tool for Markdown documentation files used in static site generators. It is intended for:
- Developers implementing or extending the tool
- Technical writers authoring rule sets
- DevOps engineers integrating the tool into CI/CD pipelines
- Product stakeholders evaluating capabilities and planning releases
1.2 Scope
markdown-validator validates Markdown files against a declarative JSON rule set. It is
a command-line tool and Python library. It is not an editor plugin, auto-fixer, or
general-purpose static analysis framework.
In scope:
- YAML front-matter metadata validation
- Document body validation via XPath on rendered HTML
- Conditional rule chains (workflow mini-language)
- Batch directory scanning with file-level reporting
- Interactive REPL for rule development
Out of scope:
- Auto-correcting document content
- Validating non-Markdown file formats
- Language Server Protocol / real-time linting
- Cloud service or SaaS deployment
1.3 Definitions, Acronyms, and Abbreviations
| Term | Definition |
|---|---|
| Rule | A single assertion about a document's metadata or body |
| Rule set | A JSON file containing a collection of rules and workflows |
| Front matter | The YAML block delimited by --- at the start of a Markdown file |
| XPath | XML Path Language; used to query the rendered HTML DOM |
| Workflow | An ordered chain of rule evaluations with conditional branching |
| Operator | A comparison function applied to a query result and an expected value |
| Flag | Processing mode that determines how a query result is extracted |
| ScanReport | The aggregated output of validating one file against a rule set |
| DOM | Document Object Model; the tree representation of the rendered HTML |
| CI/CD | Continuous Integration / Continuous Delivery |
| REPL | Read–Eval–Print Loop; interactive command-line environment |
| Penn Treebank | Standard part-of-speech tag set used by NLTK |
1.4 References
- Architecture Overview
- Design Document
- Rules Reference
- CLI Reference
- Product Assessment
- Product Roadmap
- IEEE Std 830-1998, IEEE Recommended Practice for Software Requirements Specifications
1.5 Overview
Section 2 describes the product context and user characteristics. Sections 3 and 4 contain detailed functional requirements, organised by the current (v0.2) and planned (v0.3–v1.0) versions. Section 5 covers non-functional requirements. Section 6 documents constraints and assumptions. Section 7 is an appendix with a rule-set schema quick reference and a data flow summary.
2. Overall Description
2.1 Product Perspective
markdown-validator is a standalone CLI tool and Python library. It sits in a CI/CD
pipeline as a pre-merge gate, or is run locally by authors during document development.
graph LR
subgraph Author Workflow
A[Author edits .md] --> B[Runs md-validate locally]
B --> C{Passes?}
C -->|No| A
C -->|Yes| D[Opens PR]
end
subgraph CI Pipeline
D --> E[CI runs md-validate]
E --> F{Passes?}
F -->|No| G[PR blocked]
F -->|Yes| H[PR merges]
end
subgraph Rule Governance
I[Content lead maintains rule sets] --> E
I --> B
end
It does not depend on any external service. All validation runs locally against rule-set files that are version-controlled alongside the documentation repository.
2.2 Product Functions
At the highest level, the product does three things:
- Parse: Read a .md file and extract its YAML front matter and rendered HTML body.
- Evaluate: Apply each rule in a rule set to the parsed document; produce a pass/fail result per rule.
- Report: Aggregate results into a score and emit a human-readable or machine-readable report.
Workflows add a fourth function:
- Chain: Execute conditional multi-rule sequences where the outcome of one rule determines which rule runs next.
2.3 User Characteristics
| User type | Technical level | Primary interaction |
|---|---|---|
| Technical writer | Low–medium | Runs md-validate validate against their own files; reads text reports |
| Content lead / editor | Medium | Authors and maintains rule-set JSON files; uses the REPL |
| DevOps / platform engineer | High | Integrates the CLI into CI pipelines; consumes JSON/CSV reports |
| Python developer | High | Uses the Scanner Python API; extends operators or parsers |
2.4 Constraints
- Python 3.12 or later is required.
- YAML front matter (--- block) is mandatory; plain Markdown files without front matter are rejected.
- The markdown library renders Markdown to flat HTML; within-section XPath containment queries are not supported in v0.2 (planned for v0.4).
- NLTK corpora (punkt_tab, averaged_perceptron_tagger_eng) must be downloaded before POS/sentence rules can run.
2.5 Assumptions and Dependencies
- Rule-set JSON files are valid UTF-8.
- The tool is invoked by a user or CI agent with read access to the target .md files and the rule-set file.
- Output directories (for --output) are writable by the invoking process.
- The host system has network access at install time (for NLTK corpus download).
3. Specific Requirements — Current (v0.2)
3.1 Parsing requirements
PR-1 The parser shall accept any UTF-8 encoded .md file that begins with a YAML front-matter block delimited by ---.
PR-2 The parser shall extract all YAML key-value pairs from the front-matter block into a dictionary accessible as metadata.
PR-3 The parser shall render the Markdown body (the content after the closing ---) to HTML using the markdown library with the tables and fenced_code extensions enabled.
PR-4 The parser shall parse the rendered HTML into an lxml element tree using etree.HTMLParser, making the DOM available for XPath evaluation.
PR-5 The parser shall raise a ParseError (not silently fail) if the file does not contain a valid front-matter block.
PR-6 The parsed document shall be an immutable value object (ParsedDocument frozen dataclass); the parser shall not mutate it after construction.
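PR-1, PR-5, and PR-6 can be illustrated with a minimal sketch that uses only the standard library. The names `RawDocument` and `split_front_matter` are hypothetical, not part of the tool's API. YAML parsing and HTML rendering (PR-2 to PR-4) are left out, so the front matter stays as raw text here.

```python
from dataclasses import dataclass, FrozenInstanceError


class ParseError(Exception):
    """Raised when a file has no valid front-matter block (PR-5)."""


@dataclass(frozen=True)
class RawDocument:
    """Hypothetical stand-in for ParsedDocument; frozen per PR-6."""
    front_matter: str  # raw YAML text, not yet parsed
    body: str          # Markdown after the closing ---


def split_front_matter(text: str) -> RawDocument:
    """Split a Markdown string into front matter and body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ParseError("file does not start with a --- front-matter block")
    # Look for the closing delimiter after the opening one.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return RawDocument("\n".join(lines[1:i]), "\n".join(lines[i + 1:]))
    raise ParseError("front-matter block is not closed with ---")


doc = split_front_matter("---\ntitle: Demo\n---\n# Heading\n")
```

Because the dataclass is frozen, any later attempt to assign to `doc.body` raises `FrozenInstanceError`, which is one way to meet the immutability requirement.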
3.2 Rule loading requirements
RL-1 The loader shall accept a path to a JSON file conforming to the rule-set schema defined in Design — Contract Schemas.
RL-2 The loader shall coerce integer id values that are represented as strings in the JSON to int at load time, for backward compatibility.
RL-3 The loader shall accept workflow steps in both the canonical S-1,1-E format and the parenthesised (S,1)(1,E) format, normalising both to canonical form.
RL-4 The loader shall inject a type field ("header" or "body") from the section name when a rule object does not include a type field.
RL-5 The loader shall raise a ValidationError if the JSON file fails Pydantic schema validation.
RL-6 The loaded rule set shall be an immutable value object (RuleSetModel); the loader shall not mutate it after construction.
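The normalisation steps in RL-2 to RL-4 could look roughly like the sketch below. It works on plain dictionaries rather than the Pydantic models, and the helper names `normalise_steps` and `normalise_rule` are assumptions made for illustration.

```python
import re

# Matches one parenthesised step such as "(S,1)".
_PAREN_STEP = re.compile(r"\(\s*([^,()\s]+)\s*,\s*([^,()\s]+)\s*\)")


def normalise_steps(steps: str) -> str:
    """Return workflow steps in canonical S-1,1-E form (RL-3)."""
    pairs = _PAREN_STEP.findall(steps)
    if pairs:  # parenthesised form, e.g. "(S,1)(1,E)"
        return ",".join(f"{src}-{dst}" for src, dst in pairs)
    # Already canonical: only strip whitespace around tokens.
    return ",".join(token.strip() for token in steps.split(","))


def normalise_rule(rule: dict, section: str) -> dict:
    """Return a copy of a raw rule dict with RL-2 and RL-4 applied."""
    fixed = dict(rule)
    fixed["id"] = int(fixed["id"])     # RL-2: "7" becomes 7
    fixed.setdefault("type", section)  # RL-4: inherit the section name
    return fixed
```

Returning a copy instead of editing the input keeps the sketch in line with the no-mutation intent of RL-6.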
3.3 Evaluation requirements
EV-1 The evaluator shall apply each rule to the parsed document and return a ValidationResult that is either passed=True or passed=False, with a copy of the rule's mitigation string on failure.
EV-2 For header rules, the evaluator shall:
- Retrieve the metadata value at rule.query from ParsedDocument.metadata.
- Apply the flag processing mode (value, check, date, pattern) as specified in the Rules Reference.
- Apply the operator from OPERATOR_REGISTRY identified by rule.operation.
EV-3 For body rules, the evaluator shall:
- Execute rule.query as an XPath expression against ParsedDocument.html_tree.
- Apply the flag processing mode (count, text, dom, all) to extract a result string.
- Apply the operator from OPERATOR_REGISTRY.
EV-4 For operators that expect numeric operands (>, <, l, s), the evaluator shall coerce the extracted string to int before comparison.
EV-5 The evaluator shall return passed=False (not raise) when:
- A metadata key is absent and the flag is value or date
- An XPath expression matches zero nodes and the flag is text or dom
EV-6 The evaluator shall return passed=True when:
- The flag is check and the metadata key is present (regardless of value)
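As a rough illustration of EV-1, EV-5, and EV-6, the following sketch evaluates a header rule for the value and check flags only. The function name, the dictionary-based rule, and the two-field `ValidationResult` are simplifications assumed for this example. Treating an absent key under the check flag as a failure is also an assumption, since EV-6 only states the passing case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    passed: bool
    mitigation: str = ""


def evaluate_header(metadata: dict, rule: dict, operators: dict) -> ValidationResult:
    """Evaluate one header rule (value and check flags only)."""
    present = rule["query"] in metadata
    if rule["flag"] == "check":
        passed = present  # EV-6: presence is enough, the value is ignored
    elif not present:
        passed = False    # EV-5: fail without raising
    else:
        result = str(metadata[rule["query"]])
        passed = operators[rule["operation"]](result, rule["value"])
    # EV-1: the mitigation string is only carried on failure.
    return ValidationResult(passed, "" if passed else rule["mitigation"])
```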
3.4 Operator requirements
OP-1 The following operators shall be registered in OPERATOR_REGISTRY and behave as specified in the Rules Reference — Operators:
| Operator token | Behaviour |
|---|---|
| == | String equality after whitespace strip |
| != | String inequality after whitespace strip |
| > | Integer greater-than |
| < | Integer less-than |
| [] | Case-insensitive substring containment |
| [: | String starts-with |
| :] | String ends-with |
| r | Python re.search with DOTALL flag |
| l | len(result) < int(value) |
| s | NLTK sentence count <= int(value) |
| p<N> | Penn Treebank POS tag at word position N |
OP-2 Each operator shall be a pure function (result: str, value: str) -> bool with no side effects.
OP-3 A new operator shall be addable by: (a) defining one function in operators.py, (b) adding one entry to OPERATOR_REGISTRY. No other module shall require modification.
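The registry pattern described in OP-1 to OP-3 might be sketched as follows for a subset of the operators. The function names are hypothetical, and two details are assumptions: integer coercion is done inside the operator here for brevity (EV-4 assigns it to the evaluator), and the `r` operator treats `value` as the pattern and `result` as the text being searched.

```python
import re
from typing import Callable

Operator = Callable[[str, str], bool]


def op_equals(result: str, value: str) -> bool:
    return result.strip() == value.strip()


def op_contains(result: str, value: str) -> bool:
    return value.lower() in result.lower()


def op_greater(result: str, value: str) -> bool:
    return int(result) > int(value)


def op_regex(result: str, value: str) -> bool:
    # DOTALL lets "." match newlines in multi-line results.
    return re.search(value, result, re.DOTALL) is not None


def op_length(result: str, value: str) -> bool:
    return len(result) < int(value)


# Adding an operator means one new function plus one new entry here (OP-3).
OPERATOR_REGISTRY: dict[str, Operator] = {
    "==": op_equals,
    "[]": op_contains,
    ">": op_greater,
    "r": op_regex,
    "l": op_length,
}
```

Each function depends only on its two string arguments, which is what makes the operators pure in the sense of OP-2 and easy to unit test in isolation.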
3.5 Workflow requirements
WF-1 The workflow engine shall accept a workflow definition as a string of comma-separated <source>-<target> step tokens.
WF-2 The following twelve step patterns shall be recognised and dispatched:
| Pattern | Meaning |
|---|---|
| S-N | Start; load rule N as initial state |
| N-D | Rule N result becomes the decision point |
| M-D | Merge state becomes the decision |
| T-N | If decision True, load rule N |
| F-N | If decision False, load rule N |
| T-R | If decision True, reverse (negate) it |
| F-R | If decision False, reverse (negate) it |
| N-M | Rule N exits into merge state |
| M-N | Merge state exits to rule N |
| M-E | Merge state ends workflow |
| N-E | Rule N result ends workflow |
| N-N | Both rules must pass |
WF-3 The workflow engine shall return a WorkflowResult for each workflow, with passed set to the final boolean state of the workflow execution.
WF-4 The workflow engine shall not raise on an unknown step pattern; it shall emit a warning and treat the step as a no-op.
3.6 Scanning requirements
SC-1 Scanner.validate(file_path, rules_path) shall return a ScanReport containing:
- One ValidationResult per rule in the rule set
- One WorkflowResult per workflow in the rule set
- passed: bool — True if all Required-level rules passed
- score: int — count of passing rules
- total_rules: int — count of all rules
SC-2 Scanner.validate_directory(dir_path, rules_path) shall recursively find all .md files in dir_path and call validate() on each, returning a list of ScanReport.
SC-3 The scanner shall catch ParseError per file and record it as a failed scan for that file without aborting the directory scan.
3.7 CLI requirements
CLI-1 The md-validate validate TARGET --rules RULES command shall call Scanner.validate() or Scanner.validate_directory() and print the report to stdout.
CLI-2 The --format option shall accept text (default), json, and csv.
CLI-3 The --output DIR option shall write one report file per validated .md file into DIR.
CLI-4 The CLI shall exit with code 0 if all Required rules passed, and 1 if any Required rule failed.
CLI-5 The md-validate repl command shall start an interactive session supporting the commands: load, dump metadata, dump html, query, get, eval, quit.
3.8 Reporting requirements
RP-1 Text-format reports shall include, for each rule: rule ID, rule name, pass/fail status, and the mitigation message on failure.
RP-2 JSON-format reports shall be valid JSON objects containing all fields of ScanReport.
RP-3 CSV-format reports shall include one row per rule with columns for rule ID, name, type, level, and pass/fail status.
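For RP-2 and RP-3, the standard `csv` and `json` modules are probably sufficient. The sketch below assumes each rule result has already been flattened to a dictionary; the column keys and function names are illustrative, not taken from the tool.

```python
import csv
import io
import json

COLUMNS = ["id", "name", "type", "level", "passed"]


def to_csv(rows: list[dict]) -> str:
    """Render one CSV row per rule with the RP-3 columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(report: dict) -> str:
    """Render a report dictionary as a JSON object (RP-2)."""
    return json.dumps(report, indent=2)


rows = [{"id": 1, "name": "Has title", "type": "header",
         "level": "Required", "passed": True}]
print(to_csv(rows))
```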
4. Specific Requirements — Planned (v0.3–v1.0)
4.1 Negation operator (v0.3)
NEG-1 RuleModel shall include a negate: bool field, defaulting to False.
NEG-2 When negate is True, the evaluator shall flip the boolean result of the operator comparison before constructing ValidationResult.
NEG-3 Rule files that omit negate shall behave identically to the current behaviour (backward compatible).
NEG-4 The negate field shall be documented in the Rules Reference with at least two examples: absence check and inverted contains check.
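NEG-2 and NEG-3 amount to a single conditional flip applied after the operator runs. A minimal sketch, with `apply_negate` as a hypothetical helper name:

```python
def apply_negate(operator_result: bool, negate: bool = False) -> bool:
    """Flip the operator result when negate is set (NEG-2)."""
    return (not operator_result) if negate else operator_result


# Inverted contains check: the rule passes when the text is NOT found.
contains_draft = "draft" in "final release notes"
print(apply_negate(contains_draft, negate=True))
```

Because `negate` defaults to `False`, a rule that omits the field goes through the function unchanged, which is the backward-compatible behaviour NEG-3 asks for.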
4.2 Not-contains operator (v0.3)
NTC-1 A ![] operator shall be registered in OPERATOR_REGISTRY that returns True when the result does NOT contain the value (case-insensitive).
NTC-2 The ![] operator shall support comma-separated multi-values with the same semantics as [].
NTC-3 Existing rules using [] with workflow inversion shall be migrated to ![] in the reference rule sets.
4.3 Multi-result equality semantics (v0.3)
MRE-1 When an XPath expression returns more than one element and operation is ==, the evaluator shall use any(truth) aggregation (at least one element satisfies) rather than all(truth).
MRE-2 The change in aggregation semantics shall be documented in the Rules Reference under "Multi-element XPath results".
MRE-3 Rules that previously relied on all(truth) behaviour shall be audited and updated in the reference rule sets.
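The difference between the two aggregation modes in MRE-1 shows up as soon as an XPath query returns mixed results. The sketch below compares them on a plain list of strings standing in for the text of matched elements; both function names are hypothetical.

```python
def equals_all(texts: list[str], value: str) -> bool:
    """all(truth) aggregation: every matched element must equal value."""
    return all(t.strip() == value.strip() for t in texts)


def equals_any(texts: list[str], value: str) -> bool:
    """Planned v0.3 aggregation: at least one matched element must equal value."""
    return any(t.strip() == value.strip() for t in texts)


headings = ["Overview", "Next steps"]
print(equals_all(headings, "Overview"), equals_any(headings, "Overview"))
```

One edge case worth checking during the MRE-3 audit: on an empty list, `all` returns `True` while `any` returns `False`, so rules whose query matches nothing may change outcome under the new semantics.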
4.4 Parallel directory scanning (v0.3)
PDS-1 Scanner.validate_directory() shall accept an optional workers: int parameter (default 1).
PDS-2 When workers > 1, the scanner shall use concurrent.futures.ThreadPoolExecutor to process files in parallel.
PDS-3 The --workers N CLI option shall pass workers=N to validate_directory().
PDS-4 The order of results in the returned list shall be deterministic (sorted by file path) regardless of the value of workers.
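PDS-1, PDS-2, and PDS-4 can be met together by sorting the inputs first and relying on `Executor.map`, which yields results in input order rather than completion order. The function name and the `validate` callable are assumptions for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def validate_paths(
    paths: list[Path], validate: Callable[[Path], T], workers: int = 1
) -> list[T]:
    """Validate files, returning results ordered by file path (PDS-4)."""
    ordered = sorted(paths)
    if workers <= 1:
        return [validate(p) for p in ordered]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of `ordered`, whichever thread finishes first.
        return list(pool.map(validate, ordered))
```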
4.5 Structured-DOM parser (v0.4)
SDP-1 The parser shall offer an opt-in structured-DOM mode that wraps heading content in <section> elements, enabling within-section XPath queries.
SDP-2 In structured-DOM mode, all content between one heading and the next shall be wrapped in <section class="hN"> where N is the heading level.
SDP-3 The structured-DOM mode shall be activated by Parser(structured=True) or --structured on the CLI.
SDP-4 The default (unstructured) mode shall remain unchanged for backward compatibility.
SDP-5 The Rules Reference shall document which XPath patterns require structured mode.
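To make SDP-1 and SDP-2 concrete, here is one possible reading of the wrapping step. It uses the standard library's `xml.etree.ElementTree` instead of lxml, and it makes two assumptions the requirements leave open: each heading is placed inside its own section, and sections are siblings rather than nested by heading level.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])")


def wrap_sections(body: ET.Element) -> ET.Element:
    """Return a new body whose headings and following content sit in <section>s."""
    wrapped = ET.Element(body.tag)
    current = wrapped  # content before the first heading stays unwrapped
    for child in body:
        match = HEADING.fullmatch(child.tag)
        if match:
            # Start a new section named after the heading level.
            current = ET.SubElement(
                wrapped, "section", {"class": f"h{match.group(1)}"}
            )
        current.append(child)
    return wrapped


flat = ET.fromstring(
    "<body><h1>Title</h1><p>Intro</p><h2>Usage</h2><p>Run it</p></body>"
)
structured = wrap_sections(flat)
```

With this structure, a query such as `section[@class='h2']/p` selects only the paragraphs that follow an h2 heading, which is the kind of within-section query the flat DOM cannot express.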
4.6 YAML rule-set schema (v0.4)
YRS-1 The package shall export a JSON Schema document for rule-set files to docs/schema/ruleset.schema.json.
YRS-2 The schema shall be generated from the Pydantic RuleSetModel using model_json_schema().
YRS-3 The published schema shall be versioned alongside the package and updated on every release that changes the rule-set format.
4.7 Plugin system for custom operators (v1.0)
PLG-1 Third-party packages shall be able to register custom operators via the markdown_validator.operators entry point in their pyproject.toml.
PLG-2 Operator plugins shall follow the same interface as built-in operators: (result: str, value: str) -> bool.
PLG-3 The package shall document the plugin contract and provide a minimal example plugin in the repository.
4.8 Pre-built rule packs (v1.0)
PRP-1 The project shall publish at least one installable rule-pack package (markdown-validator-docfx) containing a ready-to-use rule set for DocFX documentation repositories.
PRP-2 Rule packs shall be versioned independently of the core tool.
PRP-3 Rule packs shall document every rule they contain: intent, query, operator, expected value, and mitigation.
5. Non-Functional Requirements
5.1 Performance
PERF-1 A single-file validation (parse + evaluate + report) shall complete in under 500 ms on a modern laptop for documents up to 10,000 words.
PERF-2 A 1,000-file directory scan shall complete in under 60 seconds (sequential mode) on a modern laptop.
PERF-3 Memory usage per file shall not exceed 50 MB; the scanner shall not retain parsed documents after reporting.
5.2 Reliability
REL-1 The test suite shall maintain ≥ 90% line coverage, measured by pytest-cov, as a CI gate.
REL-2 A ParseError on one file during a directory scan shall not abort the scan of remaining files.
REL-3 All Pydantic models shall be frozen; no mutable state shall cross layer boundaries.
5.3 Usability
USE-1 The CLI --help output shall list all commands, options, and exit codes.
USE-2 Failure reports shall include the rule's mitigation message verbatim, so the author knows exactly what to fix.
USE-3 The REPL shall print a help summary on startup listing all available commands.
5.4 Maintainability
MNT-1 Each layer (domain, infrastructure, services, CLI) shall have no imports from a layer above it in the dependency hierarchy.
MNT-2 Adding a new operator shall require changes to exactly two locations: the function definition and the registry entry in operators.py.
MNT-3 All public functions and classes shall have docstrings in Sphinx format.
5.5 Security
SEC-1 The tool shall not execute arbitrary code from rule-set files; XPath expressions are evaluated against the document DOM only.
SEC-2 The tool shall not make network requests during validation; all resources (NLTK corpora) shall be pre-downloaded at install time.
SEC-3 The tool shall not write files outside the directory specified by --output.
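SEC-3 can be enforced with a path containment check before any report is written. This is a sketch under the assumption that report names are derived from relative paths; `safe_output_path` is a hypothetical helper.

```python
from pathlib import Path


def safe_output_path(output_dir: Path, relative_name: str) -> Path:
    """Resolve a report path and refuse anything outside output_dir."""
    root = output_dir.resolve()
    candidate = (root / relative_name).resolve()
    # resolve() collapses ".." segments, so escapes become visible here.
    if not candidate.is_relative_to(root):
        raise ValueError(f"{candidate} is outside {root}")
    return candidate
```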
5.6 Portability
PRT-1 The tool shall run on Linux, macOS, and Windows with Python 3.12+.
PRT-2 The tool shall not depend on any platform-specific libraries beyond what is declared in pyproject.toml.
6. Constraints and Assumptions
6.1 Technical constraints
- Python 3.12 or later is required (uses match/case, tomllib, and modern type hints).
- Markdown files must use UTF-8 encoding.
- YAML front matter delimited by --- is mandatory; the tool is not designed for frontmatter-free Markdown.
- The markdown library (v3.x) is used for HTML rendering; its flat DOM is a known constraint until v0.4.
6.2 Organisational constraints
- The rule-set format is versioned by the package; breaking changes to the schema require a major version bump.
- CI integration is the primary deployment target; the tool must be installable in a fresh virtual environment with a single pip install.
6.3 Assumptions
- Rule-set authors have basic familiarity with XPath 1.0 for body rules.
- The documentation repository being validated follows a consistent front-matter structure.
- NLTK corpora are downloaded once per environment; they are not re-downloaded on every run.
7. Appendix — Terminology
Rule-set schema quick reference
{
  "rules": {
    "header": [
      {
        "id": 1,
        "name": "string",
        "type": "header",
        "query": "metadata-key",
        "flag": "value | check | date | pattern",
        "operation": "== | != | > | < | [] | [: | :] | r | l",
        "value": "expected-value",
        "level": "Required | Suggested",
        "mitigation": "string"
      }
    ],
    "body": [
      {
        "id": 2,
        "name": "string",
        "type": "body",
        "query": "/xpath/expression",
        "flag": "count | text | dom | all",
        "operation": "== | != | > | < | [] | r | l | s | p<N>",
        "value": "expected-value",
        "level": "Required | Suggested",
        "mitigation": "string"
      }
    ]
  },
  "workflows": [
    {
      "name": "string",
      "steps": "S-1,1-E",
      "level": "Required | Suggested",
      "fix": "string"
    }
  ]
}
Data flow summary
sequenceDiagram
participant CLI
participant Scanner
participant Parser
participant Loader
participant Evaluator
participant Reporter
CLI->>Scanner: validate(file, rules)
Scanner->>Parser: parse(file)
Parser-->>Scanner: ParsedDocument
Scanner->>Loader: load(rules)
Loader-->>Scanner: RuleSetModel
loop Each rule
Scanner->>Evaluator: evaluate_rule(rule, doc)
Evaluator-->>Scanner: ValidationResult
end
Scanner->>Reporter: write(ScanReport)
Reporter-->>CLI: report (text/json/csv)