Pipeline Internals
Pipeline modules are deterministic and side-effect free.
chunking.py
Responsibilities:
- Segment document text into stable chunk windows.
- Preserve global offsets (char_start, char_end) in each TextChunk.
- Attach optional previous context for continuity.
Key operational knobs from ExtractOptions:
- max_chunk_chars
- context_window_chars
plan_chunks(...) receives pass_id for chunk identity; pass iteration (max_passes) is controlled by SourceryEngine.
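As a rough illustration of the responsibilities above, the following sketch splits text into fixed-size windows, records global offsets, and attaches the preceding characters as context. It is not the module's actual implementation: the plan_chunks signature, the chunk_id format, the previous_context field name, and the integer pass_id are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextChunk:
    # Only char_start/char_end come from the docs; other fields are assumed.
    chunk_id: str
    text: str
    char_start: int  # global offset of the first character (inclusive)
    char_end: int    # global offset one past the last character (exclusive)
    previous_context: str = ""


def plan_chunks(text: str, pass_id: int, max_chunk_chars: int,
                context_window_chars: int = 0) -> List[TextChunk]:
    """Hypothetical signature: fixed-size windows with global offsets."""
    if max_chunk_chars <= 0:
        raise ValueError("max_chunk_chars must be positive")
    chunks = []
    for index, start in enumerate(range(0, len(text), max_chunk_chars)):
        end = min(start + max_chunk_chars, len(text))
        # Previous context is the tail of the text just before this chunk.
        context_start = max(0, start - context_window_chars)
        chunks.append(TextChunk(
            chunk_id=f"p{pass_id}-c{index}",  # assumed identity scheme
            text=text[start:end],
            char_start=start,
            char_end=end,
            previous_context=text[context_start:start],
        ))
    return chunks


doc = "The quick brown fox jumps over the lazy dog."
chunks = plan_chunks(doc, pass_id=1, max_chunk_chars=20, context_window_chars=5)
for chunk in chunks:
    # Global offsets always index back into the original document.
    assert doc[chunk.char_start:chunk.char_end] == chunk.text
```

Because the function depends only on its arguments, calling it twice with the same input yields identical chunks, which is the property that "stable chunk windows" relies on.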
prompt_compiler.py
Responsibilities:
- Build system/user prompt envelopes for runtime workers.
- Serialize entity schema and examples into a stable payload format.
- Inject pass/chunk context.
Output primitive:
PromptEnvelope(system, user, schema_payload)
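One way to obtain a stable payload format is to serialize with sorted keys and fixed separators, so that equal inputs always produce the same string. The sketch below assumes this approach; compile_prompt, its parameters, and the prompt wording are hypothetical, and only the three PromptEnvelope fields come from the description above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptEnvelope:
    system: str
    user: str
    schema_payload: str


def compile_prompt(schema: dict, examples: list, chunk_text: str,
                   pass_id: int, chunk_id: str) -> PromptEnvelope:
    """Hypothetical helper that builds an envelope for one chunk."""
    # sort_keys plus fixed separators make the payload identical for
    # equal inputs, regardless of dict insertion order.
    schema_payload = json.dumps(
        {"schema": schema, "examples": examples},
        sort_keys=True, separators=(",", ":"), ensure_ascii=False,
    )
    system = "Extract entities that match the provided schema."  # assumed wording
    user = (
        f"[pass={pass_id} chunk={chunk_id}]\n"  # injected pass/chunk context
        f"Schema and examples:\n{schema_payload}\n"
        f"Text:\n{chunk_text}"
    )
    return PromptEnvelope(system=system, user=user, schema_payload=schema_payload)


first = compile_prompt({"name": "person", "fields": ["name"]}, [],
                       "Ada met Bob.", pass_id=1, chunk_id="p1-c0")
second = compile_prompt({"fields": ["name"], "name": "person"}, [],
                        "Ada met Bob.", pass_id=1, chunk_id="p1-c0")
assert first == second  # key order in the input does not change the output
```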
aligner.py
Responsibilities:
- Map runtime ExtractionCandidate values back onto source text.
- Produce AlignedExtraction with offsets and alignment status.
Resolution strategy:
- Exact span match
- Fuzzy match (optional)
- Partial exact fallback (optional)
- Unresolved when match fails
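The resolution strategy can be sketched as a cascade that tries each step in turn. This is a minimal illustration, assuming the steps are attempted in the order listed; the align function, its flags and thresholds, the status labels, and the use of difflib for fuzzy scoring are assumptions rather than the module's actual behavior.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class AlignedExtraction:
    text: str                  # matched source text, or the candidate if unresolved
    char_start: Optional[int]
    char_end: Optional[int]
    status: str                # assumed labels: exact, fuzzy, partial, unresolved


def align(candidate_text: str, source: str, allow_fuzzy: bool = True,
          allow_partial: bool = True, fuzzy_threshold: float = 0.8,
          min_partial_chars: int = 5) -> AlignedExtraction:
    if not candidate_text:
        return AlignedExtraction(candidate_text, None, None, "unresolved")

    # 1. Exact span match.
    start = source.find(candidate_text)
    if start != -1:
        return AlignedExtraction(candidate_text, start,
                                 start + len(candidate_text), "exact")

    # 2. Fuzzy match: best-scoring window of the candidate's length.
    if allow_fuzzy:
        width = len(candidate_text)
        best_score, best_start = 0.0, -1
        for i in range(max(1, len(source) - width + 1)):
            score = SequenceMatcher(None, candidate_text,
                                    source[i:i + width]).ratio()
            if score > best_score:  # strict '>' keeps the earliest window on ties
                best_score, best_start = score, i
        if best_start >= 0 and best_score >= fuzzy_threshold:
            end = min(best_start + width, len(source))
            return AlignedExtraction(source[best_start:end], best_start, end, "fuzzy")

    # 3. Partial exact fallback: longest contiguous block shared with the source.
    if allow_partial:
        matcher = SequenceMatcher(None, candidate_text, source, autojunk=False)
        m = matcher.find_longest_match(0, len(candidate_text), 0, len(source))
        if m.size >= min_partial_chars:
            return AlignedExtraction(source[m.b:m.b + m.size],
                                     m.b, m.b + m.size, "partial")

    # 4. Nothing matched.
    return AlignedExtraction(candidate_text, None, None, "unresolved")


source = "Dr. Ada Lovelace wrote the first program in 1843."
exact = align("Ada Lovelace", source)
fuzzy = align("Ada Lovelase", source)
partial = align("Lovelace, Augusta Ada", source, allow_fuzzy=False)
missing = align("Grace Hopper", source, allow_fuzzy=False, allow_partial=False)
```

In this sketch the four calls resolve as exact, fuzzy, partial, and unresolved respectively. Disabling the optional steps narrows the cascade to exact matching only.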
merger.py
Responsibilities:
- Merge aligned extractions per document.
- Resolve overlaps deterministically.
Winner precedence favors stronger grounding and stable ordering to maintain reproducibility across runs.
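The exact precedence rules are not spelled out here, so the following is only one plausible reading: rank candidates by alignment status, break ties with span length and position, and keep a candidate only if it does not overlap an earlier winner. The STATUS_RANK table, the tie-break order, and the entity field are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List

# Assumed ranking: a lower number means stronger grounding.
STATUS_RANK = {"exact": 0, "fuzzy": 1, "partial": 2}


@dataclass(frozen=True)
class AlignedExtraction:
    entity: str
    text: str
    char_start: int
    char_end: int
    status: str


def merge(extractions: List[AlignedExtraction]) -> List[AlignedExtraction]:
    """Hypothetical merge: greedy selection under a total ordering."""
    # Extractions without a ranked status (e.g. unresolved) are skipped.
    grounded = [e for e in extractions if e.status in STATUS_RANK]

    # The key covers every field, so input order never affects the result.
    def priority(e: AlignedExtraction):
        return (STATUS_RANK[e.status], -(e.char_end - e.char_start),
                e.char_start, e.entity, e.text)

    winners: List[AlignedExtraction] = []
    for cand in sorted(grounded, key=priority):
        overlaps = any(cand.char_start < w.char_end and w.char_start < cand.char_end
                       for w in winners)
        if not overlaps:
            winners.append(cand)
    # Return winners in document order.
    return sorted(winners, key=lambda e: (e.char_start, e.char_end))


a = AlignedExtraction("person", "Ada Lovelace", 4, 16, "exact")
b = AlignedExtraction("person", "Ada Lovelac", 4, 15, "fuzzy")
c = AlignedExtraction("year", "1843", 44, 48, "exact")
merged = merge([b, c, a])
assert merged == merge([a, b, c])  # same result for any input order
```

Sorting on a key that includes every field is what makes the outcome reproducible: two runs that produce the same set of extractions in a different order still select the same winners.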
example_validator.py
Responsibilities:
- Validate that each example extraction is alignable in its example text.
- Emit issues with statuses (fuzzy, unresolved) when exact span matching fails.
- Raise ExampleValidationError when strict mode requires it.
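A minimal sketch of this validation flow follows. It assumes that strict mode raises on any non-exact example, that examples are (text, extractions) pairs, and that fuzzy scoring uses difflib with a 0.8 threshold; the ExampleIssue type, the validate_examples signature, and the ValueError base class are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


class ExampleValidationError(ValueError):
    """Raised in strict mode; the base class is an assumption."""


@dataclass(frozen=True)
class ExampleIssue:
    example_index: int
    extraction_text: str
    status: str  # "fuzzy" or "unresolved"


def _status(extraction: str, text: str, threshold: float = 0.8) -> str:
    """Classify one extraction as exact, fuzzy, or unresolved."""
    if not extraction:
        return "unresolved"
    if extraction in text:
        return "exact"
    width = len(extraction)
    best = 0.0
    # Score every window of the extraction's length against the example text.
    for i in range(max(1, len(text) - width + 1)):
        best = max(best, SequenceMatcher(None, extraction, text[i:i + width]).ratio())
    return "fuzzy" if best >= threshold else "unresolved"


def validate_examples(examples: List[Tuple[str, List[str]]],
                      strict: bool = False) -> List[ExampleIssue]:
    issues: List[ExampleIssue] = []
    for index, (text, extractions) in enumerate(examples):
        for extraction in extractions:
            status = _status(extraction, text)
            if status != "exact":
                issues.append(ExampleIssue(index, extraction, status))
    if strict and issues:
        raise ExampleValidationError(
            f"{len(issues)} example extraction(s) are not exactly alignable")
    return issues


examples = [("Ada Lovelace was born in 1815.",
             ["Ada Lovelace", "Ada Lovelase", "Grace Hopper"])]
issues = validate_examples(examples)
```

Here the misspelled extraction is reported as fuzzy and the absent one as unresolved, while the exact match produces no issue. Passing strict=True turns the same findings into an ExampleValidationError.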