Sourcery Usage Guide#

What Sourcery Is#

Sourcery is both:

A Python library you import (import sourcery) to run schema-first extraction.
A reference project with ingestion adapters, an HTML reviewer UI, and runnable integration scripts.

Use it as a library inside your app, and use this repository as a production template.

When To Use Sourcery#

Use Sourcery when you need:

typed extraction contracts (Pydantic models),
grounded spans (char_start, char_end) for every extraction,
deterministic chunking/alignment/merge behavior,
optional document-level reconciliation into canonical claims,
human review/export workflows.
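
To make "grounded spans" concrete, here is a minimal sketch of the invariant a consumer could check: slicing the source text with an extraction's offsets should reproduce the extracted text. The `Span` dataclass and `is_grounded` helper are hypothetical stand-ins (not Sourcery types), and the half-open `[char_start, char_end)` convention is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an aligned extraction (not a Sourcery type)."""
    text: str
    char_start: int
    char_end: int  # assumed exclusive, like a Python slice


def is_grounded(source: str, span: Span) -> bool:
    """Return True when the offsets point at exactly the extracted text."""
    if not (0 <= span.char_start <= span.char_end <= len(source)):
        return False
    return source[span.char_start:span.char_end] == span.text


source = "Alice is the CEO of Acme."
good = Span(text="Alice", char_start=0, char_end=5)
bad = Span(text="Alice", char_start=1, char_end=6)  # offsets shifted by one
print(is_grounded(source, good), is_grounded(source, bad))  # True False
```

A check like this is cheap enough to run on every extraction before export.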

Install#

Python requirement: >=3.12

Minimal runtime:

uv sync

With ingestion adapters (PDF/OCR/URL HTML):

uv sync --extra ingest

With dev tooling:

uv sync --extra dev --extra ingest

Set the provider credentials for the model route you pass in RuntimeConfig.model, typically as environment variables such as DEEPSEEK_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY.
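
A small pre-flight check can catch a missing key before a run starts. The sketch below is illustrative only: the variable names are the ones listed above, but the `PROVIDER_ENV_VARS` mapping, the `missing_credential` helper, and the assumption that the provider is the part of the model route before the first "/" are hypothetical and not part of Sourcery.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping from route prefix to environment variable.
# Assumes model routes look like "provider/model-name".
PROVIDER_ENV_VARS = {
    "deepseek": "DEEPSEEK_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_credential(model: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for `model`, or None if none is known to be missing."""
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0].lower()
    var = PROVIDER_ENV_VARS.get(provider)
    if var is None:
        return None  # unknown provider: nothing this sketch can check
    return None if env.get(var) else var


print(missing_credential("deepseek/deepseek-chat", env={}))  # DEEPSEEK_API_KEY
```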

Core Public API#

Import-level API (sourcery/__init__.py):

extract(request: ExtractRequest, engine: SourceryEngine | None = None) -> ExtractResult
aextract(request: ExtractRequest, engine: SourceryEngine | None = None) -> ExtractResult (async; call with await)
extract_from_sources(sources, *, task, runtime, options=None, engine=None) -> ExtractResult
aextract_from_sources(...) -> ExtractResult (async; call with await)
SourceryEngine with .extract(...), .aextract(...), .replay_run(...)

Data Contracts You Define#

1) `EntitySpec`#

name: str
attributes_model: type[BaseModel]

2) `EntitySchemaSet`#

entities: list[EntitySpec]

3) `ExtractionTask`#

instructions: str
schema: EntitySchemaSet
examples: list[ExtractionExample]
strict_example_alignment: bool = True

4) `ExtractRequest`#

documents: list[SourceDocument] | str
task: ExtractionTask
options: ExtractOptions = ExtractOptions()
runtime: RuntimeConfig

5) `ExtractResult`#

documents: list[DocumentResult]
run_trace: ExtractionRunTrace
metrics: RunMetrics
warnings: list[str]

DocumentResult includes:

extractions: list[AlignedExtraction]
canonical_claims: list[CanonicalClaim]

Runtime Config (`RuntimeConfig`)#

Required:

model: str

Core options:

temperature: float = 0.0
max_tokens: int | None = None
stream: bool = False
storage_dir: str = ".sourcery"
respect_context_window: bool = True

Reliability:

retry: RetryPolicy
max_attempts=3
initial_backoff_seconds=0.75
max_backoff_seconds=8.0
backoff_multiplier=2.0
retry_on_rate_limit=True
retry_on_transient_errors=True
auto_resume_paused_runs=True
max_pause_resumes=5
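
To get a feel for what the default retry settings imply, the sketch below computes the waits between attempts. It assumes plain capped exponential backoff without jitter, which is an assumption about the policy rather than a description of Sourcery's implementation; `backoff_schedule` is a hypothetical helper.

```python
def backoff_schedule(
    max_attempts: int = 3,
    initial_backoff_seconds: float = 0.75,
    max_backoff_seconds: float = 8.0,
    backoff_multiplier: float = 2.0,
) -> list[float]:
    """Waits between attempts, assuming capped exponential backoff with no jitter."""
    waits = []
    delay = initial_backoff_seconds
    # N attempts leave N - 1 gaps in which to wait.
    for _ in range(max(max_attempts - 1, 0)):
        waits.append(min(delay, max_backoff_seconds))
        delay *= backoff_multiplier
    return waits


print(backoff_schedule())                # [0.75, 1.5]
print(backoff_schedule(max_attempts=6))  # [0.75, 1.5, 3.0, 6.0, 8.0]
```

Under these assumptions the defaults add at most 2.25 seconds of waiting per failing call.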

Session refinement (optional):

session_refinement: SessionRefinementConfig
enabled=False
max_turns=1
context_chars=320

Document-level reconciliation (optional):

reconciliation: ReconciliationConfig
enabled=False
use_workforce=True
min_mentions_for_claim=1
max_claims=200

Extraction Options (`ExtractOptions`)#

max_chunk_chars=1200
context_window_chars=200
max_passes=2
batch_concurrency=16
enable_fuzzy_alignment=True
fuzzy_alignment_threshold=0.82
accept_partial_exact=False
stop_when_no_new_extractions=True
allow_unresolved=False
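
The interaction between max_chunk_chars and context_window_chars can be pictured with a small sketch. It assumes fixed-size character chunks, each carrying a slice of the preceding text as context, and shows how chunk-local offsets map back to document offsets. `Chunk`, `chunk_text`, and `to_document_offsets` are hypothetical; the real chunker may split on different boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record (not a Sourcery type)."""
    start: int    # offset of the chunk body in the full document
    body: str     # text the model is asked to extract from
    context: str  # preceding text, shown only as a hint


def chunk_text(text: str, max_chunk_chars: int = 1200, context_window_chars: int = 200) -> list[Chunk]:
    """Split text into fixed-size bodies, each with trailing context from the previous text."""
    chunks = []
    for start in range(0, len(text), max_chunk_chars):
        body = text[start:start + max_chunk_chars]
        context = text[max(0, start - context_window_chars):start]
        chunks.append(Chunk(start=start, body=body, context=context))
    return chunks


def to_document_offsets(chunk: Chunk, local_start: int, local_end: int) -> tuple[int, int]:
    """Translate chunk-local offsets into document-level offsets."""
    return chunk.start + local_start, chunk.start + local_end


doc = "x" * 2500 + "Alice"
chunks = chunk_text(doc)
last = chunks[-1]
local = last.body.find("Alice")
print(len(chunks), to_document_offsets(last, local, local + 5))  # 3 (2500, 2505)
```

Because the split depends only on the text and the two options, the same input always yields the same chunks, which is what makes offsets reproducible across runs.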

Minimal Example (Inline Text)#

from pydantic import BaseModel
import sourcery
from sourcery.contracts import (
    EntitySchemaSet,
    EntitySpec,
    ExtractRequest,
    ExtractionExample,
    ExtractionTask,
    ExampleExtraction,
    RuntimeConfig,
)

class PersonAttrs(BaseModel):
    role: str | None = None

request = ExtractRequest(
    documents="Alice is the CEO of Acme.",
    task=ExtractionTask(
        instructions="Extract person entities.",
        schema=EntitySchemaSet(
            entities=[EntitySpec(name="person", attributes_model=PersonAttrs)]
        ),
        examples=[
            ExtractionExample(
                text="Bob is the CTO.",
                extractions=[
                    ExampleExtraction(entity="person", text="Bob", attributes={"role": "CTO"})
                ],
            )
        ],
    ),
    runtime=RuntimeConfig(model="deepseek/deepseek-chat"),
)

result = sourcery.extract(request)
print(result.metrics.model_dump(mode="json"))
for ext in result.documents[0].extractions:
    print(ext.entity, ext.text, ext.char_start, ext.char_end, ext.alignment_status)

Notebook equivalent: examples/notebooks/sourcery_quickstart.ipynb
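
The alignment_status printed above reflects how an extracted string was located in the source. As a rough illustration of the idea behind enable_fuzzy_alignment and fuzzy_alignment_threshold, the sketch below tries an exact match first and falls back to a sliding-window similarity score. It uses difflib from the standard library as a stand-in; Sourcery's actual alignment algorithm and status values are not described here, so the `align` helper and its "exact"/"fuzzy" labels are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional


def align(source: str, extracted: str, threshold: float = 0.82) -> Optional[tuple[int, int, str]]:
    """Return (char_start, char_end, status), or None when nothing clears the threshold."""
    if not extracted:
        return None
    exact = source.find(extracted)
    if exact != -1:
        return exact, exact + len(extracted), "exact"

    # Fallback: score every window of the same width and keep the best one.
    width = len(extracted)
    best_score, best_start = 0.0, -1
    for start in range(0, max(len(source) - width, 0) + 1):
        window = source[start:start + width]
        score = SequenceMatcher(None, window, extracted).ratio()
        if score > best_score:
            best_score, best_start = score, start
    if best_score >= threshold:
        return best_start, best_start + width, "fuzzy"
    return None


source = "Alice Johnson is the CEO of Acme."
print(align(source, "Alice Johnson"))  # exact match at the start of the text
print(align(source, "Alice Jonson"))   # misspelled, recovered by the fuzzy fallback
print(align(source, "Bob Smith"))      # no window is similar enough
```

The quadratic scan is fine for chunk-sized text; a production aligner would likely need something faster.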

Extract From Files / PDFs / URLs / Images#

Use the source-based helper:

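import sourcery
from sourcery.contracts import RuntimeConfig

# task is an ExtractionTask, built as in the minimal example above.
result = sourcery.extract_from_sources(
    ["1706.03762v7.pdf", "https://example.com/article.html"],
    task=task,
    runtime=RuntimeConfig(model="deepseek/deepseek-chat"),
)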

Supported inputs, loaded via load_source_document / load_source_documents:

Inline text
Text files
PDF files (pypdf)
HTML files / raw HTML
URLs
OCR image files (Pillow + pytesseract)

Notes:

The PDF loader is text-extraction first (pypdf).
OCR currently means image-based ingestion (Pillow + pytesseract); it is not multimodal LLM extraction.

Notebook equivalent: examples/notebooks/sourcery_pdf_workflow.ipynb

Async Usage#

result = await sourcery.aextract(request)
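
The line above must run inside a coroutine. The sketch below shows one way to drive several awaited calls from a script with bounded concurrency. Since a real call needs provider credentials, `fake_aextract` is a hypothetical stand-in for `sourcery.aextract`, and `run_all` and its `limit` parameter are illustrative names.

```python
import asyncio


async def fake_aextract(request: str) -> str:
    """Stand-in for `await sourcery.aextract(request)`; it only simulates latency."""
    await asyncio.sleep(0.01)
    return f"result for {request}"


async def run_all(requests: list[str], limit: int = 4) -> list[str]:
    """Await many extraction calls with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def one(request: str) -> str:
        async with gate:
            return await fake_aextract(request)

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(one(r) for r in requests))


results = asyncio.run(run_all(["doc-a", "doc-b", "doc-c"]))
print(results)
```

In a notebook, where an event loop is already running, use `await run_all(...)` directly instead of `asyncio.run`.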

Advanced Engine Usage#

Use SourceryEngine directly when you need replay_run. Here request is an ExtractRequest, built as in the minimal example:

from sourcery.runtime import SourceryEngine

engine = SourceryEngine()
result = engine.extract(request)

raw_run_id = result.documents[0].extractions[0].provenance.raw_run_id
if raw_run_id:
    replay, events = engine.replay_run(request, raw_run_id)
    print(replay["status"] if replay else None, len(events))

Session refinement and reconciliation are both enabled through RuntimeConfig:

runtime = RuntimeConfig(
    model="deepseek/deepseek-chat",
    session_refinement={"enabled": True, "max_turns": 1, "context_chars": 320},
    reconciliation={"enabled": True, "use_workforce": True, "max_claims": 100},
)

What this does:

Session refinement adds multi-turn continuity hints per chunk.
Reconciliation runs a document-level resolver workflow and populates canonical_claims on each DocumentResult.
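
As a loose illustration of what document-level reconciliation produces, the sketch below groups mentions under a normalized key, drops groups below a mention threshold, and caps the result. It is not Sourcery's resolver: it only normalizes case and whitespace (no alias or coreference handling), and the `reconcile` helper, its dict output shape, and its reading of min_mentions_for_claim and max_claims are assumptions.

```python
from collections import defaultdict


def reconcile(mentions, min_mentions_for_claim=1, max_claims=200):
    """Group (entity, text, char_start) mentions into claim-like records."""
    groups = defaultdict(list)
    for entity, text, char_start in mentions:
        # Normalize case and whitespace so trivial variants share one key.
        key = (entity, " ".join(text.lower().split()))
        groups[key].append(char_start)

    claims = [
        {"entity": entity, "canonical": name, "mention_starts": sorted(starts)}
        for (entity, name), starts in groups.items()
        if len(starts) >= min_mentions_for_claim
    ]
    # Most-mentioned first; the stable sort keeps first-seen order on ties.
    claims.sort(key=lambda claim: -len(claim["mention_starts"]))
    return claims[:max_claims]


mentions = [("person", "Alice", 0), ("person", "alice", 40), ("person", "Bob", 60)]
print(reconcile(mentions, min_mentions_for_claim=2))
```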

Outputs and Review#

Save JSONL#

from sourcery.io import save_extract_result_jsonl
save_extract_result_jsonl(result, "output/result.jsonl")
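
JSONL stores one JSON value per line, so a saved file can be read back with the standard library alone. The sketch below is a generic reader; `read_jsonl` is a hypothetical helper, and the sample records are made up for the demo and do not reflect the schema that save_extract_result_jsonl writes.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


# Demo on a throwaway file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"entity": "person", "text": "Alice"}\n\n{"entity": "person", "text": "Bob"}\n',
        encoding="utf-8",
    )
    records = list(read_jsonl(sample))

print(len(records), records[0]["text"])  # 2 Alice
```

For a real run, point it at the saved file, for example `list(read_jsonl("output/result.jsonl"))`.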

Generate HTML viewer#

from sourcery.io import write_document_html
write_document_html(result.documents[0], "output/document.viewer.html")

Generate reviewer UI#

from sourcery.io import write_reviewer_html
write_reviewer_html(result.documents[0], "output/document.reviewer.html")

The reviewer UI supports:

search,
entity/status filters,
approve/reject/reset,
export approved JSONL/CSV.

Scripted End-to-End Runs#

Benchmark comparison wrapper#

uv run benchmark_compare.py --text-types english

Error Model#

Important exception classes (sourcery/exceptions.py):

SourceryError
SourceryRuntimeError
SourceryProviderError
SourceryRateLimitError
SourceryRetryExhaustedError
SourceryPausedRunError
SourceryPipelineError
SourceryIngestionError
SourceryDependencyError
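
One way to use these classes is to catch the specific ones first and fall back to the base class. The sketch below defines stand-in classes so it runs without the library installed; in real code you would import them from sourcery.exceptions. It assumes every class derives from SourceryError, which the naming suggests but this guide does not state. `run_with_handling` and its status labels are hypothetical.

```python
# Stand-ins for the classes in sourcery/exceptions.py (hierarchy is assumed).
class SourceryError(Exception):
    pass


class SourceryRateLimitError(SourceryError):
    pass


class SourceryRetryExhaustedError(SourceryError):
    pass


class SourceryIngestionError(SourceryError):
    pass


def run_with_handling(call):
    """Run `call` and return (status, value_or_exception)."""
    try:
        return "ok", call()
    except SourceryRateLimitError as exc:
        return "rate_limited", exc      # back off and try again later
    except SourceryRetryExhaustedError as exc:
        return "retries_exhausted", exc  # give up and alert
    except SourceryIngestionError as exc:
        return "bad_source", exc        # skip this input, keep the batch going
    except SourceryError as exc:
        return "failed", exc            # any other library error


def flaky():
    raise SourceryRateLimitError("429 from provider")


print(run_with_handling(flaky)[0])    # rate_limited
print(run_with_handling(lambda: 42))  # ('ok', 42)
```

Exceptions that are not Sourcery errors propagate unchanged, so programming bugs are not silently swallowed.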

Validation Commands#

uv run --extra dev pytest -q
uv run --extra dev ruff check sourcery tests
uv run --extra dev mypy sourcery

Production Notes#

Treat schemas as API contracts and version them.
Start with strict examples and deterministic options.
Enable reconciliation for long documents where alias and coreference resolution matter.
Keep reviewer approval in the loop for high-stakes workflows.
Persist the JSONL output and the run trace for audit and replay.