Sourcery: Schema-First LLM Document Extraction for Python#
Sourcery converts unstructured text, PDFs, HTML, URLs, and image OCR output into typed, source-grounded Pydantic data.
If you can define your target entities as Pydantic models, you can run reproducible LLM extraction pipelines with deterministic chunking, source-span alignment, retry policy, async execution, streaming chunk events, and reviewable outputs.
What You Build With It#
- Typed entity extraction with strict schema validation.
- Character-grounded spans (char_start, char_end) for each extraction.
- Deterministic chunking, alignment, and merge behavior.
- Optional document-level reconciliation into canonical claims.
- Native async extraction and chunk-level streaming events.
- JSONL + HTML outputs for downstream systems and human review.
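To make the ideas of deterministic chunking and character-grounded spans concrete, here is a minimal standard-library sketch. It splits a document into fixed-size overlapping windows, finds an extracted string inside each window, maps the hit back to document offsets, and merges duplicates found in overlapping windows. The Span class, the chunk and ground functions, the window sizes, and the end-exclusive char_end convention are all assumptions for illustration; Sourcery's own chunker and aligner may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    char_start: int  # inclusive offset into the full document (assumed convention)
    char_end: int    # exclusive offset into the full document (assumed convention)


def chunk(document: str, size: int, overlap: int) -> list[tuple[int, str]]:
    """Split into fixed-size windows; the same input always yields the same chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [(start, document[start:start + size]) for start in range(0, len(document), step)]


def ground(document: str, size: int, overlap: int, needle: str) -> list[Span]:
    """Locate an extracted string in each chunk and map it to document offsets."""
    found: set[int] = set()  # a set merges hits seen in overlapping windows
    for offset, text in chunk(document, size, overlap):
        local = text.find(needle)
        while local != -1:
            found.add(offset + local)
            local = text.find(needle, local + 1)
    return [Span(needle, start, start + len(needle)) for start in sorted(found)]


doc = "Alice is CTO at Acme. Bob is CEO."
spans = ground(doc, size=16, overlap=8, needle="Bob")
# Slicing the document with a span's offsets returns the extracted text.
assert doc[spans[0].char_start:spans[0].char_end] == "Bob"
```

Because every span carries offsets into the original document, a reviewer can check an extraction by slicing the source text rather than trusting the model's wording.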
Core Boundaries#
- Contracts: sourcery/contracts defines all request/result primitives.
- Pipeline: sourcery/pipeline handles chunking, prompt envelopes, alignment, merge.
- Runtime: sourcery/runtime executes model calls and orchestration.
- IO + Review: sourcery/io persists and visualizes extraction results.
Start Here#
- Install and configure credentials: Getting Started / Installation
- Run first extraction in <5 minutes: Getting Started / Quickstart
- Build from mixed sources (PDF/URL/text): Guides / Build A Pipeline
- Tune reliability and throughput: Guides / Runtime And Tuning
Minimal End-to-End Path#
uv sync --extra ingest
export DEEPSEEK_API_KEY="..."
uv run python - <<'PY'
from pydantic import BaseModel
import sourcery
from sourcery.contracts import EntitySchemaSet, EntitySpec, ExtractRequest, ExtractionExample, ExtractionTask, ExampleExtraction, RuntimeConfig

class PersonAttrs(BaseModel):
    role: str | None = None

request = ExtractRequest(
    documents="Alice is CTO at Acme.",
    task=ExtractionTask(
        instructions="Extract people and role if present.",
        schema=EntitySchemaSet(entities=[EntitySpec(name="person", attributes_model=PersonAttrs)]),
        examples=[
            ExtractionExample(
                text="Bob is CEO.",
                extractions=[ExampleExtraction(entity="person", text="Bob", attributes={"role": "CEO"})],
            )
        ],
    ),
    runtime=RuntimeConfig(model="deepseek/deepseek-chat"),
)
result = sourcery.extract(request)
print(result.metrics.model_dump(mode="json"))
PY
If you do not use DeepSeek, set the provider key required by your selected model route.
Documentation Build#
Serve locally:
uv run --extra docs mkdocs serve
Build static docs with strict validation:
uv run --extra docs mkdocs build --strict