Sourcery
Sourcery is a schema-first extraction framework for turning unstructured text, files, URLs, and documents into typed, grounded entities.
If you can define your target entities as Pydantic models, you can run reproducible extraction pipelines with traceability, retry policy, and reviewable outputs.
What You Build With It
- Typed entity extraction with strict schema validation.
- Character-grounded spans (char_start, char_end) for each extraction.
- Deterministic chunking, alignment, and merge behavior.
- JSONL + HTML outputs for downstream systems and human review.
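To make "character-grounded" concrete, here is a minimal sketch using only the standard library. It is not Sourcery's actual result type: the `GroundedSpan` record, the `is_grounded` and `to_jsonl` helpers, and the choice of an exclusive `char_end` are all assumptions made for illustration. Only the `char_start` and `char_end` field names come from this page.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GroundedSpan:
    """Hypothetical record; Sourcery's real result types may differ."""

    entity: str
    text: str
    char_start: int  # assumed inclusive offset into the source text
    char_end: int    # assumed exclusive offset into the source text


def is_grounded(span: GroundedSpan, source: str) -> bool:
    """Return True when the offsets point at exactly the extracted text."""
    in_bounds = 0 <= span.char_start <= span.char_end <= len(source)
    return in_bounds and source[span.char_start:span.char_end] == span.text


def to_jsonl(spans: list[GroundedSpan]) -> str:
    """Serialize one span per line, following the JSONL convention."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in spans)


source = "Alice is CTO at Acme."
span = GroundedSpan(entity="person", text="Alice", char_start=0, char_end=5)

print(is_grounded(span, source))
print(to_jsonl([span]))
```

A check like `is_grounded` is what makes an extraction reviewable: a reader or a downstream system can slice the original text with the stored offsets and confirm that the entity was not invented.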
Core Boundaries
- Contracts: sourcery/contracts defines all request/result primitives.
- Pipeline: sourcery/pipeline handles chunking, prompt envelopes, alignment, merge.
- Runtime: sourcery/runtime executes model calls and orchestration.
- IO + Review: sourcery/io persists and visualizes extraction results.
Start Here
- Install and configure credentials: Getting Started / Installation
- Run your first extraction in under 5 minutes: Getting Started / Quickstart
- Build from mixed sources (PDF/URL/text): Guides / Build A Pipeline
- Tune reliability and throughput: Guides / Runtime And Tuning
Minimal End-to-End Path
uv sync --extra ingest
export DEEPSEEK_API_KEY="..."
uv run python - <<'PY'
from pydantic import BaseModel

import sourcery
from sourcery.contracts import (
    EntitySchemaSet,
    EntitySpec,
    ExampleExtraction,
    ExtractionExample,
    ExtractionTask,
    ExtractRequest,
    RuntimeConfig,
)


# Attributes attached to each extracted "person" entity.
class PersonAttrs(BaseModel):
    role: str | None = None


request = ExtractRequest(
    documents="Alice is CTO at Acme.",
    task=ExtractionTask(
        instructions="Extract people and role if present.",
        schema=EntitySchemaSet(entities=[EntitySpec(name="person", attributes_model=PersonAttrs)]),
        examples=[
            ExtractionExample(
                text="Bob is CEO.",
                extractions=[ExampleExtraction(entity="person", text="Bob", attributes={"role": "CEO"})],
            )
        ],
    ),
    runtime=RuntimeConfig(model="deepseek/deepseek-chat"),
)

result = sourcery.extract(request)
print(result.metrics.model_dump(mode="json"))
PY
If you do not use DeepSeek, set the provider key required by your selected model route.
Documentation Build
Serve locally:
uv run --extra docs mkdocs serve
Build static docs with strict validation:
uv run --extra docs mkdocs build --strict