Sourcery Public API Reference

Top-level exports are defined in sourcery/__init__.py.

Core Functions

extract(request, engine=None) -> ExtractResult
aextract(request, engine=None) -> ExtractResult
extract_from_sources(sources, *, task, runtime, options=None, engine=None) -> ExtractResult
aextract_from_sources(sources, *, task, runtime, options=None, engine=None) -> ExtractResult
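
Each synchronous function above has an `a`-prefixed counterpart. The sketch below assumes those counterparts are awaitable coroutines (the listing does not state this explicitly) and shows how several requests could be run concurrently. `fake_aextract` and `extract_many` are hypothetical stand-ins so the snippet runs without Sourcery installed; they are not part of the library.

```python
import asyncio


async def fake_aextract(request: str) -> dict:
    """Hypothetical stand-in for an awaitable such as sourcery.aextract."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"request": request}


async def extract_many(requests: list[str]) -> list[dict]:
    """Run all requests concurrently; gather keeps results in input order."""
    return await asyncio.gather(*(fake_aextract(r) for r in requests))


results = asyncio.run(extract_many(["doc-1", "doc-2"]))
```

If the assumption holds, replacing the stand-in with the real call should leave the surrounding `asyncio.gather` pattern unchanged.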

Runtime Entry Points

SourceryEngine.extract(request) -> ExtractResult
SourceryEngine.aextract(request) -> ExtractResult
SourceryEngine.extract_stream(request) -> Generator[StreamEvent, None, ExtractResult]
SourceryEngine.replay_run(request, raw_run_id) -> tuple[dict[str, object] | None, list[EventRecord]]
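
The `Generator[StreamEvent, None, ExtractResult]` annotation on `extract_stream` means the generator yields events and then returns the final result. In Python, a plain `for` loop discards that return value; it is only available from `StopIteration.value` (or via `yield from`). The sketch below demonstrates this with a hypothetical stand-in generator, `fake_extract_stream`, and a hypothetical helper, `drain`; neither is part of Sourcery.

```python
from typing import Generator


def fake_extract_stream(chunks: list[str]) -> Generator[str, None, dict]:
    """Hypothetical stand-in with the same yield/return shape as extract_stream."""
    for index, _chunk in enumerate(chunks):
        yield f"chunk_done:{index}"  # plays the role of a StreamEvent
    return {"chunks": len(chunks)}  # plays the role of the ExtractResult


def drain(stream: Generator[str, None, dict]) -> tuple[list[str], dict]:
    """Collect every yielded event and capture the generator's return value."""
    events = []
    while True:
        try:
            events.append(next(stream))
        except StopIteration as stop:
            # The value passed to `return` inside the generator lands here.
            return events, stop.value


events, result = drain(fake_extract_stream(["a", "b"]))
```

The same `drain` pattern should apply to any generator with this annotation, including one returned by `SourceryEngine.extract_stream(request)`.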

Top-Level Contract Exports (import sourcery)

  • EntitySpec, EntitySchemaSet
  • ExtractionTask, ExtractionExample, ExampleExtraction
  • ExtractRequest, ExtractOptions, ExtractResult
  • RuntimeConfig, RetryPolicy, SessionRefinementConfig, ReconciliationConfig
  • AlignedExtraction, CanonicalClaim, DocumentResult, DocumentReconciliationReport
  • ExtractionRunTrace, RunMetrics, SourceDocument

Additional contracts from sourcery.contracts:

  • StreamExtractionAdded, StreamChunkDone, StreamPassDone
  • EngineDependencies

Ingestion Exports

From sourcery.ingest:

  • load_source_document(...)
  • load_source_documents(...)
  • load_pdf_document(...)
  • load_html_document(...)
  • load_url_document(...)
  • load_vlm_ocr_document(...)
  • load_vlm_ocr_documents(...)
  • VLMOCRBackend — protocol for custom VLM OCR backends
  • BlackGeorgeVLMOCRBackend(...) — blackgeorge multimodal implementation

Top-level shortcut (import sourcery) includes only:

  • load_source_document(...)
  • load_source_documents(...)

IO Exports

From sourcery.io:

  • save_extract_result_jsonl(...)
  • iter_document_rows(...)
  • load_document_results_jsonl(...)
  • render_document_html(...)
  • write_document_html(...)
  • render_reviewer_html(...)
  • write_reviewer_html(...)
  • visualize(...)

Top-level shortcut (import sourcery) includes:

  • write_reviewer_html(...)

Convenience Example

import sourcery
from sourcery.contracts import ExtractionTask, RuntimeConfig

# Sources may mix local file paths and URLs, as in this example.
result = sourcery.extract_from_sources(
    ["sample.pdf", "https://example.com/article"],
    task=ExtractionTask(...),  # task arguments are elided here
    runtime=RuntimeConfig(model="deepseek/deepseek-chat"),
)