Load Text, PDFs, URLs, HTML, and Images into Sourcery#

Ingestion normalizes heterogeneous inputs into SourceDocument objects. The implementation lives in sourcery/ingest/loaders.py.

Supported Source Types#

  • Inline text
  • Local text-like files: .txt, .md, .rst, .csv, .json, .jsonl, .yaml, .yml
  • PDF files (.pdf, requires pypdf)
  • HTML files and raw HTML
  • HTTP/HTTPS URLs
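
To make the categories above concrete, here is a minimal sketch of how a source string could be sorted into one of them. It is an illustration only, not Sourcery's dispatch logic: classify_source, the label strings, and the .html/.htm suffixes are assumptions made for this example. The fallback to inline text for a path that does not exist mirrors the behavior described under Operational Notes.

```python
from pathlib import Path
from urllib.parse import urlparse

# Text-like extensions copied from the list above.
TEXT_SUFFIXES = {".txt", ".md", ".rst", ".csv", ".json", ".jsonl", ".yaml", ".yml"}
# Assumption: this page does not say which extensions count as HTML files.
HTML_SUFFIXES = {".html", ".htm"}


def classify_source(source: str) -> str:
    """Return a coarse label for a source string (hypothetical helper)."""
    try:
        parsed = urlparse(source)
    except ValueError:
        parsed = None  # malformed URL-like strings are not treated as URLs
    if parsed and parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"

    try:
        is_file = Path(source).is_file()
    except (OSError, ValueError):
        is_file = False  # e.g. a long paragraph of text is not a usable path
    if not is_file:
        # Anything that is not a URL or an existing file is inline text.
        return "inline_text"

    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in HTML_SUFFIXES:
        return "html"
    if suffix in TEXT_SUFFIXES:
        return "text_file"
    return "unsupported"
```

With this sketch, classify_source("https://example.com/post") yields "url", while a string such as "missing/path.txt" that names no existing file yields "inline_text".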

Primary APIs#

  • load_source_document(source, ...)
  • load_source_documents(sources, ...)

Extended APIs:

  • load_pdf_document(path, ...)
  • load_html_document(source, raw_html=False, ...)
  • load_url_document(url, ...)

Examples#

Load one source automatically:

from sourcery.ingest import load_source_document

doc = load_source_document("reports/q4.pdf")

Load multiple mixed sources:

from sourcery.ingest import load_source_documents

documents = load_source_documents([
    "notes.txt",
    "https://example.com/post",
    "Raw inline text to extract from",
])

Load a raw HTML string explicitly:

from sourcery.ingest.loaders import load_html_document

doc = load_html_document("<html><body><h1>Hello</h1></body></html>", raw_html=True)

Failure Modes#

  • Missing optional dependency -> SourceryDependencyError
  • Empty parsed content -> SourceryIngestionError
  • Invalid URL passed to load_url_document(...) -> SourceryIngestionError
  • Missing PDF/HTML path in dedicated loaders -> SourceryIngestionError
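
When loading many sources, it can be useful to record which ones failed instead of stopping at the first error. The sketch below is one possible pattern, not part of Sourcery: load_all is a hypothetical helper that takes the loader and the exception types as arguments, so it makes no assumption about where the Sourcery exceptions are imported from.

```python
from __future__ import annotations

from typing import Callable, Iterable


def load_all(
    sources: Iterable[str],
    loader: Callable[[str], object],
    expected_errors: tuple[type[BaseException], ...],
) -> tuple[list[object], dict[str, BaseException]]:
    """Load each source, collecting expected failures per source (hypothetical helper)."""
    loaded: list[object] = []
    failed: dict[str, BaseException] = {}
    for source in sources:
        try:
            loaded.append(loader(source))
        except expected_errors as exc:
            # Keep the exception so the caller can report or retry later.
            failed[source] = exc
    return loaded, failed
```

With Sourcery, the loader would presumably be load_source_document and expected_errors would be (SourceryIngestionError, SourceryDependencyError). Any exception type not listed still propagates, so unexpected bugs are not silently swallowed.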

VLM OCR#

VLM OCR extracts text from images using any vision-language model (VLM). Sourcery provides the interface and the model does the recognition, so no OCR-specific dependencies are required.

Interface#

VLMOCRBackend is a protocol: any object that implements extract_text(*, image_path, prompt) -> str satisfies it, with no subclassing required. Sourcery ships BlackGeorgeVLMOCRBackend, which delegates to blackgeorge's multimodal worker. Change runtime.model to use any VLM your provider supports.
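
For readers unfamiliar with structural typing, the following sketch shows how a protocol of this shape behaves in plain Python. OCRBackendSketch is a stand-in written for this example; the real VLMOCRBackend definition may differ in its type hints and details. EchoBackend and NotABackend are likewise hypothetical.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class OCRBackendSketch(Protocol):
    """Stand-in for VLMOCRBackend (assumed shape, not the real definition)."""

    def extract_text(self, *, image_path: str, prompt: Optional[str]) -> str: ...


class EchoBackend:
    """Toy backend that returns a fixed string instead of calling a model."""

    def extract_text(self, *, image_path: str, prompt: Optional[str] = None) -> str:
        return f"text from {image_path}"


class NotABackend:
    """Has no extract_text method, so it does not satisfy the protocol."""


# isinstance() on a runtime-checkable protocol only checks that the method
# exists; it does not validate the signature or the return type.
is_backend = isinstance(EchoBackend(), OCRBackendSketch)
is_not_backend = isinstance(NotABackend(), OCRBackendSketch)
```

Because the runtime check looks only at method names, a static type checker remains the better tool for catching a backend whose extract_text signature is wrong.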

Examples#

from sourcery.contracts import RuntimeConfig
from sourcery.ingest import BlackGeorgeVLMOCRBackend, load_vlm_ocr_document

backend = BlackGeorgeVLMOCRBackend(
    RuntimeConfig(model="gemini/gemini-2.5-flash", temperature=0.0),
)

doc = load_vlm_ocr_document("scan.png", backend=backend)

With a custom prompt for structured extraction:

doc = load_vlm_ocr_document(
    "invoice.jpg",
    backend=backend,
    prompt="Extract invoice number, date, total amount, and line items as a table.",
)

Batch processing:

from sourcery.ingest import load_vlm_ocr_documents

docs = load_vlm_ocr_documents(
    ["page1.png", "page2.png", "page3.png"],
    backend=backend,
)

Custom backend for any VLM:

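from sourcery.ingest import VLMOCRBackend, load_vlm_ocr_document

class MyVLMBackend:
    def extract_text(self, *, image_path, prompt=None):
        # Call your model here; the placeholder below stands in for its output.
        extracted_text = "text recognized by your model"
        return extracted_text

# Structural check: passes because MyVLMBackend defines extract_text.
assert isinstance(MyVLMBackend(), VLMOCRBackend)
doc = load_vlm_ocr_document("scan.png", backend=MyVLMBackend())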

Operational Notes#

  • URL ingestion auto-detects PDF vs HTML vs plain text by content type and payload.
  • load_source_document("missing/path.txt") does not raise by default; because the path does not exist, it is treated as inline text. Use dedicated loaders when you require strict path existence checks.
  • VLM OCR loaders raise SourceryIngestionError on missing files or empty model output.
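
If a mistyped path silently becoming inline text is a concern, one option besides the dedicated loaders is to validate the path before handing it to the generic loader. The helper below is a minimal sketch using only the standard library; require_existing_file is a hypothetical name and is not part of Sourcery.

```python
from pathlib import Path


def require_existing_file(source: str) -> Path:
    """Return the path if it names an existing file, else raise (hypothetical helper)."""
    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"Source file not found: {source}")
    return path
```

A call such as load_source_document(str(require_existing_file("notes.txt"))) would then fail fast on a typo instead of ingesting the literal string "notes.txt" as text.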