Load Text, PDFs, URLs, HTML, and Images into Sourcery#
Ingestion normalizes heterogeneous inputs into SourceDocument objects.
The implementation lives in sourcery/ingest/loaders.py.
Supported Source Types#
- Inline text
- Local text-like files: .txt, .md, .rst, .csv, .json, .jsonl, .yaml, .yml
- PDF files (.pdf, requires pypdf)
- HTML files and raw HTML
- HTTP/HTTPS URLs
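The source types above suggest a dispatch step that decides how a string should be loaded. The following is a minimal sketch of such a step using only the standard library. It is not Sourcery's implementation: `classify_source`, the returned labels, and the `.html`/`.htm` suffix set are assumptions made for illustration. A path that does not exist is classified as inline text, which matches the behavior described under Operational Notes.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions copied from the supported source types list.
TEXT_SUFFIXES = {".txt", ".md", ".rst", ".csv", ".json", ".jsonl", ".yaml", ".yml"}
# Assumption: the docs say "HTML files" without naming extensions.
HTML_SUFFIXES = {".html", ".htm"}


def classify_source(source: str) -> str:
    """Hypothetical helper: guess a coarse kind for a source string."""
    try:
        parsed = urlparse(source)
    except ValueError:
        parsed = None
    if parsed and parsed.scheme in {"http", "https"} and parsed.netloc:
        return "url"
    try:
        path = Path(source)
        is_file = path.is_file()
    except (OSError, ValueError):
        # Strings that cannot be valid paths are treated as inline text.
        return "inline_text"
    if not is_file:
        # Nonexistent paths fall through to inline text.
        return "inline_text"
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in HTML_SUFFIXES:
        return "html"
    if suffix in TEXT_SUFFIXES:
        return "text_file"
    return "unknown_file"
```

A classifier like this can be run before ingestion to log or reject sources whose kind is not what the caller expected.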
Primary APIs#
- load_source_document(source, ...)
- load_source_documents(sources, ...)
Extended APIs:
- load_pdf_document(path, ...)
- load_html_document(source, raw_html=False, ...)
- load_url_document(url, ...)
Examples#
Load one source automatically:
from sourcery.ingest import load_source_document
doc = load_source_document("reports/q4.pdf")
Load multiple mixed sources:
from sourcery.ingest import load_source_documents
documents = load_source_documents([
    "notes.txt",
    "https://example.com/post",
    "Raw inline text to extract from",
])
Load raw HTML string explicitly:
from sourcery.ingest.loaders import load_html_document
doc = load_html_document("<html><body><h1>Hello</h1></body></html>", raw_html=True)
Failure Modes#
- Missing optional dependency -> SourceryDependencyError
- Empty parsed content -> SourceryIngestionError
- Invalid URL passed to load_url_document(...) -> SourceryIngestionError
- Missing PDF/HTML path in dedicated loaders -> SourceryIngestionError
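One way to work with these failure modes in a batch is to validate inputs up front and collect errors instead of stopping at the first one. The sketch below shows that pattern for URLs. It runs without Sourcery installed, so `SourceryIngestionError` here is a local stand-in class, and `require_http_url` and `partition_urls` are hypothetical helpers, not part of the library. The exact checks that load_url_document(...) performs may differ.

```python
from urllib.parse import urlparse


class SourceryIngestionError(Exception):
    """Local stand-in so the sketch runs without Sourcery installed."""


def require_http_url(url: str) -> str:
    """Hypothetical pre-check: return the URL unchanged or raise."""
    try:
        parsed = urlparse(url)
    except ValueError as exc:
        raise SourceryIngestionError(f"Invalid URL: {url!r}") from exc
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise SourceryIngestionError(f"Invalid URL: {url!r}")
    return url


def partition_urls(urls):
    """Split URLs into those that pass the check and (url, error) pairs."""
    valid, errors = [], []
    for url in urls:
        try:
            valid.append(require_http_url(url))
        except SourceryIngestionError as exc:
            errors.append((url, exc))
    return valid, errors


valid, errors = partition_urls(
    ["https://example.com/post", "ftp://example.com/file", "not a url"]
)
for url, exc in errors:
    print(f"skipped {url!r}: {exc}")
```

With the real library, the same try/except shape would wrap the loader call itself, catching the exception classes listed above.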
VLM OCR#
Image-based text extraction via any vision-language model. Sourcery provides the interface; the model does the work. No OCR-specific dependencies required.
Interface#
VLMOCRBackend is a protocol: implement extract_text(*, image_path, prompt) -> str and your class works as a backend, with no inheritance required. Sourcery ships BlackGeorgeVLMOCRBackend, which delegates to blackgeorge's multimodal worker. Change runtime.model to use any VLM your provider supports.
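For readers unfamiliar with structural typing, here is a minimal sketch of how a protocol of this shape can be declared and checked with the standard `typing` module. `OCRBackendSketch`, `EchoBackend`, and `NotABackend` are hypothetical names for illustration; Sourcery's actual VLMOCRBackend definition may differ in its annotations and defaults.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class OCRBackendSketch(Protocol):
    """Illustrative protocol with the method shape described above."""

    def extract_text(self, *, image_path: str, prompt: Optional[str] = None) -> str:
        ...


class EchoBackend:
    """Satisfies the protocol without inheriting from it."""

    def extract_text(self, *, image_path, prompt=None):
        return f"[{prompt or 'default prompt'}] text from {image_path}"


class NotABackend:
    """Lacks extract_text, so the isinstance check fails."""


print(isinstance(EchoBackend(), OCRBackendSketch))
print(isinstance(NotABackend(), OCRBackendSketch))
```

Note that a `runtime_checkable` isinstance check only verifies that the method exists. It does not validate the signature or the return type, so a type checker is still the better tool for catching mismatched backends.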
Examples#
from sourcery.contracts import RuntimeConfig
from sourcery.ingest import BlackGeorgeVLMOCRBackend, load_vlm_ocr_document
backend = BlackGeorgeVLMOCRBackend(
    RuntimeConfig(model="gemini/gemini-2.5-flash", temperature=0.0),
)
doc = load_vlm_ocr_document("scan.png", backend=backend)
With a custom prompt for structured extraction:
doc = load_vlm_ocr_document(
    "invoice.jpg",
    backend=backend,
    prompt="Extract invoice number, date, total amount, and line items as a table.",
)
Batch processing:
from sourcery.ingest import load_vlm_ocr_documents
docs = load_vlm_ocr_documents(
    ["page1.png", "page2.png", "page3.png"],
    backend=backend,
)
Custom backend for any VLM:
from sourcery.ingest import VLMOCRBackend, load_vlm_ocr_document

class MyVLMBackend:
    def extract_text(self, *, image_path, prompt=None):
        # Call your model here; this placeholder stands in for its output.
        extracted_text = f"text extracted from {image_path}"
        return extracted_text

assert isinstance(MyVLMBackend(), VLMOCRBackend)  # True
doc = load_vlm_ocr_document("scan.png", backend=MyVLMBackend())
Operational Notes#
- URL ingestion auto-detects PDF vs HTML vs plain text by content type and payload.
- load_source_document("missing/path.txt") does not raise by default; because the path does not exist, it is treated as inline text. Use dedicated loaders when you require strict path existence checks.
- VLM OCR loaders raise SourceryIngestionError on missing files or empty model output.
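The first note says URL ingestion detects the payload kind from both the content type and the payload. A minimal sketch of that idea follows, assuming the usual signals: the `Content-Type` media type, the `%PDF-` magic bytes that begin a PDF file, and a leading `<!doctype html` or `<html` marker. `sniff_payload_kind` is a hypothetical helper, and the rules Sourcery actually applies may differ.

```python
def sniff_payload_kind(content_type: str, payload: bytes) -> str:
    """Hypothetical helper: classify a response as pdf, html, or text."""
    # Drop parameters such as "; charset=utf-8" from the header value.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Inspect only the start of the body, ignoring leading whitespace.
    head = payload[:1024].lstrip()

    if media_type == "application/pdf" or head.startswith(b"%PDF-"):
        return "pdf"
    if media_type in {"text/html", "application/xhtml+xml"}:
        return "html"
    if head.lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    # Assumption: anything else is handled as plain text.
    return "text"
```

Checking the payload as well as the header matters because servers sometimes send PDFs as `application/octet-stream` or omit the content type entirely.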