Ingestion#

Ingestion normalizes heterogeneous inputs into SourceDocument. Implementation lives in sourcery/ingest/loaders.py.

Supported Source Types#

Inline text
Local text-like files: .txt, .md, .rst, .csv, .json, .jsonl, .yaml, .yml
PDF files (.pdf, requires pypdf)
HTML files and raw HTML
HTTP/HTTPS URLs
OCR image files: .png, .jpg, .jpeg, .webp, .tiff, .bmp (requires Pillow + pytesseract)

Primary APIs#

load_source_document(source, ...)
load_source_documents(sources, ...)

Extended APIs:

load_pdf_document(path, ...)
load_html_document(source, raw_html=False, ...)
load_url_document(url, ...)
load_ocr_image_document(path, ...)

Examples#

Load one source automatically:

from sourcery.ingest import load_source_document

doc = load_source_document("reports/q4.pdf")

Load multiple mixed sources:

from sourcery.ingest import load_source_documents

documents = load_source_documents([
    "notes.txt",
    "https://example.com/post",
    "Raw inline text to extract from",
])

Load raw HTML string explicitly:

from sourcery.ingest.loaders import load_html_document

doc = load_html_document("<html><body><h1>Hello</h1></body></html>", raw_html=True)

Failure Modes#

Missing optional dependency -> SourceryDependencyError
Empty parsed content -> SourceryIngestionError
Invalid URL passed to load_url_document(...) -> SourceryIngestionError
Missing PDF/HTML/image path in dedicated loaders -> SourceryIngestionError

Operational Notes#

URL ingestion auto-detects PDF vs HTML vs plain text by content type and payload.
OCR requires system Tesseract to be installed in addition to Python packages.
load_source_document("missing/path.txt") does not raise by default; because the path does not exist, it is treated as inline text. Use dedicated loaders when you require strict path existence checks.