Enable Ingestor to save results to disk
Description
Resolves #722
This PR introduces a new capability to the nv-ingest-client.Ingestor class, allowing users to save the results of an ingestion process directly to disk. This is particularly beneficial for large datasets where holding all processed results (including potentially large base64-encoded media) in memory can lead to high resource consumption or out-of-memory errors.
Proposed API:
ingestor = (
    Ingestor()
    .files("*.pdf")
    .extract(
        ...
    )
    .embed()
    .save_to_disk(output_directory="/path/to/dir")
)
results = ingestor.ingest()
When save_to_disk() is configured, ingest() passes a completion callback to process_jobs_concurrently(). The callback writes response['data'] (the list of extraction items) to disk instead of keeping it in memory.
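The callback idea can be illustrated with a minimal sketch. This is not the code from this PR: `make_save_callback`, `on_job_complete`, and the `source_name` argument are hypothetical names, and the JSONL layout is an assumption based on the `.results.jsonl` file name shown in the example output. It only assumes that `response["data"]` is a list of JSON-serializable dicts.

```python
import json
import os
import tempfile


def make_save_callback(output_directory):
    """Build a callback that writes one job's items to a JSONL file (hypothetical helper)."""
    os.makedirs(output_directory, exist_ok=True)

    def on_job_complete(response, source_name):
        # Assumption: response["data"] is a list of JSON-serializable dicts.
        path = os.path.join(output_directory, f"{source_name}.results.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for item in response["data"]:
                # One extraction item per line, so items can be read back one at a time.
                f.write(json.dumps(item) + "\n")
        return path

    return on_job_complete


# Demo with a fabricated response; no ingestion service is involved.
with tempfile.TemporaryDirectory() as tmp:
    callback = make_save_callback(tmp)
    fake_response = {"data": [{"document_type": "text", "metadata": {"content": "hello"}}]}
    saved_path = callback(fake_response, "example.pdf")
    with open(saved_path, encoding="utf-8") as f:
        saved_lines = f.read().splitlines()
```

Writing one item per line keeps the memory footprint bounded by a single job's response rather than the whole batch.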
ingest() collects LazyLoadedList instances and returns them in a list, preserving compatibility with code that expects a list-like structure. Each instance supports list-like indexing and iteration over its items.
>>> print(results)
[<LazyLoadedList file='1016445.pdf.results.jsonl', len=2>, ...]
>>> print(results[0][0].keys())
dict_keys(['document_type', 'metadata'])
>>> print([_ for _ in results[0]])
[{'document_type': 'text', 'metadata': {'content': "simunres PRO...
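To show how a list-like wrapper over a results file could behave, here is a rough sketch. `LazyJsonlList` is a hypothetical stand-in and not the `LazyLoadedList` class added in this PR; it assumes one JSON object per line and re-reads the file on each access rather than caching.

```python
import json
import os
import tempfile
from collections.abc import Sequence


class LazyJsonlList(Sequence):
    """Hypothetical stand-in: serves items from a JSONL file on demand."""

    def __init__(self, path):
        self.path = path
        # Count non-empty lines once so len() is cheap; items stay on disk.
        with open(path, encoding="utf-8") as f:
            self._len = sum(1 for line in f if line.strip())

    def __len__(self):
        return self._len

    def __iter__(self):
        # Stream items one at a time instead of loading the whole file.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return list(self)[index]  # slices materialize the items
        if index < 0:
            index += self._len
        if not 0 <= index < self._len:
            raise IndexError("index out of range")
        for i, item in enumerate(self):
            if i == index:
                return item
        raise IndexError("index out of range")

    def __repr__(self):
        return f"<LazyJsonlList file={os.path.basename(self.path)!r}, len={self._len}>"


# Demo against a small fabricated results file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pdf.results.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"document_type": "text", "metadata": {"content": "a"}}) + "\n")
        f.write(json.dumps({"document_type": "structured", "metadata": {"content": "b"}}) + "\n")

    lazy = LazyJsonlList(demo_path)
    demo_len = len(lazy)
    demo_keys = sorted(lazy[0].keys())
    demo_types = [item["document_type"] for item in lazy]
    demo_last = lazy[-1]["metadata"]["content"]
    demo_repr = repr(lazy)
```

The trade-off in this sketch is that indexing is a linear scan of the file; a real implementation might store line offsets or cache items after first access.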
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
- [ ] If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?
Integration Test Results
GPUs: NVIDIA H100 NVL
- nv-ingest-cli check: :white_check_mark:
- helm check: :x:
- library mode check: :white_check_mark:
- audio check: :white_check_mark:
- pptx/docx/image/txt check: :white_check_mark:
Latency Test (bo20 PDFs) :white_check_mark:: 20/20 docs, 496 pages
- extraction: 135.77 seconds, 3.65 pages/sec
- e2e (includes indexing): 139.72 seconds, 3.55 pages/sec
Docker Images:
- riva-asr:latest
- paddleocr:1.3.0
- llama-3.2-nv-embedqa-1b-v2:1.6.0
- nemoretriever-page-elements-v2:1.3.0
- llama-3.2-nv-rerankqa-1b-v2:1.5.0
- nemoretriever-graphic-elements-v1:1.3.0
- nv-ingest:25.3.0
- nemoretriever-table-structure-v1:1.3.0
- dcgm-exporter:4.1.0-4.0.2-ubuntu22.04
run integration tests
Integration Test Results
GPUs: NVIDIA H100 NVL
- nv-ingest-cli check: :white_check_mark:
- helm check: :x:
- library mode check: :x:
- audio check: :white_check_mark:
- pptx/docx/image/txt check: :white_check_mark:
Latency Test (bo20 PDFs) :white_check_mark:: 20/20 docs, 496 pages
- extraction: 78.56 seconds, 6.31 pages/sec
- e2e (includes indexing): 82.64 seconds, 6.00 pages/sec
Docker Images:
- paddleocr:1.3.0
- riva-asr:latest
- llama-3.2-nv-rerankqa-1b-v2:1.5.0
- nemoretriever-table-structure-v1:1.3.0
- nemoretriever-graphic-elements-v1:1.3.0
- nemoretriever-page-elements-v2:1.3.0
- llama-3.2-nv-embedqa-1b-v2:1.6.0
- nv-ingest:25.3.0
- dcgm-exporter:4.1.0-4.0.2-ubuntu22.04
run integration tests
Integration Test Results
GPUs: NVIDIA A100-SXM4-80GB
- nv-ingest-cli check: :white_check_mark:
- helm check: :x:
- library mode check: :x:
- audio check: :white_check_mark:
- pptx/docx/image/txt check: :white_check_mark:
Latency Test (bo20 PDFs) :white_check_mark:: 20/20 docs, 496 pages
- extraction: 101.08 seconds, 4.91 pages/sec
- e2e (includes indexing): 105.38 seconds, 4.71 pages/sec
Docker Images:
- riva-asr:latest
- llama-3.2-nv-rerankqa-1b-v2:1.5.0
- llama-3.2-nv-embedqa-1b-v2:1.6.0
- nemoretriever-page-elements-v2:1.3.0
- nemoretriever-table-structure-v1:1.3.0
- paddleocr:1.3.0
- nemoretriever-graphic-elements-v1:1.3.0
- nv-ingest:25.3.0
- dcgm-exporter:3.3.5-3.4.1-ubuntu22.04