Enable Ingestor to save results to disk

Open edknv opened this issue 9 months ago • 5 comments

Description

Resolves #722

This PR introduces a new capability to the nv-ingest-client.Ingestor class, allowing users to save the results of an ingestion process directly to disk. This is particularly beneficial for large datasets where holding all processed results (including potentially large base64-encoded media) in memory can lead to high resource consumption or out-of-memory errors.

Proposed API:

ingestor = (
    Ingestor()
    .files("*.pdf")
    .extract(
        ...
    )
    .embed()
    .save_to_disk(output_directory="/path/to/dir")
)

results = ingestor.ingest()

When save_to_disk() is configured, ingest() passes a completion callback to process_jobs_concurrently(). As each job completes, this callback writes response['data'] (the list of extraction items) to a file in the output directory, so results do not need to accumulate in memory.
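A minimal sketch of how such a completion callback might persist results, assuming each response is a dict whose 'data' key holds a list of JSON-serializable items and that one JSONL file with a `.results.jsonl` suffix is written per source document. The names `make_save_callback`, `on_job_complete`, and `source_name` are hypothetical and are not taken from the nv-ingest API:

```python
import json
import os


def make_save_callback(output_directory):
    """Build a hypothetical completion callback that writes results as JSONL."""
    os.makedirs(output_directory, exist_ok=True)

    def on_job_complete(response, source_name):
        # Assumption: `response` is a dict and response["data"] is the list
        # of extraction items; `source_name` is an illustrative job label.
        path = os.path.join(output_directory, f"{source_name}.results.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for item in response["data"]:
                # One JSON object per line keeps the file streamable.
                f.write(json.dumps(item) + "\n")
        return path

    return on_job_complete
```

Writing one item per line means a reader can later stream the file back without loading every item at once.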

ingest() collects one LazyLoadedList instance per saved result file and returns them in a list, preserving compatibility with code that expects a list-like structure. Each LazyLoadedList supports list-like indexing and iteration over its items.

>>> print(results)
[<LazyLoadedList file='1016445.pdf.results.jsonl', len=2>, ...]

>>> print(results[0][0].keys())
dict_keys(['document_type', 'metadata'])

>>> print([_ for _ in results[0]])
[{'document_type': 'text', 'metadata': {'content': "simunres PRO...
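To illustrate the list-like behavior shown above, here is a rough sketch of a lazily loaded, JSONL-backed sequence. It is a simplified stand-in and not the actual LazyLoadedList implementation; the class name `LazyJsonlList` and its internals are assumptions made for illustration:

```python
import json
import os
from collections.abc import Sequence


class LazyJsonlList(Sequence):
    """Hypothetical stand-in for LazyLoadedList: reads a JSONL file on demand."""

    def __init__(self, filepath):
        self.filepath = filepath
        self._len = None  # cached after the first full pass over the file

    def __iter__(self):
        # Stream items from disk one line at a time.
        with open(self.filepath, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    def __len__(self):
        if self._len is None:
            self._len = sum(1 for _ in self)
        return self._len

    def __getitem__(self, index):
        if isinstance(index, slice):
            return list(self)[index]
        n = len(self)
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError("index out of range")
        # Scan forward until the requested item is reached.
        for i, item in enumerate(self):
            if i == index:
                return item
        raise IndexError("index out of range")

    def __repr__(self):
        name = os.path.basename(self.filepath)
        return f"<LazyJsonlList file={name!r}, len={len(self)}>"
```

In this sketch nothing is held in memory between accesses, which trades repeated file reads for a small footprint; a real implementation may cache or index differently.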

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.
  • [ ] If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

edknv • May 21 '25 02:05

Integration Test Results:

GPUs: NVIDIA H100 NVL

  • nv-ingest-cli check: :white_check_mark:
  • helm check: :x:
  • library mode check: :white_check_mark:
  • audio check: :white_check_mark:
  • pptx/docx/image/txt check: :white_check_mark:

Latency Test (bo20 PDFs) :white_check_mark:: 20/20 docs, 496 pages

  • extraction: 135.77 seconds (3.65 pages/sec)
  • e2e (includes indexing): 139.72 seconds (3.55 pages/sec)

Docker Images:

  • riva-asr:latest
  • paddleocr:1.3.0
  • llama-3.2-nv-embedqa-1b-v2:1.6.0
  • nemoretriever-page-elements-v2:1.3.0
  • llama-3.2-nv-rerankqa-1b-v2:1.5.0
  • nemoretriever-graphic-elements-v1:1.3.0
  • nv-ingest:25.3.0
  • nemoretriever-table-structure-v1:1.3.0
  • dcgm-exporter:4.1.0-4.0.2-ubuntu22.04

randerzander • May 21 '25 20:05

run integration tests

edknv • May 21 '25 23:05

Integration Test Results:

GPUs: NVIDIA H100 NVL

  • nv-ingest-cli check: :white_check_mark:
  • helm check: :x:
  • library mode check: :x:
  • audio check: :white_check_mark:
  • pptx/docx/image/txt check: :white_check_mark:

Latency Test (bo20 PDFs) :white_check_mark:: 20/20 docs, 496 pages

  • extraction: 78.56 seconds (6.31 pages/sec)
  • e2e (includes indexing): 82.64 seconds (6.00 pages/sec)

Docker Images:

  • paddleocr:1.3.0
  • riva-asr:latest
  • llama-3.2-nv-rerankqa-1b-v2:1.5.0
  • nemoretriever-table-structure-v1:1.3.0
  • nemoretriever-graphic-elements-v1:1.3.0
  • nemoretriever-page-elements-v2:1.3.0
  • llama-3.2-nv-embedqa-1b-v2:1.6.0
  • nv-ingest:25.3.0
  • dcgm-exporter:4.1.0-4.0.2-ubuntu22.04

randerzander • May 22 '25 17:05

run integration tests

randerzander • May 23 '25 03:05

Integration Test Results:

GPUs: NVIDIA A100-SXM4-80GB

  • nv-ingest-cli check: :white_check_mark:
  • helm check: :x:
  • library mode check: :x:
  • audio check: :white_check_mark:
  • pptx/docx/image/txt check: :white_check_mark:

Latency Test (bo20 PDFs) :white_check_mark:: 20/20 docs, 496 pages

  • extraction: 101.08 seconds (4.91 pages/sec)
  • e2e (includes indexing): 105.38 seconds (4.71 pages/sec)

Docker Images:

  • riva-asr:latest
  • llama-3.2-nv-rerankqa-1b-v2:1.5.0
  • llama-3.2-nv-embedqa-1b-v2:1.6.0
  • nemoretriever-page-elements-v2:1.3.0
  • nemoretriever-table-structure-v1:1.3.0
  • paddleocr:1.3.0
  • nemoretriever-graphic-elements-v1:1.3.0
  • nv-ingest:25.3.0
  • dcgm-exporter:3.3.5-3.4.1-ubuntu22.04

randerzander • May 23 '25 04:05