
[BUG]: Deadlock (infinite loop) when paddle is not ready

Open devinbost opened this issue 3 months ago • 3 comments

Version

25.09 (25.9.0)

Which installation method(s) does this occur on?

No response

Describe the bug.

Jobs hang indefinitely when Paddle is unready; the client defaults to infinite retries, and submit_job is not gated by readiness checks.

  • Impact: Ingestion sessions can run for 30+ minutes with near‑zero utilization when Paddle OCR is unready, with no fail‑fast behavior or clear error; users see 0% progress even for small batches.
  • Versions:
    • nv-ingest-client: 25.9.0
    • nv-ingest-api: 25.9.0
  • Environment: NV‑Ingest API exposed via gateway (http://<gw-host>:80). Client uses Ingestor(...).extract(...).split(...).embed().ingest(show_progress=True) with defaults.

Steps to reproduce

  1. Run NV‑Ingest API without a working Paddle endpoint (e.g., unset PADDLE_HTTP_ENDPOINT or set to an unreachable URL inside the cluster).
  2. Confirm health shows Paddle unready:
    curl -s http://<gw-host>:80/v1/health/ready
    # returns 503 with JSON including: "paddle_ready": false
    
  3. From a client, submit a small batch (e.g., 6 PDFs) using nv-ingest-client defaults (no explicit max_job_retries); a minimal submission sketch follows these steps.
  4. Observe the client logs: “Starting batch processing for 6 jobs…” followed by no progress; the process continues indefinitely with low CPU/GPU utilization.
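
A minimal submission sketch, assuming the pipeline chain from the Environment section above; the import path, constructor arguments, and file glob are assumptions and may differ in your deployment:

from nv_ingest_client.client import Ingestor  # import path assumed

# Build the same chain as in the Environment section; host and paths are placeholders.
ingestor = (
    Ingestor(message_client_hostname="<gw-host>", message_client_port=80)
    .files("./pdfs/*.pdf")   # the small test batch (e.g., 6 PDFs)
    .extract()               # Paddle-dependent for image/infographic extraction (see Workarounds below)
    .split()
    .embed()
)

# With client defaults (max_job_retries=None), this call never returns while Paddle is unready.
results = ingestor.ingest(show_progress=True)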

Observed behavior

  • Health reports Paddle unready (503), yet submit_job still accepts requests; jobs never become ready, and the client polls forever by default.
  • Example client log:
    Starting batch processing for 6 jobs with batch size 32.
    ... 5 minutes elapse ...
    ⏰ TIMEOUT WARNING: NV-Ingest processing of 6 file(s) has been running for 300.0s (timeout: 300s)
    
  • Health sample:
    HTTP/1.1 503
    {"ingest_ready":true,"pipeline_ready":true,"paddle_ready":false,"yolox_graphic_elements_ready":true,"yolox_page_elements_ready":true,"yolox_table_structure_ready":true}
    

Expected behavior

  • Either:
    • API rejects submit_job for tasks that require Paddle (or other unready components) with 503 and a clear message, or
    • Client fails fast with a clear error when /v1/health/ready reports unready dependencies, or
    • Client times out after a finite number of retries by default, returning an actionable error.

Analysis (suspected root cause)

  • Health endpoint exposes Paddle readiness:
READY_CHECK_ENV_VAR_MAP = {
    "paddle": "PADDLE_HTTP_ENDPOINT",
    "yolox_graphic_elements": "YOLOX_GRAPHIC_ELEMENTS_HTTP_ENDPOINT",
    "yolox_page_elements": "YOLOX_HTTP_ENDPOINT",
    "yolox_table_structure": "YOLOX_TABLE_STRUCTURE_HTTP_ENDPOINT",
}
  • Paddle endpoint defaults (when present) are read from env:
paddle_http_endpoint: str = os.getenv("PADDLE_HTTP_ENDPOINT", "https://ai.api.nvidia.com/v1/cv/baidu/paddleocr")
paddle_infer_protocol: str = os.getenv("PADDLE_INFER_PROTOCOL", "http")
  • The client’s Ingestor.ingest defaults to infinite retries:
DEFAULT_TIMEOUT: int = 100
DEFAULT_MAX_RETRIES: int = None
DEFAULT_VERBOSE: bool = False
timeout: int = kwargs.pop("timeout", DEFAULT_TIMEOUT)
max_job_retries: int = kwargs.pop("max_job_retries", DEFAULT_MAX_RETRIES)
verbose: bool = kwargs.pop("verbose", DEFAULT_VERBOSE)
  • Retry loop requeues jobs on 202 (not ready) when max_job_retries is None:
except TimeoutError:
    self.retry_counts[job_index] += 1
    if self.max_job_retries is None or self.retry_counts[job_index] <= self.max_job_retries:
        # not ready → keep retrying indefinitely when None
        self.retry_job_ids.append(job_index)
    else:
        ...
  • Batch cycle continues while there are retries:
logger.info(f"Starting batch processing for {total_jobs} jobs with batch size {self.batch_size}.")
while (submitted_new_indices_count < total_jobs) or self.retry_job_ids:

Proposed fixes

  • Server-side:
    • Gate /v1/submit_job: if a submission’s task chain requires Paddle/YOLOX/etc. and the corresponding readiness is false, return 503 with explicit dependency details (e.g., “paddle not ready; check PADDLE_HTTP_ENDPOINT/egress/auth”).
    • Optionally transition such jobs to FAILED immediately with a clear reason instead of accepting and then returning 202 indefinitely from fetch.
  • Client-side (nv-ingest-client):
    • Change default max_job_retries from None to a finite number (e.g., 60) to prevent infinite loops by default.
    • Before submission, optionally call /v1/health/ready and fail fast with a user-friendly error if required components are unready (behind a flag that could be enabled by default); a sketch follows this list.
    • When fetch returns 202 for longer than N minutes, surface a clear error citing /v1/health/ready state to guide remediation.
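
A minimal sketch of the proposed client-side pre-flight check, assuming only the /v1/health/ready endpoint and JSON keys shown in the health sample above; the helper name and error handling are hypothetical, not an existing nv-ingest-client API:

import requests

def assert_dependencies_ready(base_url, required=("paddle_ready",)):
    """Hypothetical fail-fast helper: raise if /v1/health/ready reports a required component unready."""
    resp = requests.get(f"{base_url}/v1/health/ready", timeout=10)
    status = resp.json() if resp.content else {}
    unready = [key for key in required if not status.get(key, False)]
    if resp.status_code != 200 or unready:
        raise RuntimeError(
            f"NV-Ingest dependencies unready (HTTP {resp.status_code}): {unready}; "
            "check PADDLE_HTTP_ENDPOINT / egress / auth before submitting jobs."
        )

# Usage (hypothetical): call before building and submitting jobs.
# assert_dependencies_ready("http://<gw-host>:80")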

Workarounds for users (until fixed)

  • Disable Paddle-dependent tasks (e.g., image/infographic extraction, NV‑Ingest captions) or explicitly set a finite max_job_retries (see the usage sketch after this list).
  • Ensure PADDLE_HTTP_ENDPOINT and credentials are configured and reachable from within the cluster.
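
Until the defaults change, passing a finite max_job_retries to ingest() caps the retry loop. A minimal usage sketch, assuming the ingestor built in the repro sketch above (values are illustrative):

# Cap retries explicitly instead of relying on the None default.
results = ingestor.ingest(
    show_progress=True,
    timeout=100,         # per-fetch timeout in seconds (matches DEFAULT_TIMEOUT above)
    max_job_retries=60,  # finite retry budget instead of requeueing on 202 indefinitely
)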

Acceptance criteria

  • Submissions that require an unready dependency are rejected with 503 and a clear message, or client fails fast with an actionable error.
  • Default client behavior does not loop indefinitely when dependencies are down.
  • Documentation clarifies required env vars and readiness gating behavior.


Minimum reproducible example


Relevant log output


Other/Misc.

No response

devinbost avatar Sep 22 '25 23:09 devinbost

Another (slightly related) issue I want to surface: I've noticed in general (from past successful runs) that nv-ingest will just sit there without any logging even when it's processing successfully, which makes it really hard to know when it's having issues. I often need to wait tens of minutes before I know if it's having an issue with a batch.

devinbost avatar Sep 23 '25 11:09 devinbost

I just discovered that our helm upgrade hadn't fully completed, so nv-ingest pods were still at the previous release version. I just upgraded and will try it again on 25.9.0

devinbost avatar Sep 23 '25 16:09 devinbost

It appears that the infinite loop behavior still exists, though we fixed the paddle issue. It seems that if anything goes wrong during ingestion (embedding service unreachable, token limit exceeded, etc.), even transiently, nv-ingest will just spin forever without progressing or logging any issues.

devinbost avatar Sep 23 '25 18:09 devinbost