[BUG]: Deadlock (infinite loop) when Paddle is not ready
Version
25.09 (25.9.0)
Which installation method(s) does this occur on?
No response
Describe the bug.
Jobs hang indefinitely when Paddle is unready; the client defaults to infinite retries, and `submit_job` is not gated by readiness.
- Impact: Ingestion sessions can run for 30+ minutes with near‑zero utilization when Paddle OCR is unready; there is no fail‑fast behavior or clear error. Users see 0% progress even for small batches.
Versions:
- nv-ingest-client: 25.9.0
- nv-ingest-api: 25.9.0
Environment: NV‑Ingest API exposed via a gateway (`http://<gw-host>:80`). The client uses `Ingestor(...).extract(...).split(...).embed().ingest(show_progress=True)` with defaults.
Steps to reproduce
- Run the NV‑Ingest API without a working Paddle endpoint (e.g., unset `PADDLE_HTTP_ENDPOINT` or set it to an unreachable URL inside the cluster).
- Confirm that health shows Paddle unready:

  ```bash
  curl -s http://<gw-host>:80/v1/health/ready
  # returns 503 with JSON including: "paddle_ready": false
  ```

- From a client, submit a small batch (e.g., 6 PDFs) using `nv-ingest-client` defaults (no explicit `max_job_retries`); a minimal sketch follows this list.
- Observe the client logs: “Starting batch processing for 6 jobs…” is printed, then no progress follows; the process continues indefinitely with low CPU/GPU utilization.
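For reference, a minimal sketch of the client-side submission described above. This is an assumption-laden illustration, not code verified against 25.9.0: the import path and constructor keyword arguments follow common nv-ingest-client examples and may differ by version, and the host, port, and file glob are placeholders.

```python
# Hedged sketch of the submission that reproduces the hang (placeholders marked).
from nv_ingest_client.client import Ingestor, NvIngestClient

client = NvIngestClient(
    message_client_hostname="<gw-host>",  # placeholder gateway host
    message_client_port=80,
)

ingestor = (
    Ingestor(client=client)
    .files("./pdfs/*.pdf")   # e.g., a batch of 6 small PDFs
    .extract()               # extraction path that depends on Paddle OCR
    .split()
    .embed()
)

# With the defaults (max_job_retries=None), this call does not return while
# Paddle stays unready: fetch keeps answering 202 and the client keeps retrying.
results = ingestor.ingest(show_progress=True)
```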
Observed behavior
- Health reports Paddle unready (503), but `submit_job` accepts requests. Jobs never become ready, and the client polls forever by default.
- Example client log:

  ```text
  Starting batch processing for 6 jobs with batch size 32.
  ... 5 minutes elapse ...
  ⏰ TIMEOUT WARNING: NV-Ingest processing of 6 file(s) has been running for 300.0s (timeout: 300s)
  ```

- Health sample:

  ```text
  HTTP/1.1 503
  {"ingest_ready":true,"pipeline_ready":true,"paddle_ready":false,"yolox_graphic_elements_ready":true,"yolox_page_elements_ready":true,"yolox_table_structure_ready":true}
  ```
Expected behavior
- Either:
  - the API rejects `submit_job` for tasks that require Paddle (or other unready components) with a 503 and a clear message, or
  - the client fails fast with a clear error when `/v1/health/ready` reports unready dependencies, or
  - the client times out after a finite number of retries by default, returning an actionable error.
Analysis (suspected root cause)
- The health endpoint exposes Paddle readiness:

  ```python
  READY_CHECK_ENV_VAR_MAP = {
      "paddle": "PADDLE_HTTP_ENDPOINT",
      "yolox_graphic_elements": "YOLOX_GRAPHIC_ELEMENTS_HTTP_ENDPOINT",
      "yolox_page_elements": "YOLOX_HTTP_ENDPOINT",
      "yolox_table_structure": "YOLOX_TABLE_STRUCTURE_HTTP_ENDPOINT",
  }
  ```
- Paddle endpoint defaults (when present) are read from the environment:

  ```python
  paddle_http_endpoint: str = os.getenv("PADDLE_HTTP_ENDPOINT", "https://ai.api.nvidia.com/v1/cv/baidu/paddleocr")
  paddle_infer_protocol: str = os.getenv("PADDLE_INFER_PROTOCOL", "http")
  ```
- The client’s `Ingestor.ingest` defaults to infinite retries:

  ```python
  DEFAULT_TIMEOUT: int = 100
  DEFAULT_MAX_RETRIES: int = None
  DEFAULT_VERBOSE: bool = False

  timeout: int = kwargs.pop("timeout", DEFAULT_TIMEOUT)
  max_job_retries: int = kwargs.pop("max_job_retries", DEFAULT_MAX_RETRIES)
  verbose: bool = kwargs.pop("verbose", DEFAULT_VERBOSE)
  ```
- The retry loop requeues jobs on 202 (not ready) when `max_job_retries` is None:

  ```python
  except TimeoutError:
      self.retry_counts[job_index] += 1
      if self.max_job_retries is None or self.retry_counts[job_index] <= self.max_job_retries:
          # not ready → keep retrying indefinitely when None
          self.retry_job_ids.append(job_index)
      else:
          ...
  ```
- The batch cycle continues while retries are pending:

  ```python
  logger.info(f"Starting batch processing for {total_jobs} jobs with batch size {self.batch_size}.")
  while (submitted_new_indices_count < total_jobs) or self.retry_job_ids:
  ```
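To make the interaction between these two pieces concrete, here is a small self-contained simulation (not nv-ingest code) of the same retry pattern. It shows that a job whose fetch never succeeds is requeued forever when `max_retries` is None, but drains once a finite limit is set:

```python
# Standalone simulation of the retry pattern above (not nv-ingest code).
from collections import defaultdict

def run_batch(job_ids, max_retries=None, max_iterations=10):
    retry_counts = defaultdict(int)
    retry_queue = list(job_ids)
    iterations = 0
    while retry_queue and iterations < max_iterations:  # safety cap for the demo only
        iterations += 1
        job = retry_queue.pop(0)
        # Simulate fetch raising TimeoutError because the job is still 202 / not ready.
        retry_counts[job] += 1
        if max_retries is None or retry_counts[job] <= max_retries:
            retry_queue.append(job)   # requeued indefinitely when max_retries is None
        else:
            print(f"job {job} failed after {retry_counts[job] - 1} retries")
    return retry_queue

print(run_batch(["job-0"], max_retries=None))  # still queued after the cap: ['job-0']
print(run_batch(["job-0"], max_retries=3))     # prints a failure and returns []
```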
Proposed fixes
- Server-side:
  - Gate `/v1/submit_job`: if a submission’s task chain requires Paddle/YOLOX/etc. and the corresponding readiness flag is false, return 503 with explicit dependency details (e.g., “paddle not ready; check PADDLE_HTTP_ENDPOINT/egress/auth”); see the sketch after this list.
  - Optionally transition such jobs to FAILED immediately with a clear reason instead of accepting them and then returning 202 indefinitely from fetch.
- Client-side (nv-ingest-client):
  - Change the default `max_job_retries` from `None` to a finite number (e.g., 60) to prevent infinite loops by default.
  - Before submission, optionally call `/v1/health/ready` and fail fast with a user-friendly error if required components are unready (behind a flag that can be enabled by default).
  - When fetch returns 202 for longer than N minutes, surface a clear error citing the `/v1/health/ready` state to guide remediation.
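A rough sketch of the server-side gating idea, written as a FastAPI-style route for illustration only. `get_ready_components()` and `required_components()` are hypothetical placeholders for logic the service already has (the readiness map behind `/v1/health/ready` and inspection of the submitted task chain); the real handler will differ.

```python
# Hypothetical sketch only: readiness gate on /v1/submit_job (FastAPI-style).
from fastapi import APIRouter, HTTPException

router = APIRouter()

def get_ready_components() -> dict[str, bool]:
    # Placeholder: reuse the readiness map that backs /v1/health/ready.
    return {"paddle": False, "yolox_page_elements": True}

def required_components(job_spec: dict) -> set[str]:
    # Placeholder: derive which components the submitted task chain needs.
    task_types = {t.get("type") for t in job_spec.get("tasks", [])}
    return {"paddle"} if "extract" in task_types else set()

@router.post("/v1/submit_job")
async def submit_job(job_spec: dict):
    readiness = get_ready_components()
    unready = sorted(c for c in required_components(job_spec) if not readiness.get(c, False))
    if unready:
        # Reject up front instead of accepting a job that can never complete.
        raise HTTPException(
            status_code=503,
            detail=(
                f"Required components not ready: {', '.join(unready)}. "
                "Check the corresponding *_HTTP_ENDPOINT env vars, egress, and auth."
            ),
        )
    # ... existing submission path continues here
```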
Workarounds for users (until fixed)
- Disable Paddle-dependent tasks (e.g., image/infographic extraction, NV‑Ingest captions) or explicitly set a finite `max_job_retries`; see the sketch after this list.
- Ensure `PADDLE_HTTP_ENDPOINT` and credentials are configured and reachable from within the cluster.
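A minimal sketch of the client-side workaround. It is hedged the same way as the reproducer above: the host, port, and file glob are placeholders, the constructor arguments may differ by client version, and the readiness field names follow the `/v1/health/ready` sample shown earlier in this issue.

```python
# Workaround sketch: pre-check readiness, then bound retries explicitly.
import sys
import requests
from nv_ingest_client.client import Ingestor, NvIngestClient

GATEWAY = "http://<gw-host>:80"   # placeholder gateway base URL

# 1) Pre-flight: fail fast if required components are unready (non-200 from health).
health = requests.get(f"{GATEWAY}/v1/health/ready", timeout=10)
if health.status_code != 200:
    unready = [k for k, v in health.json().items() if v is False]
    sys.exit(f"NV-Ingest dependencies not ready (HTTP {health.status_code}): {unready}")

# 2) Bound retries so a stuck job surfaces an error instead of spinning forever.
client = NvIngestClient(message_client_hostname="<gw-host>", message_client_port=80)
results = (
    Ingestor(client=client)
    .files("./pdfs/*.pdf")
    .extract()
    .split()
    .embed()
    .ingest(show_progress=True, max_job_retries=60)  # finite, instead of the None default
)
```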
Acceptance criteria
- Submissions that require an unready dependency are rejected with a 503 and a clear message, or the client fails fast with an actionable error.
- Default client behavior does not loop indefinitely when dependencies are down.
- Documentation clarifies required env vars and readiness gating behavior.
Minimum reproducible example
Relevant log output
Other/Misc.
No response
Another (slightly related) issue I want to surface: I've noticed in general (from past successful runs) that nv-ingest will just sit there without any logging even when it's processing successfully, which makes it really hard to know when it's having issues. I often need to wait tens of minutes before I know if it's having an issue with a batch.
I just discovered that our helm upgrade hadn't fully completed, so nv-ingest pods were still at the previous release version. I just upgraded and will try it again on 25.9.0
It appears that the infinite loop behavior still exists, though we fixed the Paddle issue. It seems that if anything causes a problem during ingestion (embedding service unreachable, token limit exceeded, etc.), even a transient issue, nv-ingest will just spin forever without progressing or logging any errors.