
[BUG]: Deadlock (infinite loop) when paddle is not ready

Open devinbost opened this issue 3 months ago • 3 comments

Version

25.09 (25.9.0)

Which installation method(s) does this occur on?

No response

Describe the bug.

Jobs hang indefinitely when Paddle is unready; the client defaults to infinite retries, and submit_job is not gated by readiness checks.

  • Impact: Ingestion sessions can run for 30+ minutes with near‑zero utilization when Paddle OCR is unready, with no fail‑fast behavior or clear error; users see 0% progress even for small batches.
  • Versions:
    • nv-ingest-client: 25.9.0
    • nv-ingest-api: 25.9.0
  • Environment: NV‑Ingest API exposed via gateway (http://<gw-host>:80). Client uses Ingestor(...).extract(...).split(...).embed().ingest(show_progress=True) with defaults.

Steps to reproduce

  1. Run NV‑Ingest API without a working Paddle endpoint (e.g., unset PADDLE_HTTP_ENDPOINT or set to an unreachable URL inside the cluster).
  2. Confirm health shows Paddle unready:
    curl -s http://<gw-host>:80/v1/health/ready
    # returns 503 with JSON including: "paddle_ready": false
    
  3. From a client, submit a small batch (e.g., 6 PDFs) using nv-ingest-client defaults (no explicit max_job_retries); a minimal submission sketch follows these steps.
  4. Observe the client logs: “Starting batch processing for 6 jobs…” followed by no progress; the process continues indefinitely with low CPU/GPU utilization.
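
A minimal submission sketch, assuming the pipeline chain from the Environment section above; the import path, constructor arguments, and file glob are assumptions and may differ in your deployment:

from nv_ingest_client.client import Ingestor  # import path assumed

# Build the same chain as in the Environment section; host and paths are placeholders.
ingestor = (
    Ingestor(message_client_hostname="<gw-host>", message_client_port=80)
    .files("./pdfs/*.pdf")   # the small test batch (e.g., 6 PDFs)
    .extract()               # Paddle-dependent for image/infographic extraction (see Workarounds below)
    .split()
    .embed()
)

# With client defaults (max_job_retries=None), this call never returns while Paddle is unready.
results = ingestor.ingest(show_progress=True)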

Observed behavior

  • Health reports Paddle unready (503), yet submit_job still accepts requests; jobs never become ready, and the client polls forever by default.
  • Example client log:
    Starting batch processing for 6 jobs with batch size 32.
    ... 5 minutes elapse ...
    ⏰ TIMEOUT WARNING: NV-Ingest processing of 6 file(s) has been running for 300.0s (timeout: 300s)
    
  • Health sample:
    HTTP/1.1 503
    {"ingest_ready":true,"pipeline_ready":true,"paddle_ready":false,"yolox_graphic_elements_ready":true,"yolox_page_elements_ready":true,"yolox_table_structure_ready":true}
    

Expected behavior

  • Either:
    • API rejects submit_job for tasks that require Paddle (or other unready components) with 503 and a clear message, or
    • Client fails fast with a clear error when /v1/health/ready reports unready dependencies, or
    • Client times out after a finite number of retries by default, returning an actionable error.

Analysis (suspected root cause)

  • Health endpoint exposes Paddle readiness:
READY_CHECK_ENV_VAR_MAP = {
    "paddle": "PADDLE_HTTP_ENDPOINT",
    "yolox_graphic_elements": "YOLOX_GRAPHIC_ELEMENTS_HTTP_ENDPOINT",
    "yolox_page_elements": "YOLOX_HTTP_ENDPOINT",
    "yolox_table_structure": "YOLOX_TABLE_STRUCTURE_HTTP_ENDPOINT",
}
  • Paddle endpoint defaults (when present) are read from env:
paddle_http_endpoint: str = os.getenv("PADDLE_HTTP_ENDPOINT", "https://ai.api.nvidia.com/v1/cv/baidu/paddleocr")
paddle_infer_protocol: str = os.getenv("PADDLE_INFER_PROTOCOL", "http")
  • The client’s Ingestor.ingest defaults to infinite retries:
DEFAULT_TIMEOUT: int = 100
DEFAULT_MAX_RETRIES: int = None
DEFAULT_VERBOSE: bool = False
timeout: int = kwargs.pop("timeout", DEFAULT_TIMEOUT)
max_job_retries: int = kwargs.pop("max_job_retries", DEFAULT_MAX_RETRIES)
verbose: bool = kwargs.pop("verbose", DEFAULT_VERBOSE)
  • Retry loop requeues jobs on 202 (not ready) when max_job_retries is None:
except TimeoutError:
    self.retry_counts[job_index] += 1
    if self.max_job_retries is None or self.retry_counts[job_index] <= self.max_job_retries:
        # not ready → keep retrying indefinitely when None
        self.retry_job_ids.append(job_index)
    else:
        ...
  • Batch cycle continues while there are retries:
logger.info(f"Starting batch processing for {total_jobs} jobs with batch size {self.batch_size}.")
while (submitted_new_indices_count < total_jobs) or self.retry_job_ids:

Proposed fixes

  • Server-side:
    • Gate /v1/submit_job: if a submission’s task chain requires Paddle/YOLOX/etc. and the corresponding readiness is false, return 503 with explicit dependency details (e.g., “paddle not ready; check PADDLE_HTTP_ENDPOINT/egress/auth”).
    • Optionally transition such jobs to FAILED immediately with a clear reason instead of accepting and then returning 202 indefinitely from fetch.
  • Client-side (nv-ingest-client):
    • Change default max_job_retries from None to a finite number (e.g., 60) to prevent infinite loops by default.
    • Before submission, optionally call /v1/health/ready and fail fast with a user-friendly error if required components are unready (behind a flag that could be enabled by default); a sketch follows this list.
    • When fetch returns 202 for longer than N minutes, surface a clear error citing /v1/health/ready state to guide remediation.
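
A minimal sketch of the proposed client-side pre-flight check, assuming only the /v1/health/ready endpoint and JSON keys shown in the health sample above; the helper name and error handling are hypothetical, not an existing nv-ingest-client API:

import requests

def assert_dependencies_ready(base_url, required=("paddle_ready",)):
    """Hypothetical fail-fast helper: raise if /v1/health/ready reports a required component unready."""
    resp = requests.get(f"{base_url}/v1/health/ready", timeout=10)
    status = resp.json() if resp.content else {}
    unready = [key for key in required if not status.get(key, False)]
    if resp.status_code != 200 or unready:
        raise RuntimeError(
            f"NV-Ingest dependencies unready (HTTP {resp.status_code}): {unready}; "
            "check PADDLE_HTTP_ENDPOINT / egress / auth before submitting jobs."
        )

# Usage (hypothetical): call before building and submitting jobs.
# assert_dependencies_ready("http://<gw-host>:80")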

Workarounds for users (until fixed)

  • Disable Paddle-dependent tasks (e.g., image/infographic extraction, NV‑Ingest captions) or explicitly set a finite max_job_retries (see the usage sketch after this list).
  • Ensure PADDLE_HTTP_ENDPOINT and credentials are configured and reachable from within the cluster.
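
Until the defaults change, passing a finite max_job_retries to ingest() caps the retry loop. A minimal usage sketch, assuming the ingestor built in the repro sketch above (values are illustrative):

# Cap retries explicitly instead of relying on the None default.
results = ingestor.ingest(
    show_progress=True,
    timeout=100,         # per-fetch timeout in seconds (matches DEFAULT_TIMEOUT above)
    max_job_retries=60,  # finite retry budget instead of requeueing on 202 indefinitely
)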

Acceptance criteria

  • Submissions that require an unready dependency are rejected with 503 and a clear message, or client fails fast with an actionable error.
  • Default client behavior does not loop indefinitely when dependencies are down.
  • Documentation clarifies required env vars and readiness gating behavior.


Minimum reproducible example


Relevant log output


Other/Misc.

No response

devinbost avatar Sep 22 '25 23:09 devinbost

Another (slightly related) issue I want to surface: I've noticed in general (from past successful runs) that nv-ingest will just sit there without any logging even when it's processing successfully, which makes it really hard to know when it's having issues. I often need to wait tens of minutes before I know if it's having an issue with a batch.

devinbost avatar Sep 23 '25 11:09 devinbost

I just discovered that our helm upgrade hadn't fully completed, so nv-ingest pods were still at the previous release version. I just upgraded and will try it again on 25.9.0

devinbost avatar Sep 23 '25 16:09 devinbost

It appears that the infinite loop behavior still exists, though we fixed the paddle issue. It seems that if anything goes wrong during ingestion (embedding service unreachable, token limit exceeded, etc.), even transiently, nv-ingest will just spin forever without progressing or logging any issues.

devinbost avatar Sep 23 '25 18:09 devinbost