Add timeout limit to document parsing job.
Requested feature
We need a way to set a timeout parameter when processing a document. In very rare cases, certain documents take a very long time to convert. In a batch processing job, this can become a problem.
Example use case:
Checking the attached PDF, it is no surprise that we see a very long conversion time. It is fully scanned and has a lot of pages, which is very slow, at least on CPU.
Generally, there are multiple strategies to avoid such samples clogging a bulk conversion pipeline.
- One can run over all docs with OCR off, and later rerun only those docs where the conversion result is empty (i.e. those that may need OCR). This is already possible with the current version.
- We can extend docling to optionally stop converting a doc when a timeout is reached. This timeout can only be checked after each page batch (i.e. after multiples of 4 pages with the defaults). This would be reflected as a `PARTIAL_SUCCESS` status. User code could either export the partial result or drop the document.
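
To illustrate the first strategy, here is a minimal sketch of the two-pass idea. Note that `two_pass_convert`, `convert`, and `fake_convert` are hypothetical names made up for this example and are not docling APIs; `convert(path, ocr)` stands in for whatever conversion call your pipeline uses and is assumed to return the extracted text.

```python
from typing import Callable, Dict, Iterable


def two_pass_convert(
    paths: Iterable[str],
    convert: Callable[[str, bool], str],
) -> Dict[str, str]:
    """Convert all documents without OCR, then retry only the empty ones with OCR."""
    paths = list(paths)
    # Pass 1: fast conversion with OCR switched off.
    results = {p: convert(p, False) for p in paths}
    # Pass 2: an empty result suggests a scanned document, so retry with OCR on.
    for p in paths:
        if not results[p].strip():
            results[p] = convert(p, True)
    return results


# Toy converter (an assumption for the demo): "scanned" files yield no text without OCR.
ocr_calls = []


def fake_convert(path: str, ocr: bool) -> str:
    if ocr:
        ocr_calls.append(path)
    if path.startswith("scanned") and not ocr:
        return ""
    return f"text of {path}"


out = two_pass_convert(["born_digital.pdf", "scanned.pdf"], fake_convert)
print(out)
print(ocr_calls)
```

Only the documents that came back empty pay the OCR cost, so a few scanned files no longer slow down the whole first pass.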
I am interested in this issue. Can you please assign this to me? Thanks :)
Are you working on this @nikos-livathinos ?
@ab-shrek great to see you are interested in helping out on this issue. Please submit a PR for our review. Here are some hints:
- Introduce a new parameter (e.g. `pdf_document_timeout`) in `PdfPipelineOptions` (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/datamodel/pipeline_options.py#L71).
- Implement the timeout logic in `PaginatedPipeline._build_document()` (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/pipeline/base_pipeline.py#L118).
- The timeout should apply to the PDF pipeline for the time needed to convert the entire document.
- We should check for a timeout after the conversion of each page chunk (but the check is for the whole document, not only for the current page chunk).
- When a timeout happens, the loop exits and `conv_res.status` should be set to `ConversionStatus.PARTIAL_SUCCESS`.
- Extend the docling CLI (https://github.com/DS4SD/docling/blob/main/docling/cli/main.py) to expose a command-line argument (e.g. `--document-timeout`) that sets `pdf_document_timeout` inside `PdfPipelineOptions`.
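
The loop described in the hints could look roughly like the sketch below. This is not the docling implementation: `ConversionStatus` and `ConversionResult` are simplified stand-ins, and `build_document`, `process_batch`, `document_timeout`, `clock`, and `FakeClock` are hypothetical names chosen for the example. It also assumes that a timeout detected after the last batch still counts as a full success, since nothing was skipped.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Sequence


class ConversionStatus(Enum):
    # Simplified stand-in with only the two values this sketch needs.
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


@dataclass
class ConversionResult:
    status: ConversionStatus = ConversionStatus.SUCCESS
    pages: List[str] = field(default_factory=list)


def build_document(
    pages: Sequence[str],
    process_batch: Callable[[Sequence[str]], List[str]],
    page_batch_size: int = 4,
    document_timeout: Optional[float] = None,  # seconds; None disables the check
    clock: Callable[[], float] = time.monotonic,
) -> ConversionResult:
    conv_res = ConversionResult()
    start = clock()  # the budget covers the whole document, not a single batch
    for i in range(0, len(pages), page_batch_size):
        conv_res.pages.extend(process_batch(pages[i : i + page_batch_size]))
        # The check runs only between batches, so a running batch is never interrupted.
        pages_remain = i + page_batch_size < len(pages)
        timed_out = document_timeout is not None and clock() - start > document_timeout
        if timed_out and pages_remain:
            conv_res.status = ConversionStatus.PARTIAL_SUCCESS
            break
    return conv_res


class FakeClock:
    """Deterministic clock so the demo does not depend on real elapsed time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()


def slow_batch(batch: Sequence[str]) -> List[str]:
    clock.now += 1.0  # pretend every batch takes one second
    return [page.upper() for page in batch]


pages = [f"page{n}" for n in range(10)]
partial = build_document(pages, slow_batch, document_timeout=1.5, clock=clock)
full = build_document(pages, slow_batch, clock=clock)
print(partial.status.name, len(partial.pages))
print(full.status.name, len(full.pages))
```

Because the check only happens between batches, the actual conversion time can exceed the timeout by up to the duration of one batch. Caller code can then inspect the status and decide whether to export the partial pages or drop the document.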
Great; thanks @nikos-livathinos. Let me get on this asap :)
This feature has been implemented in this PR https://github.com/DS4SD/docling/pull/552