docling Add timeout limit to document parsing job.

Requested feature

We need a way to set a timeout when processing a document. In rare cases, certain documents take a very long time to convert, which can become a problem in a batch processing job.

Example use case:

temp.pdf

Nov 07 '24 08:11 PeterStaar-IBM

Checking the attached PDF, the very long conversion time is not a surprise. The document is fully scanned and has a lot of pages, which makes conversion very slow, at least on CPU.

Generally, there are multiple strategies to avoid such samples clogging a bulk conversion pipeline.

One can run over all docs with OCR off, and later rerun only those docs where the conversion result is empty (i.e. the doc may need OCR). This is already possible with the current version.
We can extend docling to optionally stop converting a doc when a timeout is reached. This timeout can only be checked after each page batch completes (i.e. after multiples of 4 pages with the defaults). This would be reported as status PARTIAL_SUCCESS. User code could then either export the partial result or drop the document.
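
As a rough illustration of the first strategy, here is a minimal sketch of a two-pass loop. It does not call docling directly: `two_pass_convert` and the `convert(path, use_ocr)` callable are hypothetical names standing in for whatever conversion call the batch job uses, and "empty result" is assumed to mean that the extracted text is blank.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def two_pass_convert(
    paths: Iterable[str],
    convert: Callable[[str, bool], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Convert all docs with OCR off, then retry only the empty ones with OCR on.

    `convert(path, use_ocr)` is a hypothetical stand-in for the real
    conversion call; it is assumed to return the extracted text.
    """
    results: Dict[str, str] = {}
    needs_ocr: List[str] = []

    # Pass 1: fast conversion without OCR.
    for path in paths:
        text = convert(path, False)
        if text.strip():
            results[path] = text
        else:
            # Blank result: the doc is probably scanned and may need OCR.
            needs_ocr.append(path)

    # Pass 2: slow OCR conversion, only for docs that came back empty.
    for path in needs_ocr:
        results[path] = convert(path, True)

    return results, needs_ocr
```

The second return value lists the docs that were retried, so a batch job can schedule them separately or give them a larger time budget.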

Nov 07 '24 13:11 cau-git

I am interested in this issue. Can you please assign this to me? Thanks :)

Nov 11 '24 00:11 ab-shrek

Are you working on this @nikos-livathinos ?

Nov 11 '24 09:11 ab-shrek

@ab-shrek great to see you are interested in helping out on this issue. Please submit a PR for our review. Here are some hints:

Introduce a new parameter (e.g. pdf_document_timeout) in PdfPipelineOptions (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/datamodel/pipeline_options.py#L71)
Implement the timeout logic in the PaginatedPipeline._build_document() (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/pipeline/base_pipeline.py#L118)
- The timeout should apply to the PDF pipeline for the time needed to convert the entire document.
- We should check for a timeout after the conversion of each page chunk (the check applies to the whole document, not only to the current page chunk).
- When a timeout happens, the loop exits and conv_res.status should be set to ConversionStatus.PARTIAL_SUCCESS.
Extend the docling CLI (https://github.com/DS4SD/docling/blob/main/docling/cli/main.py) to expose a cmd argument (e.g. --document-timeout) that sets the pdf_document_timeout inside the PdfPipelineOptions.
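
The timeout logic described in these hints could look roughly like the sketch below. It is a standalone illustration, not docling code: `Status`, `chunked`, `build_document`, `process_batch` and the `clock` parameter are hypothetical names, and the batch size of 4 mirrors the default mentioned earlier in the thread. The elapsed time is measured from the start of the document and checked only after each page batch completes.

```python
import time
from enum import Enum
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


class Status(Enum):
    # Hypothetical stand-in for docling's ConversionStatus.
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


def chunked(pages: Sequence[int], size: int) -> Iterable[Sequence[int]]:
    """Yield consecutive page batches of at most `size` pages."""
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def build_document(
    pages: Sequence[int],
    process_batch: Callable[[Sequence[int]], None],
    document_timeout: Optional[float] = None,
    batch_size: int = 4,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Status, List[int]]:
    """Process pages batch by batch and stop once the document budget is spent."""
    start = clock()  # the budget covers the whole document, not a single batch
    done: List[int] = []

    for batch in chunked(pages, batch_size):
        process_batch(batch)
        done.extend(batch)

        # The timeout can only be observed between batches, so a slow batch
        # may overshoot the limit before the check runs.
        elapsed = clock() - start
        if document_timeout is not None and elapsed > document_timeout:
            # Following the hint above: exit the loop and report a partial result.
            return Status.PARTIAL_SUCCESS, done

    return Status.SUCCESS, done
```

Calling `build_document(pages, process_batch, document_timeout=60.0)` returns the status together with the pages that were actually converted, so the caller can either export the partial result or drop the document. The clock is injectable only to make the sketch easy to test.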

Nov 12 '24 11:11 nikos-livathinos

Great; thanks @nikos-livathinos. Let me get on this asap :)

Nov 12 '24 15:11 ab-shrek

This feature has been implemented in PR https://github.com/DS4SD/docling/pull/552

Dec 11 '24 14:12 nikos-livathinos