docling
docling copied to clipboard
feat: Add timeout limit to document parsing job. #270
Testing: (.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 87584.07it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 24.12 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 24.13 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 29037.49it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf WARNING:docling.pipeline.base_pipeline:Document processing time (6 s) exceeded the specified timeout of 5 s INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.82 sec. WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpzedg349h/2206.01062v1.pdf failed to convert. INFO:docling.cli.main:Processed 1 docs, of which 1 failed INFO:docling.cli.main:All documents were converted in 10.82 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 INFO:docling.document_converter:Going to convert document batch... Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 88197.98it/s] INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 22.59 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 22.60 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling
Usage: docling [OPTIONS] source
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ --from [docx|pptx|html|image|pdf|asciidoc|md] Specify input formats to convert from. Defaults to all formats. [default: None] │ │ --to [md|json|text|doctags] Specify output formats. Defaults to Markdown. [default: None] │ │ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │ │ --force-ocr --no-force-ocr Replace any existing text with OCR generated text over the full content. [default: no-force-ocr] │ │ --ocr-engine [easyocr|tesseract_cli|tesseract] The OCR engine to use. [default: easyocr] │ │ --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use. [default: dlparse_v1] │ │ --table-mode [fast|accurate] The mode to use in the table structure model. [default: fast] │ │ --artifacts-path PATH If provided, the location of the model artifacts. [default: None] │ │ --abort-on-error --no-abort-on-error If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error] │ │ --output PATH Output directory where results are saved. [default: .] │ │ --version Show version information. │ │ --document-timeout INTEGER The timeout for processing each document, in seconds. [default: None] │ │ --help Show this message and exit. │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Resolves #270
Checklist:
- [ ] Commit Message Formatting: Commit titles and messages follow guidelines in the conventional commits.
- [ ] Documentation has been updated, if necessary.
- [ ] Examples have been added, if necessary.
- [ ] Tests have been added, if necessary.
@ab-shrek thank you for your PR. Here are some points:
- It would be better to measure the timeout in fractional seconds. Then the type of
pipeline_options.document_timeoutshould be float. - The signature of the
_build_document()method must remain as it is. Thepipeline_optionsare already available as a member of thePaginatedPipelineclass, no need to pass the timeout as an additional method parameter. You can just use theself._pipeline_options.document_timeoutinside thePaginatedPipeline._build_document(). - When measuring elapsed time it is better to use
time.monotonic()instead oftime.time(). It would be nice to improve also the existing code and use themonotonic()for the measurement of the elapsed time of page batch used for the logging. - Another approach is to measure the elapsed time for the page batch only once inside the loop and sum up to have the elapsed time for the document conversion up to the current page batch.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X]
title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
@ab-shrek thank you for taking the time to implement my comments. It would be nice to follow the next steps:
- Update your forked repo with the upstream Docling main.
- Install and run the
pre-commit(https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md#coding-style-guidelines)
MyPy complains about 2 issues:
- The
total_elapsed_timeshould be float. @ab-shrek please initialize it astotal_elapsed_time = 0.0. - The
document_timeoutshould be defined in the mother classPipelineOptionsand not inPdfPipelineOptions. I propose to actually move thedocument_timeoutin the base classPipelineOptionseven though only thePaginatedPipelinewill currently provide an implementation. @cau-git please comment on this.
@ab-shrek as we want to merge this PR soon, please feel free to implement the steps from my previous comment.
@nikos-livathinos let me get on this asap. Thanks
@ab-shrek very nice, now all tests pass!
You only need to update again your forked repo with the upstream Docling as there are some conflicts in cli/main.py.
I guess after this we are ready to merge this PR.