Support concurrency
Can multi-threading be supported for document conversion to shorten the conversion time?
```
INFO:docling.document_converter:Finished converting page batch time=8.315
INFO:docling.document_converter:Finished converting page batch time=9.202
INFO:docling.document_converter:Finished converting page batch time=5.875
INFO:docling.document_converter:Finished converting page batch time=5.151
INFO:docling.document_converter:Finished converting page batch time=1.420
INFO:docling.document_converter:Finished converting document time-pages=30.02/17
```
@liepinlxy Yes, this will be supported in the near future. We are first focusing on throughput (as many files as possible) rather than fast time-to-conversion.
@liepinlxy, additionally you can set the batch concurrency settings [here](https://github.com/DS4SD/docling/blob/main/docling/datamodel/settings.py) to speed up your batch conversion.
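As a sketch of what tuning those settings might look like: docling's `settings.py` exposes a module-level `settings` object (built with Pydantic) whose `perf` section holds the batch sizes. The stand-in classes below just mirror that file so the snippet runs without docling installed; with docling you would import the real `settings` instead.

```python
from dataclasses import dataclass

# Stand-in mirroring docling/datamodel/settings.py (which uses Pydantic);
# defined locally so this sketch runs standalone.
@dataclass
class BatchConcurrencySettings:
    doc_batch_size: int = 2
    doc_batch_concurrency: int = 2
    page_batch_size: int = 4
    page_batch_concurrency: int = 2
    elements_batch_size: int = 16

@dataclass
class AppSettings:
    perf: BatchConcurrencySettings

# With docling installed, you would instead do:
#   from docling.datamodel.settings import settings
settings = AppSettings(perf=BatchConcurrencySettings())

# Raise the batch sizes before creating your DocumentConverter
settings.perf.doc_batch_size = 8
settings.perf.page_batch_size = 16
```

Note that, as discussed below in this thread, these knobs only help where the pipeline actually honors them; in the versions discussed here the thread pool was disabled.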
@maxmnemonic, thank you. However, I observe that these settings are not actually used in the code, and a note there indicates that use of the thread pool is disabled.
Hi @PeterStaar-IBM!
First of all, thank you, the Deep Search team and IBM for open-sourcing Docling! The results are truly mesmerizing, and the fact that it's free to use is amazing. You guys are awesome!
Regarding the topic, do you have any updates on when this feature might be available? Document conversion is currently quite slow, and multithreading would make a significant difference.
@jmmfcoutinho We have released (or will soon release) a few updates to deal with speed:
- v2.10.0: new PDF parser backend with a 10x speedup. We believe this will make overall PDF processing about 30% faster
- v2.11.0: new ability to use accelerators more efficiently. This should speed docling up by another few factors
Hi @PeterStaar-IBM,
Thank you for your (super fast) response. A 10x improvement is indeed impressive! I wasn't aware of this enhancement, so I appreciate the information.
I have two follow up questions then:
- Could you please provide guidance (example code or a file link) on configuring the system to utilize the v2.10.0 backend?
- Also, any rough estimate to when v2.11.0 might come out?
> A 10x improvement is indeed impressive!

Look here: https://github.com/DS4SD/docling-parse?tab=readme-ov-file#performance-benchmarks
For the first question, you just need to repin the docling version or reinstall the latest (e.g. `poetry add docling@latest`). v2.11 should come out later this week.
Hello @PeterStaar-IBM, thanks for the great work and for contributing Docling to the community. Such an amazing tool. A quick question, though: currently Docling relies on PyPDFium to read PDFs in the document_converter, which still makes Docling a non-thread-safe choice. Is there an alternative to use here, or anything on your roadmap that might be a potential solution? Thanks!
> @liepinlxy, additionally you can set the batch concurrency settings [here](https://github.com/DS4SD/docling/blob/main/docling/datamodel/settings.py) to speed up your batch conversion.
I also attempted to increase the following settings:
```python
class BatchConcurrencySettings(BaseModel):
    doc_batch_size: int = 2
    doc_batch_concurrency: int = 2
    page_batch_size: int = 4
    page_batch_concurrency: int = 2
    elements_batch_size: int = 16
```
However, the speed did not improve. Eventually, I discovered that parallel processing is still disabled.
So, any solutions for this?
@sunil448832 we are not looking into in-process parallelization yet, because our early experiments showed there is nothing to gain from it. Everything computed in docling is either subject to the GIL or delegated to torch and other runtimes, which already exploit internal parallelism. There may be an opportunity with the free-threaded build of Python 3.13 once it is mature enough; we will revisit the topic when the time arrives.
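Since in-process threading is blocked by the GIL, the practical workaround today is process-based parallelism: one converter per worker process. A minimal sketch with the standard library, where `convert_one` is a placeholder for a real docling call (the commented docling lines are an assumption about your usage, not part of this runnable stub):

```python
from concurrent.futures import ProcessPoolExecutor

def convert_one(path: str) -> str:
    # Placeholder for real work; with docling this might be something like:
    #   from docling.document_converter import DocumentConverter
    #   result = DocumentConverter().convert(path)
    #   return result.document.export_to_markdown()
    return f"converted:{path}"

def convert_all(paths, workers=4):
    # Each worker is a separate OS process, so conversions bypass the GIL.
    # Trade-off: each process loads its own models, so memory scales with workers.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, paths))

if __name__ == "__main__":
    print(convert_all(["a.pdf", "b.pdf", "c.pdf"], workers=2))
```

Creating the converter inside each worker (rather than passing one in) also sidesteps the thread-safety concerns with PyPDFium mentioned above.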
@cau-git Is this issue supposed to be closed? It doesn't seem like it, judging from your last message.
Can someone help me increase `doc_batch_size`? I am passing 16 document paths to the converter, but only 2 are processed at a time, which makes converting a large number of files very slow. Could anyone share example code for doing batch conversion?
> @liepinlxy, additionally you can set the batch concurrency settings [here](https://github.com/DS4SD/docling/blob/main/docling/datamodel/settings.py) to speed up your batch conversion.

How can I increase this? Can you help?
Hello! Was anyone able to parallelise the converter execution?
You could split large documents into parts, say 10 pages each, and run workers in child processes to handle them in parallel. In that case, however, memory consumption grows by a factor of N (one converter per worker). I did this recently in https://github.com/artiz/kate-chat/pull/23 (see the changes in document_processor).
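The splitting step above can be sketched in plain Python: partition the page count into fixed-size ranges and fan them out to a process pool. `convert_range` is a stub standing in for whatever converts one part (a pre-split PDF file, for instance); the chunking and pool logic is the part being illustrated.

```python
from concurrent.futures import ProcessPoolExecutor

def chunk_pages(total_pages: int, chunk_size: int = 10):
    """Yield (first, last) 1-based page ranges of at most chunk_size pages."""
    for start in range(1, total_pages + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total_pages)

def convert_range(page_range):
    first, last = page_range
    # Placeholder: a real worker would convert only pages first..last,
    # e.g. a PDF part produced by pre-splitting the source file.
    return f"pages {first}-{last}"

def convert_in_parts(total_pages: int, chunk_size: int = 10, workers: int = 4):
    ranges = list(chunk_pages(total_pages, chunk_size))
    # N concurrent workers means roughly N x the memory of a single converter.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_range, ranges))
```

The resulting parts come back in page order (`pool.map` preserves input order), so stitching the partial outputs back together is straightforward.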