
Support concurrency

Open liepinlxy opened this issue 1 year ago • 7 comments

Can multi-threading be supported for document conversion to shorten the conversion time?

INFO:docling.document_converter:Finished converting page batch time=8.315
INFO:docling.document_converter:Finished converting page batch time=9.202
INFO:docling.document_converter:Finished converting page batch time=5.875
INFO:docling.document_converter:Finished converting page batch time=5.151
INFO:docling.document_converter:Finished converting page batch time=1.420
INFO:docling.document_converter:Finished converting document time-pages=30.02/17

liepinlxy avatar Sep 30 '24 03:09 liepinlxy

@liepinlxy Yes, this will be supported in the near future. We are first focusing on throughput (as many files as possible) rather than fast time-to-conversion.

PeterStaar-IBM avatar Sep 30 '24 10:09 PeterStaar-IBM

@liepinlxy, additionally you can set the batch concurrency settings here to speed up your batch conversion: https://github.com/DS4SD/docling/blob/main/docling/datamodel/settings.py

maxmnemonic avatar Sep 30 '24 11:09 maxmnemonic

@maxmnemonic, thank you. I see that these settings are not used in the code, and a note there says that the use of the thread pool is disabled. (Screenshot attached.)

liepinlxy avatar Oct 08 '24 12:10 liepinlxy

Hi @PeterStaar-IBM!

First of all, thank you, the Deep Search team and IBM for open-sourcing Docling! The results are truly mesmerizing, and the fact that it's free to use is amazing. You guys are awesome!

Regarding the topic, do you have any updates on when this feature might be available? Document conversion is currently quite slow, and multithreading would make a significant difference.

jmmfcoutinho avatar Dec 10 '24 14:12 jmmfcoutinho

@jmmfcoutinho We have released, or are about to release, a few updates to deal with speed:

  1. v2.10.0: new PDF parser backend with a 10x speedup. We believe this will allow 30% faster PDF processing
  2. v2.11.0: new ability to use accelerators more efficiently. This should accelerate docling by another couple of factors

PeterStaar-IBM avatar Dec 10 '24 15:12 PeterStaar-IBM

Hi @PeterStaar-IBM,

Thank you for your (super fast) response. A 10x improvement is indeed impressive! I wasn't aware of this enhancement, so I appreciate the information.

I have two follow-up questions then:

  1. Could you please provide guidance (example code or a file link) on configuring the system to use the v2.10.0 backend?
  2. Also, do you have a rough estimate of when v2.11.0 might come out?

jmmfcoutinho avatar Dec 10 '24 17:12 jmmfcoutinho

Regarding "A 10x improvement is indeed impressive!", look here: https://github.com/DS4SD/docling-parse?tab=readme-ov-file#performance-benchmarks

For 1, you just need to repin the version of docling or reinstall the latest (e.g. poetry add docling@latest). v2.11 should come out later this week.

PeterStaar-IBM avatar Dec 11 '24 06:12 PeterStaar-IBM

Hello @PeterStaar-IBM, thanks for the great work and contribution to the community with Docling. Such an amazing tool. A quick question though. Currently Docling relies on PyPDFium to read PDFs in the document_converter, which still makes Docling a non-thread-safe choice. Is there an alternative to use here, or is there anything upcoming in your roadmap that might be a potential solution to this? Thanks!

dil-sjabbar avatar Jan 03 '25 18:01 dil-sjabbar

@liepinlxy, additionally you can set the batch concurrency settings here to speed up your batch conversion: https://github.com/DS4SD/docling/blob/main/docling/datamodel/settings.py

I also attempted to increase the following settings:

class BatchConcurrencySettings(BaseModel):
    doc_batch_size: int = 2
    doc_batch_concurrency: int = 2
    page_batch_size: int = 4
    page_batch_concurrency: int = 2
    elements_batch_size: int = 16

However, the speed did not improve. Eventually, I discovered that parallel processing is still disabled.

Screenshot from 2025-01-14 11-18-24

So, any solutions for this?

sunil448832 avatar Jan 14 '25 05:01 sunil448832

@sunil448832 we are not looking into in-process parallelization yet, because there is nothing to gain from it. This is what we took from early experiments. Everything computed in docling is either subject to the GIL, or delegated to torch and other runtimes, which already exploit internal parallelism. There may be an opportunity with the free-threaded build of Python 3.13, once that is mature enough. We will revisit the topic when the time arrives.

cau-git avatar Jan 30 '25 14:01 cau-git
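
For readers who want to see which case applies to their own interpreter, here is a minimal sketch that reports whether it is a free-threaded build and whether the GIL is currently active. It uses only the standard library; the function name `describe_threading_build` is made up for this illustration, and `sys._is_gil_enabled()` is only present from Python 3.13 onwards, so the sketch assumes the GIL is on when that function is missing.

```python
import sys
import sysconfig


def describe_threading_build() -> str:
    """Report whether this interpreter can run Python threads in parallel."""
    # Py_GIL_DISABLED is 1 only on free-threaded builds; it is 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists from Python 3.13; assume the GIL is on if it is missing.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = gil_check() if gil_check is not None else True
    if free_threaded_build and not gil_enabled:
        return "free-threaded build, GIL disabled"
    if free_threaded_build:
        return "free-threaded build, GIL re-enabled at runtime"
    return "standard build, GIL enabled"


if __name__ == "__main__":
    print(describe_threading_build())
```

On a standard build, threads inside one process take turns on pure-Python work, which is consistent with the explanation above that in-process threading brings little for docling today.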

@cau-git Is this issue supposed to be closed? It doesn't seem so, judging from your last message.

sanmai-NL avatar Feb 13 '25 07:02 sanmai-NL

Can someone help me increase doc_batch_size? I am passing 16 document paths to the converter and only 2 are being processed, so converting a large number of files is very slow. Can anyone share example code showing how to do the batch conversion?

amal5haji avatar Feb 20 '25 17:02 amal5haji

@liepinlxy, additionally you can set the batch concurrency settings here to speed up your batch conversion: https://github.com/DS4SD/docling/blob/main/docling/datamodel/settings.py

How can I increase this? Can you help?

amal5haji avatar Feb 20 '25 17:02 amal5haji
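
One possible answer, as a rough sketch rather than an official recipe: the values quoted earlier in this thread appear to live on a module-level `settings` object in docling/datamodel/settings.py, so they can be overridden before converting. The sketch assumes that `settings.perf` is the `BatchConcurrencySettings` instance and that `DocumentConverter.convert_all` is available in the installed version. Since the thread reports that the concurrency values are not used, it also spreads the files over separate worker processes. The helper names `split_round_robin`, `convert_share` and `convert_in_processes` are invented for this example.

```python
from concurrent.futures import ProcessPoolExecutor


def split_round_robin(paths, n_workers):
    """Deal paths into at most n_workers roughly equal, non-empty shares."""
    shares = [paths[i::n_workers] for i in range(n_workers)]
    return [share for share in shares if share]


def convert_share(paths):
    """Convert one share of files inside a worker process."""
    # Imported here so that each worker process loads docling (and its models) itself.
    from docling.datamodel.settings import settings
    from docling.document_converter import DocumentConverter

    # Assumption: settings.perf is the BatchConcurrencySettings instance quoted in this thread.
    settings.perf.doc_batch_size = len(paths)
    converter = DocumentConverter()
    results = converter.convert_all(paths, raises_on_error=False)
    return [(str(r.input.file), r.status.name) for r in results]


def convert_in_processes(paths, n_workers=4):
    """Run one converter per process; memory use grows with n_workers."""
    shares = split_round_robin(list(paths), n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return [item for share in pool.map(convert_share, shares) for item in share]


if __name__ == "__main__":
    import sys

    # Usage: python convert_many.py a.pdf b.pdf c.pdf ...
    for name, status in convert_in_processes(sys.argv[1:]):
        print(name, status)
```

Each worker loads its own copy of the models, so the number of workers is bounded by available memory (and GPU memory, if an accelerator is used) rather than by the number of CPU cores alone.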

Hello! Was anyone able to parallelise the converter execution?

mariaapostigo avatar Jun 19 '25 15:06 mariaapostigo

You could split large documents into parts, for example 10 pages each, and run workers in child processes to process them in parallel. In this case, however, memory consumption is multiplied by the number of workers. I did this recently in https://github.com/artiz/kate-chat/pull/23 (changes in document_processor)

artiz avatar Nov 21 '25 17:11 artiz
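
The page-splitting idea above can be sketched as follows. This is not the code from the linked pull request: instead of physically splitting the PDF, it assumes that the installed docling version accepts a `page_range=(first, last)` argument in `DocumentConverter.convert`, with 1-based inclusive page numbers. The helper names `page_ranges`, `convert_range` and `convert_in_chunks` are hypothetical, and the caller is expected to know the page count.

```python
from concurrent.futures import ProcessPoolExecutor


def page_ranges(num_pages, chunk_size=10):
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    return [
        (first, min(first + chunk_size - 1, num_pages))
        for first in range(1, num_pages + 1, chunk_size)
    ]


def convert_range(job):
    """Convert one page range of one file in a child process."""
    path, page_range = job
    # Imported here so that each child process loads docling itself.
    from docling.document_converter import DocumentConverter

    # Assumption: convert() accepts page_range=(first, last) in the installed version.
    result = DocumentConverter().convert(path, page_range=page_range)
    return page_range, result.document.export_to_markdown()


def convert_in_chunks(path, num_pages, chunk_size=10, n_workers=4):
    """Convert a document chunk by chunk and join the Markdown in page order."""
    jobs = [(path, r) for r in page_ranges(num_pages, chunk_size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(convert_range, jobs))  # map() preserves input order
    return "\n\n".join(markdown for _, markdown in parts)
```

For a 25-page document with the default chunk size, `page_ranges(25)` yields (1, 10), (11, 20) and (21, 25). Content that spans a chunk boundary, such as a table continued on the next page, may come out split, so the chunk size is a trade-off between parallelism and fidelity.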