[Feature] Add parallel processing and multi-threading to docling2parquet
Search before asking
- [x] I searched the issues and found no similar issues.
Component
Python Runtime
Feature
Hello team, I have recently been testing docling2parquet across multiple PDF backends: 1. PyPDFium, 2. DLPARSE_V2, 3. DLPARSE_V4, each with OCR on and OCR off. I was recently discussing some issues with @shivdeep-singh-ibm, who reminded me that parallel processing is the USP of DPK and suggested that splitting these PDFs before sending them could be faster than sending the whole 100+ pages at once. So I started exploring whether this could be the solution.
While testing this, I came across https://landing.ai/agentic-document-extraction. They claim impressive speeds and excellent layout handling. Some of their code is open source, but not all of it: they ultimately chunk the PDFs (splitting them into pages) using multi-threading and send the chunks to an API hosted on their servers for processing, whereas we are running on raw CPU compute power.
Anyway, I tried the parallel processing idea and used multi-threading to process and convert chunks of the split PDFs (5 pages each for now). This reduced the time somewhat, but I noticed that for huge PDFs (1000+ pages) each split should contain more pages, so I am trying to adjust the chunk size dynamically. Currently, the document type is determined from the MIME type identified via the file extension and recognized by Docling itself.
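For concreteness, here is a minimal sketch of the kind of external splitting layer I mean, assuming `pypdf` for the page splitting and Docling's `DocumentStream` for in-memory chunks. The `pick_chunk_size` heuristic, the helper names, and the file name are illustrative assumptions, not the actual implementation:

```python
# Illustrative sketch: split a PDF into n-page chunks and convert them on a
# thread pool. Chunk-size thresholds below are assumed, not final.
import io
from concurrent.futures import ThreadPoolExecutor, as_completed

from pypdf import PdfReader, PdfWriter
from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter


def pick_chunk_size(num_pages: int) -> int:
    # Assumed heuristic: bigger documents get bigger chunks so a 1000+ page
    # PDF does not explode into hundreds of tiny tasks.
    if num_pages <= 100:
        return 5
    return 20 if num_pages <= 1000 else 50


def split_pdf(path: str) -> list[bytes]:
    # Split the PDF into n-page sub-PDFs, each kept in memory as bytes.
    reader = PdfReader(path)
    size = pick_chunk_size(len(reader.pages))
    chunks = []
    for start in range(0, len(reader.pages), size):
        writer = PdfWriter()
        for page in reader.pages[start:start + size]:
            writer.add_page(page)
        buf = io.BytesIO()
        writer.write(buf)
        chunks.append(buf.getvalue())
    return chunks


def convert_chunk(idx: int, chunk: bytes) -> tuple[int, str]:
    # Each task converts one small sub-PDF; the index travels with the result.
    converter = DocumentConverter()
    stream = DocumentStream(name=f"chunk_{idx}.pdf", stream=io.BytesIO(chunk))
    result = converter.convert(stream)
    return idx, result.document.export_to_markdown()


chunks = split_pdf("big_document.pdf")
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(convert_chunk, i, c) for i, c in enumerate(chunks)]
    for future in as_completed(futures):  # note: arrival order is arbitrary
        idx, text = future.result()
        print(f"chunk {idx} finished ({len(text)} chars)")
```

(In a real pipeline you would want to reuse converters across chunks rather than build one per task, since model loading is not free.)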
Now for the good part: if I turn off OCR and use a good parsing backend like DLPARSE_V2 or DLPARSE_V4 from Docling along with multi-threading, then even with the Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time, the whole 24-page PDF was converted in about 5 seconds, tables included! For reference, the same PDF with our current method takes a good 100 seconds on the same compute. That is roughly a 20x speedup for the same parsing quality. It would be great to expose this as a fast-mode option for docling2parquet. It also doesn't touch anything inside Docling; we are just optimizing how we use it, so I don't think we would need to fix anything in Docling first.
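For reference, the OCR-off configuration I mean looks roughly like this in Docling's public API (a sketch of the converter setup only, not the docling2parquet integration; the file name is just an example):

```python
# Sketch: configure Docling for "fast mode" (OCR off, DLPARSE_V2 backend).
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False             # skipping OCR is where the speed comes from
pipeline_options.do_table_structure = True  # tables are still extracted

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,  # the DLPARSE_V2 backend
        )
    }
)
result = converter.convert("sample_24_pages.pdf")
print(result.document.export_to_markdown())
```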
Now, there are a few things I observed during this testing:
- OCR is the real culprit behind the slow speed, which @sujee pointed out first in #573. I suspected it back then, but after testing things very thoroughly, it is 100% the OCR that is causing it.
- This multi-threading approach suffers when you turn OCR on. The simple reason is that OCR is heavy enough to occupy all the CPU cores combined; I am sending smaller loads, but with multi-threading each chunk effectively gets a single core, and the work never finishes in reasonable time because of how computationally expensive OCR is.
- With OCR off, this multi-threading approach parses the PDF extremely fast, as I said, but one drawback is that the parts of the PDF can complete in any order: for example, the first 5 pages can come last and the last 5 pages can come first, depending solely on which chunk gets a free core and finishes first. I am working on this, but currently I don't have a solution in mind. I don't think it will matter much for LLM or RAG-based applications, since they consume the context in chunks anyway.
@shahrokhDaijavad @shivdeep-singh-ibm and @sujee, I am tagging you all so I can get your point of view on this finding. Should we have this in DPK or not, or should we take it in a different direction? I would love to discuss this further and answer any questions. :)
I have currently tested it on only a few PDFs. I will probably try to add better handling for zips too, possibly optimizing the whole thing even further, but it is already quite fast.
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Thanks for your investigation, @ShiroYasha18. Let me summarize your findings: If we turn the OCR off and add multi-processing/multi-threading, you get a 20X performance improvement, right?
I have seen a way of adding multi-threading to Docling here: https://docling-project.github.io/docling/examples/run_with_accelerator/ Have you tried this? If not, can you please try this and compare it with what you have tested?
I am cc'ing @SohamDasgupta91, as he is also testing Docling in DPK.
Hi, thank you for looping me in. Just a quick question about the chunks being out of order: would it not be possible to index the chunks before the multi-threading and reorder them afterwards? Assuming each chunk is chronologically assigned, would this not work?
OCR issue related to #1239
Thank you for your reply, @shahrokhDaijavad !
> I have seen a way of adding multi-threading to Docling here: https://docling-project.github.io/docling/examples/run_with_accelerator/ Have you tried this? If not, can you please try this and compare it with what you have tested?
Yes, you are absolutely right. I've been seeing remarkable speeds with the multi-threading approach when OCR is off, leading to approximately a 20x performance increase for docling2parquet. We're talking about around 0.12 seconds per page, including tables.
Regarding your suggestion to try the multi-threading method from the Docling example (https://docling-project.github.io/docling/examples/run_with_accelerator/): I've checked it out and also tested the `accelerator_options`. My understanding is that this is primarily internal parallelism: when the `DocumentConverter` processes a single document, it uses `num_threads` to allocate multiple CPU 'helpers' or, if available, offloads work to a GPU (`device='cuda'`) to speed up that individual conversion. I tested this on my Mac (without a GPU) using only CPU `num_threads`, and the per-page speed was about 5.8 seconds with OCR on and 2.3 seconds with OCR off.
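For anyone who wants to reproduce the comparison, the setup from that example looks roughly like the following (per the linked Docling docs; `num_threads=8` is just the value I tried, not a recommendation):

```python
# Docling's internal parallelism, per the run_with_accelerator example.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8,                 # CPU helper threads for a single conversion
    device=AcceleratorDevice.CPU,  # or AcceleratorDevice.CUDA with a GPU
)
pipeline_options.do_ocr = True     # toggle to compare OCR-on vs OCR-off timings

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("sample.pdf")  # placeholder input file
```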
What I've been focusing on is essentially an external layer built on top of the existing system. My approach involves explicitly breaking down documents into customizable 'n' page segments, and then dispatching these individual chunks to multiple CPU threads for parallel processing. This method of concurrent chunk handling is what's truly enabling the significant speed gains I'm observing for docling2parquet.
@SohamDasgupta91, regarding the chunk ordering: you're spot on, the issue is that CPU threads process chunks based on availability, so the completion order is essentially random. Even if we index the chunks with a key-value pair, we still wouldn't know which one would arrive first. Currently, I'm simply writing whichever chunk completes first to the Parquet file. A potential solution could be to wait for all chunks to arrive and then write them to the Parquet file in their original order. However, this might slightly increase the overall processing time due to the waiting period.
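Something like this is what I have in mind for the "wait, then write in order" option; it reuses the `convert_chunk(i, chunk) -> (i, text)` shape from the earlier sketch, and `pyarrow` here stands in for whatever docling2parquet actually uses to write the table:

```python
# Sketch: gather all chunk results, restore original order, write once.
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.parquet as pq


def convert_in_order(chunks, convert_chunk, out_path, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(convert_chunk, i, c) for i, c in enumerate(chunks)]
        # Threads still finish in arbitrary order; collecting the results into
        # a dict keyed by chunk index restores page order. The cost is that
        # nothing is written until the slowest chunk has completed.
        results = dict(f.result() for f in futures)
    order = sorted(results)
    table = pa.table({
        "chunk_index": order,
        "contents": [results[i] for i in order],
    })
    pq.write_table(table, out_path)
```

The streaming alternative would be to write each chunk as it arrives, tagged with its index, and leave the reordering to the consumer, which is essentially @SohamDasgupta91's suggestion.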
@shahrokhDaijavad, I'd greatly appreciate your input on this.
I'm still testing, and the quality seems similar so far. However, when I tried one of the PDFs that @sujee recommended (https://github.com/sujee/data-prep-kit/blob/perf-1-pdf2pq/test/perf-pdf2pq/input/Walmart_2024.pdf), the output was garbled font characters and other gibberish. Could someone please check this specific PDF with the normal pipeline?
Finally, I've been pondering whether chunk ordering truly matters for the ultimate use case. If we can achieve arguably the fastest PDF ingestion in the industry (Landing AI claims 8 seconds for a PDF, and we're potentially doing a 100-page PDF with tables and images in under 5 seconds), and considering that LLM and RAG pipelines often ingest documents in smaller chunks anyway, perhaps strict ordering isn't always critical for the final application.
Looking forward to your thoughts!
Hi @ShiroYasha18,
We've been trying to implement the Docling accelerator options but have been unable to do so, as the prep kit pipeline does not pick up the functions. Could you share the code you used to implement this?
Also, would it be possible for you to share a sample pipeline in which you implemented the features you discussed above? It would be a great help, thank you!