Multi-threading and multi-processing for faster parsing

Open 17Reset opened this issue 9 months ago • 3 comments

Question

I am using the latest version of Docling to convert a PDF of about 300 pages that contains many images, tables, and code blocks. The conversion of a single file takes a very long time. Is there an effective way to speed it up?

Python Version: 3.12.3
docling Version: 2.28.2
OS: Ubuntu 24.04.2 LTS
(docling_env) xlab@xlab:/mnt/Agent/Docling$ docling --pipeline vlm --vlm-model smoldocling --device cuda -vv --from pdf --to md --num-threads 64 ug901-vivado-synthesis.pdf
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for VlmPipeline with options hash 46a3b6d819846b105bc3bda802a8f22e
INFO:docling.utils.accelerator_utils:Accelerator device: 'cuda:0'
DEBUG:docling.models.hf_vlm_model:Available device for HuggingFace VLM: cuda:0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/ds4sd/SmolDocling-256M-preview/revision/main HTTP/1.1" 200 4311
INFO:docling.pipeline.base_pipeline:Processing document ug901-vivado-synthesis.pdf
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82009.799
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82052.123
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82110.161
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82160.005
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82202.781
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82257.541
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82297.025
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82396.711
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82431.209
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82492.306
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82547.519
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82597.188
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82660.288
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82773.034
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82837.877
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=82906.373
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83022.037
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83081.207
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83126.706
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83173.282
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83235.069
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83445.735
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83510.407
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83568.254
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83639.581
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83688.871
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83746.215
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=83917.868
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84021.543
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84092.955
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84153.578
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84203.174
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84267.182
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84324.967
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84382.349
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84434.755
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84485.771
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84538.655
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84648.840
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84771.165
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84824.692
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84871.561
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84928.811
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=84980.159
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85051.836
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85114.926
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85177.891
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85302.097
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85435.572
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85492.312
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85557.887
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85623.079
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85670.693
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85722.790
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85789.051
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85846.423
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85889.070
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=85948.641
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86009.567
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86069.455
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86117.787
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86179.873
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86234.340
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86296.471
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86357.809
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86468.569
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86534.276
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86589.226
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86652.891
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86765.567
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86882.041
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86932.738
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=86971.571
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=87034.764
DEBUG:docling.pipeline.base_pipeline:Finished converting page batch time=87096.757
INFO:docling.document_converter:Finished converting document ug901-vivado-synthesis.pdf in 5145.76 sec.
INFO:docling.cli.main:writing Markdown output to ug901-vivado-synthesis.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 5201.73 seconds.
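
To see where the time goes, the log above can be summarized with a short script. This is only a sketch: it assumes the `time=` values on the DEBUG lines are increasing timestamps in seconds (the log does not state the unit, but the span from the first to the last value is close to the reported total), and the `summarize` helper is a hypothetical name, not part of Docling.

```python
import re
from statistics import mean

# Patterns for the two kinds of lines that appear in the log above.
# Assumption: `time=` is a timestamp in seconds (the unit is not stated).
BATCH_RE = re.compile(r"Finished converting page batch time=(\d+(?:\.\d+)?)")
TOTAL_RE = re.compile(r"Finished converting document .* in (\d+(?:\.\d+)?) sec")


def summarize(log_text: str) -> dict:
    """Return the batch count, mean gap between batches, and total seconds."""
    stamps = [float(m.group(1)) for m in BATCH_RE.finditer(log_text)]
    # Gap between consecutive "page batch" lines.
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    total = TOTAL_RE.search(log_text)
    return {
        "batches": len(stamps),
        "mean_gap": mean(gaps) if gaps else None,
        "total_sec": float(total.group(1)) if total else None,
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < docling.log
    print(summarize(sys.stdin.read()))
```

Saving the `-vv` output to a file and piping it into this script gives the average time between batch lines, which makes it easier to compare runs with different `--num-threads` or `--device` settings.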

17Reset avatar Mar 28 '25 03:03 17Reset

I have the same question about the VLM pipeline. Converting a one-page example PDF (with 2-3 formulas) takes about one minute, with very low GPU/CPU utilization.

The result is very GOOD!!!

Chapter 1 Elementary Probability Theory

We call elementary probability theory that part of probability theory which deals with probabilities of only a finite number of events. A. N. Kolmogorov, "Foundations of the Theory of Probability" [51]

1 Probabilistic Model of an Experiment with a Finite Number of Outcomes

    1. Let us consider an experiment of which all possible results are included in a finite number of outcomes ω$_{1}$, . . . , ω$_{N}$ . We do not need to know the nature of these outcomes, only that there are a finite number of N of them.

We call ω$_{1}$, . . . , ω$_{N}$ elementary events, or sample points, and the finite set

$$\Omega = \{ \omega _ { 1 }, \dots, \omega _ { N } \},$$

the (finite) space of elementary events or the sample space.

The choice of the space of elementary events is the first step in formulating a probabilistic model for an experiment. Let us consider some examples of sample spaces.

Example 1. For a single toss of a coin the sample space Ω consists of two points:

$$\Omega = \{ H, T \},$$

where H = "head" and T = "tail."

Example 2. For n tosses of a coin the sample space is

$$\Omega = \{ \omega \colon \omega = ( a _ { 1 }, \dots, a _ { n } ), \ a _ { i } = H \text{ or } T \}.$$

and the total number N(Ω) of outcomes is $2^{n}$.

© Springer Science+Business Media New York 2016 A.N. Shiryaev, Probability-I, Graduate Texts in Mathematics 95, DOI 10.1007/978-0-387-72206-1

qianjiaqiang avatar Mar 29 '25 09:03 qianjiaqiang

Facing the same issue

anthonyyoussef01 avatar Apr 03 '25 22:04 anthonyyoussef01

Same

Cerebex avatar May 27 '25 22:05 Cerebex

@dolfim-ibm @vagenas @cau-git @maxmnemonic @ceberam @PeterDaveHello @nikos-livathinos So sorry to tag you guys here, but I was wondering if there are any recommendations or updates on this old ticket. Thank you!

anthonyyoussef01 avatar Jun 16 '25 11:06 anthonyyoussef01

@anthonyyoussef01 We will soon start activities to address performance-related issues more holistically, and we will take the above comments into consideration.

cau-git avatar Jun 18 '25 08:06 cau-git

Thank you so much!

anthonyyoussef01 avatar Jun 18 '25 12:06 anthonyyoussef01

@anthonyyoussef01 We will soon start activities to address performance-related issues more holistically, and we will take the above comments into consideration.

Is there any update on this issue, or has a separate issue been created for it? I would love to contribute too. Thank you.

mandar-karhade avatar Oct 28 '25 17:10 mandar-karhade