GPU-Accelerated Batching of PDF Pages During Inference
Question
Is there a way to improve the inference latency of Docling on a GPU by passing a batch of page images as input to the different models (EasyOCR, Layout Detection, and TableFormer)?
I am using a single A10 GPU for inference, and it is significantly underutilized (~15%). It would be ideal if we could batch the page images to make better use of it.
Looking into the Docling documentation, I have tried increasing num_threads, but that seems to only apply to CPU inference, not to GPUs.
When I did a little digging into the code, I saw that Docling iterates over the pages in a page_batch and passes only a single page at a time as input to these models, like so:
def __call__(
    self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]:
    for page in page_batch:
        assert page._backend is not None
        if not page._backend.is_valid():
            yield page
        else:
            with TimeRecorder(conv_res, "layout"):
                assert page.size is not None
                page_image = page.get_image(scale=1.0)
                assert page_image is not None

                clusters = []
                for ix, pred_item in enumerate(
                    self.layout_predictor.predict(page_image)
                ):
                    label = DocItemLabel(
                        pred_item["label"]
                        .lower()
                        .replace(" ", "_")
                        .replace("-", "_")
                    )
                    ...
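To make the ask concrete, here is a small, self-contained illustration (plain PyTorch, not Docling code; the model and image sizes are just placeholders) of per-page calls versus a single batched forward pass:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model standing in for a layout/table model.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten()).to(device)
model.eval()

# Stand-ins for 16 page images of the same size.
pages = [torch.rand(3, 1024, 768) for _ in range(16)]

with torch.inference_mode():
    # Per-page: 16 tiny forward passes; the GPU sits mostly idle between them.
    per_page_out = [model(p.unsqueeze(0).to(device)) for p in pages]

    # Batched: one forward pass over all 16 pages; much better GPU utilization.
    batched_out = model(torch.stack(pages).to(device))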
It would be great if we could batch the page images and make full use of the GPU's capabilities. Looking forward to hearing back, thank you!
+1
+1
If you manage your queue correctly to avoid OOM, you can run multiple processes in parallel with docling-serve. It requires splitting the PDF on the client side. You will get approximately 2x performance, but it comes with a lot of ...
It only makes sense with a lot of data ...
The simplest way is to start Docling on (x) cores (multiprocessing), so it loads the models x times into your VRAM and you can process x PDF files in parallel. If you first split all PDFs into single pages, it would be a bit more efficient ... ;) For me it works ...
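A rough sketch of that multiprocessing approach, assuming Docling's documented DocumentConverter API (the file names and worker count are placeholders):

import multiprocessing as mp

from docling.document_converter import DocumentConverter

_converter = None

def _init_worker() -> None:
    # Each worker process loads its own copy of the models into VRAM.
    global _converter
    _converter = DocumentConverter()

def _convert_one(pdf_path: str) -> str:
    result = _converter.convert(pdf_path)
    return result.document.export_to_markdown()

if __name__ == "__main__":
    pdfs = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]  # your PDF (or single-page) files
    ctx = mp.get_context("spawn")                # "spawn" is safer than "fork" with CUDA
    with ctx.Pool(processes=3, initializer=_init_worker) as pool:  # x = 3 processes here
        outputs = pool.map(_convert_one, pdfs)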
It doesn't work IRL (tested and run in prod for a few weeks). First, the memory footprint is not homogeneous from one file to another, and you need to stay well below the memory limit to avoid OOM, which in our case meant at most 3 processes per 24 GB GPU. Second, load balancing of requests among processes is kind of the opposite of what is needed for low latency.
Basically, FastAPI as used in docling-serve is built on top of Gunicorn, which itself delegates load balancing to the Linux kernel (https://docs.gunicorn.org/en/latest/design.html#how-many-workers):
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Therefore, the master thread binds one socket, then forks n workers that inherit that socket. Because of that, workers that are already busy stay in the event loop longer, so they re-enter later and often get more of the new work! There is a nice discussion on this topic: https://github.com/encode/uvicorn/discussions/2467
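For reference, the rule of thumb quoted above amounts to something like this (note that it targets CPU-bound web workloads, not VRAM-bound model inference):

import os

# Gunicorn's suggested starting point: (2 x num_cores) + 1 workers.
workers = 2 * (os.cpu_count() or 1) + 1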
At the end of the day, a queue is the only way to manage things in an optimized way.
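A minimal sketch of that queue-based setup, again assuming Docling's documented DocumentConverter API; the file paths, queue size, and worker count here are placeholders:

import multiprocessing as mp

def worker(job_queue, result_queue) -> None:
    # Import inside the worker so the models are only loaded in the GPU processes.
    from docling.document_converter import DocumentConverter
    converter = DocumentConverter()              # one copy of the models per worker
    while True:
        pdf_path = job_queue.get()
        if pdf_path is None:                     # sentinel: no more work
            break
        result = converter.convert(pdf_path)
        result_queue.put((pdf_path, result.document.export_to_markdown()))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    jobs, results = ctx.Queue(maxsize=8), ctx.Queue()   # bounded queue caps in-flight work
    n_workers = 3                                        # e.g. 3 per 24 GB GPU, as noted above
    procs = [ctx.Process(target=worker, args=(jobs, results)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    pdfs = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
    for pdf in pdfs:
        jobs.put(pdf)                                    # blocks when the queue is full
    for _ in procs:
        jobs.put(None)
    converted = dict(results.get() for _ in pdfs)
    for p in procs:
        p.join()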
For me, it runs 12 times in parallel on 16 GB of VRAM:
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = accelerator_options  # AcceleratorOptions defined elsewhere
pipeline_options.do_ocr = True  # or False, I don't remember
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
So if you have 100 different PDFs, it is at least 10 times faster than with one Docling process on one core ... It's a pity that since I set up a new environment, it doesn't work on the GPU at all ... Any hints? Minimum Python version? Minimum/maximum CUDA version? Any other imports needed?
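For reference, this is roughly how the options above get wired into a converter and pinned to CUDA, following the Docling docs; the exact import path for the accelerator options differs between Docling versions, so treat this as a sketch:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (  # newer versions: docling.datamodel.accelerator_options
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

accelerator_options = AcceleratorOptions(num_threads=8, device=AcceleratorDevice.CUDA)

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = accelerator_options
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("example.pdf")  # the startup logs should confirm CUDA is being used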