Christoph Auer comments

Results: 170 comments by Christoph Auer

Docling 2.10.0: Performance Degradation When Reading Large PDF Files

@langzichai Several improvements, especially for GPU acceleration and layout processing, have been released since you last reported. Would you mind checking again with docling==2.14.0?
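Before re-testing, it can help to confirm which docling version is actually installed in the environment. The following is a minimal sketch using only the Python standard library. The helper names `parse_version` and `is_at_least` are hypothetical, and the comparison assumes plain three-part numeric versions such as `2.14.0`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.14.0' into (2, 14, 0). Only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(installed: str, required: str) -> bool:
    """Return True if the installed version is the required one or newer."""
    return parse_version(installed) >= parse_version(required)


try:
    installed = version("docling")  # reads the installed package metadata
except PackageNotFoundError:
    installed = None

if installed is None:
    print("docling is not installed in this environment")
else:
    try:
        print(f"docling {installed}, at least 2.14.0: {is_at_least(installed, '2.14.0')}")
    except ValueError:
        # Pre-release or otherwise non-numeric versions are outside this sketch.
        print(f"docling {installed} has a version string this sketch cannot compare")
```

If the reported version is older than 2.14.0, upgrading first makes the re-test comparable with the release mentioned above.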

Docling 2.10.0: Performance Degradation When Reading Large PDF Files

I am concluding that this is no longer a concern with the recent docling versions. Please feel free to re-open if you have new evidence of slow-downs. Thanks.

docling parsing for scanned pdfs wont detect white space between words.

@hisan-ideamaker This is likely a limitation of RapidOCR's performance on English/Latin-script material with the PP-OCR v5 models. You have the choice of going back to EasyOCR which was the previous...

Sentence-level bounding boxes are too large - include entire paragraphs instead of chunk text

@sebihoefle The bounding boxes docling infers for elements on a page are paragraph-scoped for text. If a chunk is created with a subset of a paragraph (e.g. sentence level), it...
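To illustrate why a sentence-level chunk can end up with a paragraph-sized box, here is a minimal sketch. It does not use docling's data model: `BBox`, `Paragraph`, and `bbox_for_chunk` are hypothetical names, and the behaviour (a chunk inherits the box of the paragraph it was cut from) is an assumption based on the issue title and the paragraph-scoped boxes described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class Paragraph:
    text: str
    bbox: BBox  # one box for the whole paragraph, no per-sentence geometry


def bbox_for_chunk(paragraph: Paragraph, chunk_text: str) -> BBox:
    """Return the only box available for a chunk cut from this paragraph."""
    if chunk_text not in paragraph.text:
        raise ValueError("chunk text is not part of the paragraph")
    # No finer-grained geometry is stored, so the chunk gets the paragraph box.
    return paragraph.bbox


paragraph = Paragraph(
    text="First sentence. Second sentence.",
    bbox=BBox(left=72.0, top=100.0, right=540.0, bottom=160.0),
)
sentence_box = bbox_for_chunk(paragraph, "Second sentence.")
print(sentence_box == paragraph.bbox)
```

Under this assumption, tighter sentence boxes would need geometry stored below the paragraph level, which this sketch deliberately does not model.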

What's your definition of SOON re: Metadata extraction, including title, authors, references & language

You can test the docling extraction pipeline for this: https://docling-project.github.io/docling/examples/extraction/

Image formats not generating picture descriptions, only OCR text extraction

@JViktoRArtola the main reason you see this is that "full-page pictures" are mostly classified as background art. The picture description works if the picture is embedded in a natural context...

Usage of force_full_page_ocr breaks with larger documents

@dghoffra can you please provide more details to reproduce this? I would like to know the exact settings and to have an input PDF that exposes the problem.

chore: change options pydantic schema to base options

@simjak I agree we need a fix for RapidOcr, but I would like to have `RapidOcrOptions` in the Union instead. I think it is necessary for discovery of legal CLI...

chore: change options pydantic schema to base options

@simjak we will close this in favour of https://github.com/DS4SD/docling/pull/544 which includes a fix.

docling is not providing better results for arabic language

@mudassir206 So far, we have taken care of the correct representation of Arabic script from _digital PDF text_. For embedded bitmaps (e.g. scanned pages) we currently depend on the capabilities of...