Usage of force_full_page_ocr breaks with larger documents

Open dghoffra opened this issue 7 months ago • 2 comments

Bug

For longer documents with a corrupted text layer, the 'force_full_page_ocr' parameter stops taking effect after a few pages: the corrupted text layer continues to be extracted instead of the pages being OCRed. Tested with EasyOCR (and Tesseract).

Steps to reproduce

Convert a longer PDF (10 pages or more) with a corrupt text layer.
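For reference, a minimal sketch of the kind of setup these steps describe, assuming EasyOCR with default pipeline settings apart from the OCR options. The exact settings used in the report are not stated, and the helper name `build_converter` and the input file name are hypothetical.

```python
def build_converter():
    """Return a DocumentConverter configured to OCR every PDF page in full."""
    # Imported lazily so this file can be loaded without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Ask the OCR engine to ignore the embedded text layer and OCR whole pages.
    pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Usage (hypothetical file name; assumed to be 10+ pages with a corrupt text layer):
# result = build_converter().convert("corrupt_text_layer.pdf")
# print(result.document.export_to_markdown())
```

Comparing the exported text of the first pages with that of later pages should show whether the embedded text layer starts leaking through again.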

Docling version

2.29.0 ...

Python version

3.11.12 / 3.13

Apr 30 '25 13:04 dghoffra

OCR only produces correct results if the PDF is first converted to PNGs with fpdf and then back to a PDF, which removes the corrupt text layer.
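A rough sketch of that kind of workaround, for illustration only. It uses PyMuPDF rather than the fpdf-based conversion mentioned above, because the original script is not shown; the function name `rasterize_pdf` and the 200 dpi default are assumptions.

```python
def rasterize_pdf(src_path, dst_path, dpi=200):
    """Rewrite a PDF as image-only pages, dropping any embedded text layer."""
    # Imported lazily so this file can be loaded without PyMuPDF installed.
    import fitz  # PyMuPDF

    src = fitz.open(src_path)
    dst = fitz.open()  # new, empty PDF
    for page in src:
        # Render the page to a raster image; the text layer is not carried over.
        pix = page.get_pixmap(dpi=dpi)
        new_page = dst.new_page(width=page.rect.width, height=page.rect.height)
        new_page.insert_image(new_page.rect, pixmap=pix)
    dst.save(dst_path)
    dst.close()
    src.close()
    return dst_path


# Usage (hypothetical file names):
# rasterize_pdf("corrupt_text_layer.pdf", "image_only.pdf")
```

The resulting file contains only page images, so any text in the converted output has to come from OCR.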

Apr 30 '25 14:04 dghoffra

@dghoffra can you please provide more details to reproduce this? I would like to understand the exact settings and an input PDF which exposes the problem.

May 21 '25 12:05 cau-git