docling
Usage of force_full_page_ocr breaks with larger documents
Bug
For longer documents with a corrupted text layer, the 'force_full_page_ocr' parameter stops taking effect after a few pages: the corrupted text layer continues to be extracted instead of the pages being OCRed. Tested with EasyOCR (and Tesseract).
Steps to reproduce
Convert a longer PDF (10 pages or more) that has a corrupt text layer.
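The exact settings were not posted, so the following is only a minimal sketch of a setup that exercises the parameter. It assumes the EasyOCR backend with otherwise default pipeline options; `build_converter` is a hypothetical helper name and the input path is taken from the command line.

```python
import sys


def build_converter(force_full_page_ocr: bool = True):
    """Return a DocumentConverter whose PDF pipeline requests full-page OCR."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Assumption: EasyOCR with default settings apart from force_full_page_ocr.
    pipeline_options.ocr_options = EasyOcrOptions(
        force_full_page_ocr=force_full_page_ocr
    )

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of a long PDF with a corrupt text layer as the first argument.
    result = build_converter().convert(sys.argv[1])
    print(result.document.export_to_markdown())
```

Comparing the exported text of the first few pages with that of later pages should show where the corrupt text layer starts to reappear.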
Docling version
2.29.0 ...
Python version
3.11.12 / 3.13
OCR only works if the PDF is first converted to PNGs with fpdf and then back to a PDF, which removes the corrupt text layer before conversion.
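The workaround script itself was not posted. A rough sketch of the same idea, rendering every page to an image and writing an image-only PDF, could look like this. Note that it swaps in pypdfium2 for rendering and Pillow for writing the PDF instead of fpdf; `rasterize_pdf` and the 300 dpi default are assumptions made for illustration.

```python
import sys


def rasterize_pdf(src_path: str, dst_path: str, dpi: int = 300) -> int:
    """Write an image-only copy of src_path to dst_path; return the page count."""
    # Imported lazily so this module still loads without pypdfium2 installed.
    # Pillow must also be installed, since to_pil() returns PIL images.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(src_path)
    images = []
    for index in range(len(pdf)):
        # PDF user space is 72 units per inch, so dpi / 72 gives the render scale.
        bitmap = pdf[index].render(scale=dpi / 72)
        images.append(bitmap.to_pil().convert("RGB"))

    if not images:
        raise ValueError(f"{src_path} contains no pages")

    # Pillow writes a multi-page PDF made of images only, with no text layer.
    images[0].save(
        dst_path,
        "PDF",
        save_all=True,
        append_images=images[1:],
        resolution=float(dpi),
    )
    return len(images)


if __name__ == "__main__" and len(sys.argv) > 2:
    pages = rasterize_pdf(sys.argv[1], sys.argv[2])
    print(f"Rasterized {pages} pages")
```

Because the output contains no text layer at all, the converter has nothing to extract except what OCR produces, which matches the behaviour described above.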
@dghoffra can you please provide more details to reproduce this? I would like to understand the exact settings and an input PDF which exposes the problem.