OCR on pages 2+ is only recognized in browsers but not by poppler/etree unless done in two steps

Open jribault opened this issue 1 month ago • 0 comments

I have a PDF whose first page is already OCRed and which has an OCRed footer on every page. When I run ocrmypdf with --force-ocr restricted to pages 2-99999999, the resulting PDF has selectable text on all pages in a browser, but when I process it in Python (using a library based on poppler and etree), only the text from page 1 is accessible.
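
A per-page check along the following lines may help confirm the symptom independently of any particular library. This is only a minimal sketch, not the code from the pipeline described above: it assumes poppler's `pdftotext` command-line tool is installed and relies on `pdftotext` ending each page with a form feed character. The function names `extract_text`, `split_pages`, and `pages_without_text` are hypothetical.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run poppler's pdftotext (assumed to be on PATH) and return its stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout


def split_pages(pdftotext_output: str) -> list[str]:
    """Split pdftotext output into one string per page.

    Each page ends with a form feed, so the piece after the final
    form feed is dropped when it is empty.
    """
    pages = pdftotext_output.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pages_without_text(pdftotext_output: str) -> list[int]:
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [
        number
        for number, text in enumerate(split_pages(pdftotext_output), start=1)
        if not text.strip()
    ]
```

Calling `pages_without_text(extract_text(path))` on the problem file and on the two-step result should show whether poppler sees the two files differently. Because the footer is OCRed on every page, comparing the per-page strings from `split_pages` may be more telling than checking for blank pages alone.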

VN---3335Y_Y.fs.pdf

ocrmypdf --pages 2-99999999 --tesseract-timeout 3600 --oversample 300 --clean -l vie+eng --force-ocr \
    VN/VN---3335Y_Y.fs.pdf \
    VNsearchable/VN---3335Y_Y.fs.pdf

VN---3335Y_Y.fs-problem.pdf

Workaround

Remove the OCR layer on pages 2+ and produce an intermediate PDF:

ocrmypdf --pages 2-99999999 --tesseract-timeout 0 --oversample 300 --clean -l vie+eng --force-ocr \
    VN/VN---3335Y_Y.fs.pdf \
    VNsearchable/VN---3335Y_Y.fs-step1.pdf

VN---3335Y_Y.fs-step1.pdf

Then run with --skip-text on the intermediate file:

ocrmypdf --tesseract-timeout 3600 --oversample 300 --clean -l vie+eng --skip-text \
    VNsearchable/VN---3335Y_Y.fs-step1.pdf \
    VNsearchable/VN---3335Y_Y.pdf

VN---3335Y_Y.fs.pdf

Now the text on every page (including page 2 onwards) is fully accessible via poppler + etree.
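
The two workaround steps can be chained in a small script. The following is a sketch under assumptions: the ocrmypdf flags are copied unchanged from the commands above, while the helper names `workaround_commands` and `run_workaround` and the choice to pass the paths as arguments are illustrative and not part of ocrmypdf.

```python
import subprocess

# Options shared by both ocrmypdf runs, copied from the commands above.
COMMON_OPTIONS = ["--oversample", "300", "--clean", "-l", "vie+eng"]


def workaround_commands(source: str, intermediate: str, output: str) -> list[list[str]]:
    """Build the two ocrmypdf invocations of the workaround (hypothetical helper)."""
    # Step 1: --force-ocr with a zero Tesseract timeout on pages 2+,
    # which the issue uses to drop the existing OCR layer on those pages.
    strip_pages = [
        "ocrmypdf", "--pages", "2-99999999", "--tesseract-timeout", "0",
        *COMMON_OPTIONS, "--force-ocr", source, intermediate,
    ]
    # Step 2: OCR the intermediate file, skipping pages that already have text.
    ocr_missing = [
        "ocrmypdf", "--tesseract-timeout", "3600",
        *COMMON_OPTIONS, "--skip-text", intermediate, output,
    ]
    return [strip_pages, ocr_missing]


def run_workaround(source: str, intermediate: str, output: str) -> None:
    """Run both steps in order, stopping if either command fails."""
    for command in workaround_commands(source, intermediate, output):
        subprocess.run(command, check=True)
```

Calling `run_workaround("VN/VN---3335Y_Y.fs.pdf", "VNsearchable/VN---3335Y_Y.fs-step1.pdf", "VNsearchable/VN---3335Y_Y.pdf")` would mirror the two commands above.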

Expected Behavior

Forcing OCR on pages 2+ in one command should yield the same PDF as doing it in two steps (first removing the OCR layer on pages 2+, then re-running OCR with --skip-text).

If I’ve misunderstood anything or missed an important detail, please let me know. I really appreciate your help in troubleshooting this!

jribault, Jan 14 '25 14:01