OCRmyPDF
OCR on pages 2+ is only recognized in browsers but not by poppler/etree unless done in two steps
When I run ocrmypdf on a PDF whose first page is already OCRed (the footer on every page is OCRed too), I try to force OCR on pages 2–99999999. The resulting PDF has selectable text on all pages in a browser, but when I process it in Python (using a library based on poppler and etree), only the text from page 1 is accessible.
ocrmypdf --pages 2-99999999 --tesseract-timeout 3600 --oversample 300 --clean -l vie+eng --force-ocr \
VN/VN---3335Y_Y.fs.pdf \
VNsearchable/VN---3335Y_Y.fs.pdf
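To make the symptom easy to reproduce, here is a minimal sketch that reports which pages expose any text to poppler. It assumes poppler's `pdftotext` command-line tool is installed and relies on it ending each page with a form feed character; the issue does not name the Python library actually used, so `extract_text` and `pages_with_text` are hypothetical helper names and this is only a stand-in for that library.

```python
import subprocess


def pages_with_text(raw: str) -> list[int]:
    """Return 1-based page numbers whose extracted text is non-empty.

    Assumes `raw` is plain-text output in which every page is terminated
    by a form feed character, as pdftotext emits.
    """
    pages = raw.split("\f")
    # The form feed after the last page leaves an empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [number for number, text in enumerate(pages, start=1) if text.strip()]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on `pdf_path` and return its output ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout
```

Calling `pages_with_text(extract_text("VNsearchable/VN---3335Y_Y.fs.pdf"))` on the output of the single command and on the output of the two-step workaround should show whether the two files really differ at the text-extraction level.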
Workaround
Remove the OCR layer on pages 2+ and produce an intermediate PDF:
ocrmypdf --pages 2-99999999 --tesseract-timeout 0 --oversample 300 --clean -l vie+eng --force-ocr \
VN/VN---3335Y_Y.fs.pdf \
VNsearchable/VN---3335Y_Y.fs-step1.pdf
Then run with --skip-text on the intermediate file:
ocrmypdf --tesseract-timeout 3600 --oversample 300 --clean -l vie+eng --skip-text \
VNsearchable/VN---3335Y_Y.fs-step1.pdf \
VNsearchable/VN---3335Y_Y.pdf
Now the text on every page, including pages 2 onwards, is accessible via poppler + etree.
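For batch processing, the two-step workaround can be wrapped in a small script. This is only a sketch: the flags are the ones from the two commands above, while `two_step_commands` and `run_two_step` are hypothetical helper names, and it assumes `ocrmypdf` is on the `PATH`.

```python
import subprocess

# Options shared by both steps, taken from the commands above.
COMMON = ["--oversample", "300", "--clean", "-l", "vie+eng"]


def two_step_commands(src, intermediate, dst, pages="2-99999999"):
    """Build the argument lists for both steps without running anything."""
    # Step 1: --force-ocr with a zero Tesseract timeout on the selected pages,
    # used here to drop their existing OCR layer.
    step1 = [
        "ocrmypdf", "--pages", pages, "--tesseract-timeout", "0",
        *COMMON, "--force-ocr", src, intermediate,
    ]
    # Step 2: OCR the intermediate file, skipping pages that already have text.
    step2 = [
        "ocrmypdf", "--tesseract-timeout", "3600",
        *COMMON, "--skip-text", intermediate, dst,
    ]
    return step1, step2


def run_two_step(src, intermediate, dst):
    """Run both steps in order, stopping at the first failure."""
    for command in two_step_commands(src, intermediate, dst):
        subprocess.run(command, check=True)
```

Passing argument lists instead of a shell string avoids quoting problems with file names such as `VN---3335Y_Y.fs.pdf`.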
Expected Behavior
Forcing OCR on pages 2+ in one command should yield the same PDF as doing it in two steps (first removing the OCR layer on pages 2+ then re-running OCR with --skip-text).
If I’ve misunderstood anything or missed an important detail, please let me know. I really appreciate your help in troubleshooting this!