[Bug]: OCR Output Quality Regression on Ubuntu 24.04

Open guilhermebferreira opened this issue 1 year ago • 3 comments

What were you trying to do?

After upgrading my container's base image from Ubuntu 22.04 to Ubuntu 24.04, I started experiencing minor but consistent issues with the OCR output generated by ocrmypdf.

I have a suite of unit tests that exercise the OCR step. They had been stable for some time, but some of them started failing after the upgrade. Each test takes a PDF file as input and compares the result with an expected output.

My requirements.txt file looks like this:


cffi==1.17.1
charset-normalizer==3.4.0
cryptography==44.0.0
deprecated==1.2.15
deprecation==2.1.0
grpcio==1.68.1
img2pdf==0.5.1
lxml==5.3.0
markdown-it-py==3.0.0
mdurl==0.1.2
ocrmypdf==16.6.2
packaging==24.2
pdfminer-six==20240706
pi-heif==0.21.0
pikepdf==9.4.2
pillow==11.0.0
pluggy==1.5.0
protobuf==3.20.3
pycparser==2.22
pygments==2.18.0
rich==13.9.4
typing-extensions==4.12.2
wrapt==1.17.0

And I'm running ocrmypdf with the following params:

ocrmypdf {inputpdf} {outputpdf} --force-ocr --pages 1,2 --optimize 0 --tesseract-pagesegmode 6 --pdf-renderer 'hocr' --sidecar {outputtxt}
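
For anyone reproducing this from a test harness, here is a minimal sketch of the same invocation built as an argument list. The function name and the file names are placeholders for illustration, not part of the original setup; only the flags are taken from the command above.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt):
    """Return the argv list for the ocrmypdf call used in this report.

    The function name and parameters are hypothetical; the flags mirror
    the command line quoted in the issue.
    """
    return [
        "ocrmypdf",
        input_pdf,
        output_pdf,
        "--force-ocr",
        "--pages", "1,2",
        "--optimize", "0",
        "--tesseract-pagesegmode", "6",
        "--pdf-renderer", "hocr",
        "--sidecar", sidecar_txt,
    ]


def as_shell_string(argv):
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

Passing the list to `subprocess.run(..., check=True)` avoids shell quoting issues with file names that contain spaces.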

Environment Details:

  • OS: Ubuntu 24.04
  • System Packages:
    • Ghostscript: 10.02.1
    • Tesseract: 5.3.4
    • pngquant: 2.18.0
    • Unpaper: 7.0.0
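
To compare the two base images side by side, the tool versions can be captured from inside each container. The following is a rough sketch, assuming each tool is on `PATH` and accepts `--version`; the helper names are made up for illustration.

```python
import shutil
import subprocess

# Executable names are assumptions about how the tools are installed.
TOOLS = ["gs", "tesseract", "pngquant", "unpaper"]


def tool_version(executable):
    """Return the first line of `<executable> --version`, or None if unavailable."""
    if shutil.which(executable) is None:
        return None
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some tools print their version on stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def collect_versions(tools=TOOLS):
    """Map each tool name to its reported version line (or None)."""
    return {name: tool_version(name) for name in tools}
```

Running `collect_versions()` in both the 22.04 and 24.04 images and diffing the results shows which system package changed.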

Expected Behavior:

OCR output should match the behavior observed when using Ubuntu 22.04, producing accurate and consistent text output.

Observed Behavior:

  • Misrecognized characters (e.g., "p" becomes "o").
  • Additional spaces introduced in the OCR output.

Additional Information:

The same setup works perfectly when using Ubuntu 22.04.
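
The two symptoms can be told apart in a failing test by comparing the texts with all whitespace removed. This is only an illustrative sketch; the function name and the category labels are invented here and are not part of the actual test suite.

```python
def classify_mismatch(expected, actual):
    """Classify how an OCR result differs from the expected text.

    Returns "match", "spacing-only" (same characters, different whitespace),
    or "character-level" (at least one character differs).
    """
    if expected == actual:
        return "match"
    # str.split() with no arguments drops every run of whitespace.
    if "".join(expected.split()) == "".join(actual.split()):
        return "spacing-only"
    return "character-level"
```

For example, `classify_mismatch("properly", "pr oper l y")` reports a spacing-only difference, while a "p" read as "o" is reported as character-level.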

Where are you installing/running from?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.6.2

What operating system are you working on?

Linux

Operating system details and version

Ubuntu 24.04

Simple sanity checks

  • [X] Operating system is currently supported by its vendor (not end of life)
  • [X] Python version is compatible with OCRmyPDF
  • [X] This issue is not about a specific input file

Relevant log output

No response

guilhermebferreira avatar Dec 02 '24 23:12 guilhermebferreira

I did some experiments. The main difference is that Ubuntu 24.04 ships Ghostscript 10.x, while 22.04 ships 9.55. Ghostscript's PDF handling went through a major rewrite between 9.x and 10.x, and from an OCR perspective the new version is significantly lower in quality: 10.x produces output that most PDF viewers will see as extra word breaks in the middle of words.

In a challenging test document, Ghostscript 10 produces "Al l f i xt ur es and har dwar e wi l l be pr oper l y and s ecur el y i ns t al l ed." (3 words identified correctly), while Ghostscript 9 produces "All fi xtures and h ardware wi ll be properly and securely i nstalled." (6 words identified correctly, still not great).
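
The word counts above can be reproduced with a simple bag-of-words comparison. This is a sketch under stated assumptions: the reference sentence is inferred from the two outputs and is not quoted from the test document, and the scoring ignores word order.

```python
import string
from collections import Counter


def words(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    stripped = (token.strip(string.punctuation).lower() for token in text.split())
    return [token for token in stripped if token]


def correct_word_count(reference, ocr_text):
    """Count reference words that also appear in the OCR text (multiset overlap)."""
    overlap = Counter(words(reference)) & Counter(words(ocr_text))
    return sum(overlap.values())


# Assumed ground truth, reconstructed from the two OCR outputs.
REFERENCE = "All fixtures and hardware will be properly and securely installed."
GS10 = "Al l f i xt ur es and har dwar e wi l l be pr oper l y and s ecur el y i ns t al l ed."
GS9 = "All fi xtures and h ardware wi ll be properly and securely i nstalled."

gs10_score = correct_word_count(REFERENCE, GS10)  # 3
gs9_score = correct_word_count(REFERENCE, GS9)    # 6
```

With whitespace removed, the Ghostscript 10 output is character-for-character identical to the assumed reference, which fits the description of extra word breaks rather than misread glyphs.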

jbarlow83 avatar Dec 05 '24 10:12 jbarlow83

@stumpylog I think paperless-ngx should consider pinning Ghostscript 9, based on my findings so far.
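
A test suite or startup check could flag the affected versions by looking at the major version reported by `gs --version`. This is a hypothetical sketch: the helper names are invented, and the threshold of 10 is taken only from the observations above.

```python
def ghostscript_major(version_string):
    """Extract the major version from a string such as '10.02.1' or '9.55.0'."""
    return int(version_string.strip().split(".")[0])


def is_pre_rewrite(version_string):
    """True for Ghostscript 9.x and earlier, i.e. before the 10.x PDF handling rewrite."""
    return ghostscript_major(version_string) < 10
```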

jbarlow83 avatar Dec 06 '24 21:12 jbarlow83

We are facing the same problem on Debian: lots of additional spaces. Any solutions yet?

yinghui-wang avatar Sep 01 '25 01:09 yinghui-wang