[Bug]: OCR Output Quality Regression on Ubuntu 24.04
What were you trying to do?
After upgrading my container's base image from Ubuntu 22.04 to Ubuntu 24.04, I started experiencing minor but consistent issues with the OCR output generated by ocrmypdf.
I have a suite of unit tests that exercise the OCR and have been stable for some time, but some of them started failing after the upgrade. Each test takes a PDF file as input and compares the result with an expected output.
My requirements.txt file is as follows:
cffi==1.17.1
charset-normalizer==3.4.0
cryptography==44.0.0
deprecated==1.2.15
deprecation==2.1.0
grpcio==1.68.1
img2pdf==0.5.1
lxml==5.3.0
markdown-it-py==3.0.0
mdurl==0.1.2
ocrmypdf==16.6.2
packaging==24.2
pdfminer-six==20240706
pi-heif==0.21.0
pikepdf==9.4.2
pillow==11.0.0
pluggy==1.5.0
protobuf==3.20.3
pycparser==2.22
pygments==2.18.0
rich==13.9.4
typing-extensions==4.12.2
wrapt==1.17.0
And I'm running ocrmypdf with the following params:
ocrmypdf {inputpdf} {outputpdf} --force-ocr --pages 1,2 --optimize 0 --tesseract-pagesegmode 6 --pdf-renderer 'hocr' --sidecar {outputtxt}
Environment Details:
- OS: Ubuntu 24.04
- System Packages:
- Ghostscript: 10.02.1
- Tesseract: 5.3.4
- pngquant: 2.18.0
- Unpaper: 7.0.0
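To compare the two base images side by side, the versions above can be collected with a small script run in each container. This is only a sketch: the executable names (`gs`, `tesseract`, `pngquant`, `unpaper`) and the use of `--version` are assumptions about how these tools are packaged, and the helper names are made up for illustration.

```python
import shutil
import subprocess

# Assumed executable names for the system packages listed above.
TOOLS = ("gs", "tesseract", "pngquant", "unpaper")


def tool_version(executable, timeout=10):
    """Return the first line printed by `<executable> --version`, or None."""
    if shutil.which(executable) is None:
        return None  # tool is not installed or not on PATH
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some tools print their version to stderr, so inspect both streams.
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else None


def collect_versions(tools=TOOLS):
    """Map each tool name to its reported version string (or None)."""
    return {name: tool_version(name) for name in tools}


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not found'}")
```

Running it in both the 22.04 and the 24.04 image and diffing the output should show which system packages changed between the two.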
Expected Behavior:
OCR output should match the behavior observed when using Ubuntu 22.04, producing accurate and consistent text output.
Observed Behavior:
- Misrecognized characters (e.g., "p" becomes "o").
- Additional spaces introduced in the OCR output.
Additional Information:
The same setup works perfectly when using Ubuntu 22.04.
Where are you installing/running from?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
16.6.2
What operating system are you working on?
Linux
Operating system details and version
Ubuntu 24.04
Simple sanity checks
- [X] Operating system is currently supported by its vendor (not end of life)
- [X] Python version is compatible with OCRmyPDF
- [X] This issue is not about a specific input file
Relevant log output
No response
I did some experiments. The main difference is that Ubuntu 24.04 provides a different version of Ghostscript, 10.x, while 22.04 provides 9.55. There was a major rewrite of PDF handling between 9.x and 10.x in Ghostscript, and the new version gives significantly lower-quality results from an OCR perspective: 10.x produces output that most PDF viewers will interpret as having extra word breaks in the middle of words.
In a challenging test document, Ghostscript 10 produces
"Al l f i xt ur es and har dwar e wi l l be pr oper l y and s ecur el y i ns t al l ed." (3 words identified correctly)
while Ghostscript 9 produces
"All fi xtures and h ardware wi ll be properly and securely i nstalled." (6 words identified correctly, still not great)
@stumpylog I think paperless-ngx should consider pinning Ghostscript 9, based on my findings so far.
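For anyone who wants to reproduce the "words identified correctly" figures on their own documents, here is a rough sketch of one way to count them. It assumes the intended sentence is "All fixtures and hardware will be properly and securely installed." (inferred from the two outputs, not taken from the source document), and it counts a word as correct only if it appears as a whole whitespace-separated token. The function name is hypothetical.

```python
from collections import Counter


def count_correct_words(expected, recognized):
    """Count expected words that also appear as whole tokens in the OCR text.

    Uses a multiset intersection, so a word that occurs twice in the
    expected text is counted twice only if it was recognized twice.
    """
    expected_words = Counter(expected.split())
    recognized_words = Counter(recognized.split())
    return sum((expected_words & recognized_words).values())


# Assumed ground truth, inferred from the two OCR outputs quoted above.
EXPECTED = "All fixtures and hardware will be properly and securely installed."

GS10 = ("Al l f i xt ur es and har dwar e wi l l be pr oper l y "
        "and s ecur el y i ns t al l ed.")
GS9 = ("All fi xtures and h ardware wi ll be properly "
       "and securely i nstalled.")

if __name__ == "__main__":
    print("Ghostscript 10:", count_correct_words(EXPECTED, GS10))
    print("Ghostscript 9: ", count_correct_words(EXPECTED, GS9))
```

With the assumed ground truth this yields 3 for the Ghostscript 10 output and 6 for the Ghostscript 9 output, matching the counts quoted above. The comparison is case-sensitive and keeps punctuation attached to words, so it is a crude metric rather than a proper word error rate.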
We are facing the same problem on Debian: lots of additional spaces. Are there any solutions yet?