[Bug]: Existing text is completely replaced with other characters

Open david-sledge opened this issue 4 weeks ago • 3 comments

Describe the bug

Certain PDFs that already contain text come out with that text replaced by other characters, which renders them unreadable. This happens with both the --redo-ocr and --skip-text flags. Attached are (a) a sample PDF, (b) the results of it being OCRed, and (c) a tarball containing everything needed to reproduce the issue.

Steps to reproduce

1. Download the tarball to a Linux machine with Docker installed.
2. Run the following command chain: tar -xzf bad-pdf-example.tar.gz && cd bad-pdf-example && docker run --rm -v .:/root/test-files -it $(docker build -q -t ocrmypdf-test .) && docker rmi ocrmypdf-test:latest
3. Open test-redo-ocr-result.pdf and test-skip-text-result.pdf
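For anyone who would rather skip Docker, the two runs can probably be reproduced against a local OCRmyPDF install. This is only a sketch: it assumes the container does nothing more than call ocrmypdf once per flag on test.pdf (the Dockerfile contents are not shown here), and the build_commands helper is a made-up name for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Flag -> output file name, matching the result files named in this report.
MODES = {
    "--skip-text": "test-skip-text-result.pdf",
    "--redo-ocr": "test-redo-ocr-result.pdf",
}


def build_commands(input_pdf: str) -> list[list[str]]:
    """Return one ocrmypdf command line per flag that shows the problem.

    Assumption: plain `ocrmypdf FLAG input output` with default settings.
    """
    return [["ocrmypdf", flag, input_pdf, output] for flag, output in MODES.items()]


def main() -> None:
    # Do nothing if the tool or the sample file is missing.
    if shutil.which("ocrmypdf") is None or not Path("test.pdf").exists():
        print("ocrmypdf or test.pdf not found; nothing to run")
        return
    for command in build_commands("test.pdf"):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)


if __name__ == "__main__":
    main()
```

Run it from the directory that contains test.pdf. The two result files are written next to it.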

Files

test.pdf test-redo-ocr-result.pdf test-skip-text-result.pdf bad-pdf-example.tar.gz
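The text replacement can also be checked without opening a viewer by comparing the extracted text of the input and output files. The sketch below assumes poppler's pdftotext is installed and on the PATH; extract_text and similarity are illustrative helper names, not part of OCRmyPDF.

```python
import difflib
import subprocess


def extract_text(pdf_path: str) -> str:
    """Extract the text layer with poppler's pdftotext (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def similarity(original: str, processed: str) -> float:
    """Return a ratio between 0 and 1; values near 1.0 mean the text survived."""
    # Collapse whitespace so layout differences do not count as text changes.
    a = " ".join(original.split())
    b = " ".join(processed.split())
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

For example, similarity(extract_text("test.pdf"), extract_text("test-skip-text-result.pdf")) should stay close to 1.0 if the existing text is preserved. A much lower value would be consistent with the replacement described above.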

How did you download and install the software?

Linux package manager (apt, dnf, etc.), Docker container

OCRmyPDF version

16.3.1

Relevant log output

tesseract 5.4.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.17
OCRmyPDF version:
16.3.1
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 skipping all processing on this page                                                                                                                                                                                      _pipeline.py:330
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1515.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)                                                                                                                                                                                           _common.py:441
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 redoing OCR                                                                                                                                                                                                               _pipeline.py:327
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1554.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)

david-sledge • Jun 18 '24 23:06