OCRmyPDF [Bug]: OCR on .pdf isn't the same as tesseract but the format is correct on .txt file

Open matsumurae opened this issue 1 year ago • 0 comments

Describe the bug

Reason for this issue

I've been trying to turn a lot of Japanese novels I have at home into searchable PDFs, which would make it easier to look up unknown kanji. However, generating a searchable PDF didn't go as expected.

The problem

I've tried different formats, but I'm unsure why the text is generated correctly as .txt but not as .pdf. I've attached both results: the Tesseract .txt and the OCRmyPDF .txt, plus the .pdf generated by OCRmyPDF.

As you can see, both .txt files are almost identical (I spotted one different kanji, but everything else looked the same). This isn't the case with the .pdf. When I copy text from it, spaces are added: そのまま男の両足がふわりと浮き上がり、彼の中で、世界がぐるりと回転した。

This doesn't happen with Apple's OCR on images, which gives the same result as the .txt file.
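One way to confirm that the text copied from the PDF differs from the sidecar .txt only by inserted whitespace would be a small script along these lines. This is a minimal sketch, not part of OCRmyPDF or Tesseract: the function names are made up for illustration, and the spaced sample string is a hypothetical stand-in for text pasted from output.pdf, not the actual copied output.

```python
def strip_whitespace(text: str) -> str:
    """Remove every Unicode whitespace character (ASCII space, newline, U+3000, ...)."""
    return "".join(ch for ch in text if not ch.isspace())


def differs_only_by_whitespace(reference: str, candidate: str) -> bool:
    """True when the strings are not equal but match once whitespace is removed."""
    return reference != candidate and strip_whitespace(reference) == strip_whitespace(candidate)


if __name__ == "__main__":
    # Text as it appears in the sidecar .txt file.
    sidecar = "世界がぐるりと回転した。"
    # Hypothetical example of the same text with spaces inserted on copy.
    copied = "世 界 が ぐ る り と 回 転 し た。"
    print(differs_only_by_whitespace(sidecar, copied))
```

In practice the two strings would come from output.txt and from text pasted out of output.pdf. If the check returns true, the recognized characters are the same and only the spacing of the PDF text layer differs.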

Notes about my computer

I'm using a 2020 MacBook Pro (i5, 16 GB RAM).
The OS is macOS Sonoma 14.3.
Tesseract is called from a terminal (Hyper in my case).
OCRmyPDF is called from Finder using macOS Shortcuts (configured with exactly the same command shown in the steps below).

Steps to reproduce

1. Run `tesseract 1.png out -l jpn_vert --psm 5 -c preserve_interword_spaces=1`
2. Run `ocrmypdf -l jpn_vert --tesseract-pagesegmode 5 --tesseract-config [path to config file containing preserve_interword_spaces 1] --sidecar output.txt test.pdf output.pdf`
3. Open out.txt (the one made with Tesseract).
4. Open output.txt (the one made with OCRmyPDF).
5. Open output.pdf and copy some text.

Files

`tesseract-config.cfg`

preserve_interword_spaces 1

Files to test:

test.pdf

Generated files:

out.txt > Tesseract output
output.txt > OCRmyPDF output
output.pdf

How did you download and install the software?

Homebrew

OCRmyPDF version

16.0.4

Relevant log output

No response

Feb 03 '24 11:02 matsumurae
