[Bug]: OCR on .pdf isn't the same as tesseract but the format is correct on .txt file

Open matsumurae opened this issue 1 year ago • 0 comments

Describe the bug

Reason for this issue

I've been trying to turn a lot of Japanese novels I have at home into searchable PDFs, which would make it easier to look up unknown kanji. However, generating a searchable PDF didn't turn out as expected.

The problem

I've tried different formats, but I'm unsure why the text is generated correctly as .txt but not as .pdf. I've attached both results, the Tesseract .txt and the OCRmyPDF .txt, along with the .pdf generated by OCRmyPDF.

As you can see, the two .txt files are almost identical (I spotted one different kanji, but everything else looked the same). This isn't the case with the .pdf. When I copy text from it, extra spaces are added: そのまま 男 の 両足 がふわりと 浮き上がり、 彼 の中で、 世 界がぐるりと回 転した。
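
To check whether the copied PDF text differs from the .txt output only by the inserted spaces, one option is to strip spaces that sit between Japanese characters before comparing. The following is a minimal Python sketch, not part of OCRmyPDF or Tesseract; the helper name `strip_cjk_spaces` and the chosen Unicode ranges are assumptions made for illustration, and it is a comparison aid rather than a fix for the PDF text layer.

```python
import re

# Assumed character ranges: CJK punctuation, hiragana, katakana,
# common CJK ideographs, and full-width forms.
_CJK = r"\u3001-\u303f\u3040-\u30ff\u4e00-\u9fff\uff01-\uffef"

# One or more ASCII spaces with a Japanese character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}]) +(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove ASCII spaces that appear between two Japanese characters.

    Spaces next to Latin letters or digits are left untouched.
    """
    return _SPACE_BETWEEN_CJK.sub("", text)


copied = "そのまま 男 の 両足 がふわりと 浮き上がり、 彼 の中で、 世 界がぐるりと回 転した。"
print(strip_cjk_spaces(copied))
```

If the cleaned string matches the corresponding line in the .txt output, the recognized characters are the same and only the spacing in the PDF text layer differs.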

This doesn't happen with Apple's OCR on images, whose output matches the .txt file.

Notes about my computer

  • I'm using a 2020 MacBook Pro (i5, 16 GB RAM).
  • The OS is macOS Sonoma 14.3.
  • Tesseract is called from a terminal (Hyper in my case).
  • OCRmyPDF is called from Finder using macOS Shortcuts (I've configured it with the exact same options).

Steps to reproduce

1. Run tesseract 1.png out -l jpn_vert --psm 5 -c preserve_interword_spaces=1
2. Run ocrmypdf -l jpn_vert --tesseract-pagesegmode 5 --tesseract-config [file to config with preserve_interword_spaces 1] --sidecar output.txt test.pdf output.pdf
3. Open out.txt (the one made with tesseract)
4. Now open output.txt (made with ocrmypdf)
5. Open output.pdf and copy some text.
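
Steps 3 and 4 can also be checked programmatically instead of by eye. The sketch below uses Python's standard `difflib` to list the characters that differ between two texts after all whitespace is removed. The function names `differing_chars` and `compare_files` are hypothetical helpers, and the file names in the usage comment are simply the ones from the steps above.

```python
import difflib
from pathlib import Path


def differing_chars(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (operation, text_in_a, text_in_b) for every non-equal span.

    All whitespace is dropped first so that only recognized characters
    are compared, not line breaks or spacing.
    """
    a_clean = "".join(a.split())
    b_clean = "".join(b.split())
    matcher = difflib.SequenceMatcher(None, a_clean, b_clean, autojunk=False)
    return [
        (tag, a_clean[i1:i2], b_clean[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def compare_files(path_a: str, path_b: str) -> list[tuple[str, str, str]]:
    """Read two UTF-8 text files and report their differing spans."""
    text_a = Path(path_a).read_text(encoding="utf-8")
    text_b = Path(path_b).read_text(encoding="utf-8")
    return differing_chars(text_a, text_b)


# Example usage with the files from the steps above:
# for op, left, right in compare_files("out.txt", "output.txt"):
#     print(op, repr(left), repr(right))
```

An empty result means the two files contain the same characters in the same order, ignoring whitespace.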

Files

tesseract-config.cfg

preserve_interword_spaces 1

Files to test:

1 test.pdf

Generated files:

out.txt > Tesseract output
output.txt > OCRmyPDF output
output.pdf

How did you download and install the software?

Homebrew

OCRmyPDF version

16.0.4

Relevant log output

No response

matsumurae, Feb 03 '24 11:02