OCRmyPDF
[Bug]: Text in the output .pdf isn't the same as tesseract's, but the sidecar .txt file is correct
Describe the bug
Reason for this issue
I've been trying to turn a lot of Japanese novels I have at home into searchable PDFs, which would make it easier to look up unknown kanji. However, generating a searchable PDF didn't go as expected.
The problem
I've tried different formats, but I'm unsure why the text is generated correctly as .txt but not as .pdf. I've attached both results: the tesseract .txt and the OCRmypdf .txt; there's also the .pdf generated by OCRmypdf.
As you can see, both .txt files are almost identical (I noticed one different kanji, but everything else looked the same). This isn't the case with the .pdf. When I copy text from it, spaces are added:
そのまま 男 の 両足 がふわりと 浮き上がり、 彼 の中で、 世 界がぐるりと回 転した。
This doesn't happen with Apple's OCR over images, which gives the same result as the .txt file.
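To check that the extra spaces are the only difference between the text copied from the PDF and the .txt output, the two strings can be compared with all whitespace removed. This is a minimal sketch: `copied_from_pdf` is the sample above, while `expected_text` is an assumption (the same sentence without spaces, standing in for the matching line of output.txt), and the helper names are hypothetical.

```python
def strip_whitespace(text: str) -> str:
    """Remove every whitespace character, including ideographic spaces (U+3000)."""
    # str.split() with no arguments splits on any Unicode whitespace run.
    return "".join(text.split())


def same_ignoring_whitespace(a: str, b: str) -> bool:
    """True when the two strings differ only in whitespace."""
    return strip_whitespace(a) == strip_whitespace(b)


# Text copied from output.pdf (sample from this report).
copied_from_pdf = "そのまま 男 の 両足 がふわりと 浮き上がり、 彼 の中で、 世 界がぐるりと回 転した。"

# Assumption: what the corresponding line of output.txt looks like.
expected_text = "そのまま男の両足がふわりと浮き上がり、彼の中で、世界がぐるりと回転した。"

if __name__ == "__main__":
    print(same_ignoring_whitespace(copied_from_pdf, expected_text))
    print(strip_whitespace(copied_from_pdf))
```

If the comparison holds for a whole page, the recognized characters are the same and only the spacing in the PDF text layer differs.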
Notes about my computer
- I'm using a 2020 MacBook Pro (i5, 16 GB RAM).
- OS is macOS Sonoma 14.3.
- Tesseract is called from a terminal (Hyper in my case).
- OCRmypdf is called from Finder using macOS Shortcuts (I've configured exactly the same command as in the steps to reproduce).
Steps to reproduce
1. Run tesseract 1.png out -l jpn_vert --psm 5 -c preserve_interword_spaces=1
2. Run ocrmypdf -l jpn_vert --tesseract-pagesegmode 5 --tesseract-config [path to a config file containing preserve_interword_spaces 1] --sidecar output.txt test.pdf output.pdf
3. Open out.txt (the one made with tesseract)
4. Now open output.txt (made with ocrmypdf)
5. Open output.pdf and copy some text.
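Steps 3 and 4 can also be done programmatically instead of by eye. The sketch below uses Python's standard `difflib` to list every region where two texts differ, which would surface a single differing kanji. The function name `char_differences` is hypothetical, and the file names are the ones from the steps above, assumed to be UTF-8 and in the current directory.

```python
import difflib
from pathlib import Path


def char_differences(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (operation, fragment_of_a, fragment_of_b) for each differing region."""
    # autojunk=False keeps frequently repeated characters from being ignored.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    tesseract_out = Path("out.txt")    # written by tesseract in step 1
    ocrmypdf_out = Path("output.txt")  # sidecar written by ocrmypdf in step 2
    if tesseract_out.exists() and ocrmypdf_out.exists():
        diffs = char_differences(
            tesseract_out.read_text(encoding="utf-8"),
            ocrmypdf_out.read_text(encoding="utf-8"),
        )
        for tag, left, right in diffs:
            print(tag, repr(left), repr(right))
```

An empty result means the two .txt files are identical; a short list of `replace` entries would match the "one different kanji" observation.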
Files
tesseract-config.cfg
preserve_interword_spaces 1
Files to test:
Generated files:
out.txt > Tesseract output
output.txt > OCRmypdf output
output.pdf
How did you download and install the software?
Homebrew
OCRmyPDF version
16.0.4
Relevant log output
No response