tesseract Recognize Japanese symbols in two screenshots

Open superbonaci opened this issue 1 year ago • 4 comments

Current Behavior

Recognize the symbols.

Expected Behavior

Recognize the symbols in these two screenshots. Original pictures from Dragon Ball episode 1:

[image: goku1]

[image: goku2]

After some perspective correction (maybe it helps?):

[image: goku1-ed]

[image: goku2-ed]

Suggested Fix

Recognize the symbols.

tesseract -v

tesseract 5.3.2
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.11 : libwebp 1.3.1 : libopenjp2 2.5.0
 Found NEON
 Found libarchive 3.6.2 zlib/1.2.11 liblzma/5.4.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.4
 Found libcurl/7.88.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.51.0

Operating System

macOS 13 Ventura

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

No response

Jul 19 '23 09:07 superbonaci

Tesseract's layout analysis was designed to deal with simple layouts of books, magazines, newspapers and documents.

If Tesseract fails to recognize an image entirely, or misses some areas of it, it is recommended to clean the image with a different tool first, so that the text is easier for Tesseract to recognize.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html

In your case, you should give Tesseract just the letters without the frame around them.
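As a rough sketch of that advice, the frame could be cropped away and the crop cleaned up before it is passed to Tesseract. This assumes ImageMagick 7 is installed and available as `magick` (it is not part of Tesseract); the helper name `crop_for_ocr` and the crop geometry are hypothetical and would need to be adjusted to the actual screenshot.

```shell
# Assumption: ImageMagick 7 provides the `magick` command.
MAGICK="${MAGICK:-magick}"

# Hypothetical helper: crop away the frame, then clean up the crop for OCR.
#   $1 = input image
#   $2 = output image
#   $3 = crop geometry WxH+X+Y (placeholder; measure it on your own image)
crop_for_ocr() {
  # -crop ... +repage : keep only the region with the characters
  # -colorspace Gray  : drop color information
  # -resize 300%      : enlarge small glyphs
  # -threshold 50%    : binarize to black and white
  # -border 20        : add a white margin around the text
  # If the text comes out light on a dark background, adding -negate
  # after -threshold would invert it.
  "$MAGICK" "$1" \
    -crop "$3" +repage \
    -colorspace Gray \
    -resize 300% \
    -threshold 50% \
    -bordercolor white -border 20 \
    "$2"
}
```

For example, `crop_for_ocr goku1-ed.png result.png 200x200+40+40` would write the cleaned crop to `result.png`; the file names and the geometry here are placeholders only.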

Mar 16 '24 22:03 amitdo

No result either with the improved picture:

[image: result]

% tesseract -l jpn result.png result.txt
Empty page!!
Empty page!!
% tesseract -l script/Japanese result.png result.txt
Empty page!!
Empty page!!

Mar 17 '24 12:03 superbonaci

Did you try with different psm values?
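One way to try this is to loop over a few page segmentation modes and print whatever each one returns. The sketch below is only an illustration: the helper name `try_psm_modes` and the choice of modes are assumptions, and `jpn` must be installed as a language pack.

```shell
TESSERACT="${TESSERACT:-tesseract}"

# Hypothetical helper: run Tesseract on one image with several --psm values.
#   $1 = image file
#   $2 = language (defaults to jpn)
try_psm_modes() {
  image="$1"
  lang="${2:-jpn}"
  # 5  = single uniform block of vertically aligned text
  # 6  = single uniform block of text
  # 7  = single text line
  # 8  = single word
  # 10 = single character
  # 13 = raw line
  for psm in 5 6 7 8 10 13; do
    echo "== --psm $psm =="
    # "stdout" makes Tesseract print the result instead of writing a file.
    "$TESSERACT" "$image" stdout -l "$lang" --psm "$psm"
  done
}
```

Calling `try_psm_modes result.png` runs all six modes on the cropped image. If the text is written vertically, passing `jpn_vert` as the second argument may be worth a try, assuming that traineddata file is installed.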

Mar 17 '24 12:03 amitdo

Still no luck, but Google Lens finds it fine: https://ja.wikipedia.org/wiki/%E5%80%92%E7%A6%8F

Mar 17 '24 15:03 superbonaci
