Vertical writing systems are not handled correctly in gImageReader

Open lhy7889678 opened this issue 1 year ago • 3 comments

Text in vertical writing systems can be OCRed (fairly) reliably with the tesseract command-line tool, but gImageReader produces garbled characters for the same images by default. Horizontal writing systems are not affected.

Here are some sample images (in chi_sim, jpn, chi_sim_vert, jpn_vert respectively):

chi_sim jpn chi_sim_vert jpn_vert

Here are the results using tesseract:

tesseract

(縦組み is not OCRed correctly, but that is not a big problem.)

Here is the result using gImageReader (taking jpn_vert as an example):

gimagereader

I noticed that after rotating the image 90° counterclockwise, the result is correct:

gimagereader_rot

(and 縦組み is OCRed correctly!)

This issue was already reported in Issue #552, but there it was mistakenly regarded as a bug in tessdata. Since the tesseract command-line tool handles these images correctly, it is definitely gImageReader's fault.

I'm using gImageReader 3.4.2 and tesseract 5.4.1 under Arch Linux, using the default tessdata provided by tesseract. I noticed that gImageReader says it is using tesseract 5.3.4 in the "About" dialog, so this might have something to do with the problem.

lhy7889678 avatar Sep 09 '24 07:09 lhy7889678

Does it help to select this page segmentation option:

Image

?

manisandro avatar Jul 09 '25 21:07 manisandro

Does it help to select this page segmentation option: [...]?

Yes, "Assume single block of vertically aligned text" works for me. On the other hand, both "Automatic page segmentation" and "Page segmentation with orientation and script detection" fail in this situation.
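For anyone who wants to reproduce the comparison outside the GUI, here is a minimal sketch that builds the equivalent tesseract command lines. The mapping from the gImageReader menu labels to `--psm` numbers is an assumption based on the descriptions printed by `tesseract --help-psm`, and `jpn_vert.png` is a hypothetical file name standing in for the sample image.

```python
import shlex
import subprocess

# Assumed mapping from the gImageReader menu labels quoted in this thread
# to tesseract's --psm numbers (descriptions as in `tesseract --help-psm`).
PSM_BY_LABEL = {
    "Page segmentation with orientation and script detection": 1,
    "Automatic page segmentation": 3,
    "Assume single block of vertically aligned text": 5,
}


def build_command(image_path, lang, label=None):
    """Return the tesseract argv; --psm is omitted when no label is given."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if label is not None:
        cmd += ["--psm", str(PSM_BY_LABEL[label])]
    return cmd


def run_ocr(image_path, lang, label=None):
    """Run tesseract (must be installed) and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang, label),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the commands only; "jpn_vert.png" is a hypothetical sample file.
    for label in [None, *PSM_BY_LABEL]:
        print(shlex.join(build_command("jpn_vert.png", "jpn_vert", label)))
```

Running the variant without `--psm` next to the three explicit modes should show whether the command-line tool's good result depends on a mode that gImageReader overrides.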

However, I did another experiment. When there are multiple lines, both "Automatic page segmentation" and "Page segmentation with orientation and script detection" work out of the box. Here is a sample image in Chinese:

Image

The tesseract command-line tool handles both the single-line and the multi-line case correctly without any tweaks (except for a few misrecognized characters).

PS: The sample text is a Chinese version of "Lorem ipsum". The original text is

劳仑衣普桑,认至将指点效则机,最你更枝。想极整月正进好志次回总般,段然取向使张规军证回,世市总李率英茄持伴。用阶千样响领交出,器程办管据家元写,名其直金团。化达书据始价算每百青,金低给天济办作照明,取路豆学丽适市确。如提单各样备再成农各政,设头律走克美技说没,体交才路此在杠。响育油命转处他住有,一须通给对非交矿今该,花象更面据压来。与花断第然调,很处己队音,程承明邮。常系单要外史按机速引也书,个此少管品务美直管战,子大标蠢主盯写族般本。农现离门亲事以响规,局观先示从开示,动和导便命复机李,办队呆等需杯。见何细线名必子适取米制近,内信时型系节新候节好当我,队农否志杏空适花。又我具料划每地,对算由那基高放,育天孝。派则指细流金义月无采列,走压看计和眼提问接,作半极水红素支花。

lhy7889678 avatar Jul 12 '25 10:07 lhy7889678

One would need to check the tesseract executable source code to figure out how it configures the engine.
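As a complement to reading the source, one could compare the engine parameters the command-line tool reports for a horizontal and a vertical language. The sketch below is only that: it assumes `tesseract --print-parameters -l LANG` prints one parameter per line as tab-separated name, value and description, which should be checked against the installed version. The helper names are hypothetical.

```python
import subprocess


def dump_parameters(lang):
    """Return tesseract's parameter dump for one language (tesseract must be installed).

    Assumption: `--print-parameters` can be combined with `-l LANG` like this.
    """
    result = subprocess.run(
        ["tesseract", "--print-parameters", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_parameters(dump):
    """Parse a dump into {name: value}; lines without a tab are skipped.

    Assumption: each parameter line is "name<TAB>value<TAB>description".
    """
    params = {}
    for line in dump.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0]:
            params[fields[0]] = fields[1]
    return params


def diff_parameters(dump_a, dump_b):
    """Return {name: (value_a, value_b)} for parameters whose values differ."""
    a = parse_parameters(dump_a)
    b = parse_parameters(dump_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    changed = diff_parameters(dump_parameters("jpn"), dump_parameters("jpn_vert"))
    for name, (horizontal, vertical) in changed.items():
        print(f"{name}: jpn={horizontal} jpn_vert={vertical}")
```

Any parameter that differs between `jpn` and `jpn_vert` would be a candidate for something the command-line tool picks up from the traineddata and gImageReader does not.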

manisandro avatar Jul 12 '25 18:07 manisandro