normcap icon indicating copy to clipboard operation
normcap copied to clipboard

Spaces between Chinese characters

Open inuxor opened this issue 3 years ago • 4 comments

Great job! But, We do not use a space between characters in Chinese. I guess the Japanese and some other Asian languages have the same problem.

inuxor avatar Dec 13 '21 11:12 inuxor

Hi @inuxor, thanks for comment! Could you please provide some more information? That would help me to fix this issue:

  1. Does this also apply to line-breaks and/or paragraphs? (How do sentences and paragraphs get separated?)
  2. Afaik there are two chinese languages for tesseract: chi_sim and chi_tra. Does the issue apply to both of them?
  3. Please provide a example screenshot of some medium-length text (where NormCap added spaces). The expected output would be super useful, too.

With that information, it should be pretty trivial for me to fix :-)

PS: It seems quite strange to me, that tesseract doesn't respect that by default. If anyone has insights about that, I'd appreciated it.

dynobo avatar Dec 15 '21 18:12 dynobo

Hi @inuxor, thanks for comment! Could you please provide some more information? That would help me to fix this issue:

1. Does this also apply to line-breaks and/or paragraphs? (How do sentences and paragraphs get separated?)

2. Afaik there are two chinese languages for tesseract: `chi_sim` and `chi_tra`. Does the issue apply to both of them?

3. Please provide a example screenshot of some medium-length text (where NormCap added spaces). The expected output would be super useful, too.

With that information, it should be pretty trivial for me to fix :-)

PS: It seems quite strange to me, that tesseract doesn't respect that by default. If anyone has insights about that, I'd appreciated it. Screenshot_20211217_192407 Screenshot_20211217_192733

傅平安看到母亲骂大街的时候哭了,我也泪目了,那是一种少年终于发现父母无力保护自己时的悲伤,少年踏上社会,才知道这个社会有多残酷无情,曾经像天一样的父母其实只是普通人,在滚滚红尘中他们自顾不暇,幼鹰出巢的那一刻起,能依靠的就只有自己的翅膀。

信 平 安 看 到 母 亲 骂 大 街 的 时 候 哭 了 , 英 那 是 一 种 少 年 终 于 发 现 父 母 无 力 保 护 自 己 时 的 悲 伤 , 少 年 踏 上 社 会 , 才 知 道 这 个 残 经 像 天 一 样 的 父 母 其 实 只 是 普 通 人 , 在 滚 滚 红 尘 中 他 们 自 顾 不 醒 , 幼 筑 出 巢 的 那 一 刻 起 , 能 依 靠 的 就 习 有 自 己 的 翅 脂

inuxor avatar Dec 17 '21 11:12 inuxor

This indeed seems to be caused by an open issue in tesseract: https://github.com/tesseract-ocr/tesseract/issues/2702

I'll try adding a simple heuristic to remove the superfluous spaces in the next version, to mitigate the time until it's fixed upstream.

dynobo avatar Dec 18 '21 11:12 dynobo

@inuxor, mind taking v0.2.10 for a test drive?

The spaces should now get removed, but only if you've selected only Chinese languages in the settings: it should work, if you have checked "chi_sim" + "chi_tra", but it won't work if you have checked a non Chinese language as well, e.g. "chi_sim" + "eng".

Hopefully this mitigates the issue until it gets fixed upstream :-)

dynobo avatar Dec 27 '21 19:12 dynobo

What about adding a toggle option for this, because it also happens e.g. on Japanese, but the spaces are not automatically removed.

kik4444 avatar Sep 03 '22 14:09 kik4444

@kik4444, I'd like to avoid an additional toggle and stick to the current strategy:

  • All activated languages in NormCap are "space-less" -> Use heuristic to remove the spaces
  • "Space-less" as well as at least one "western" language is activated -> Don't touch the spaces.

I'm going to add jpn.traineddata to the "space-less" languages and try to improve the removal heuristic.

Stay tuned!

dynobo avatar Sep 03 '22 16:09 dynobo