normcap Spaces between Chinese characters

Great job! But we do not use spaces between characters in Chinese. I guess Japanese and some other Asian languages have the same problem.

Dec 13 '21 11:12 inuxor

Hi @inuxor, thanks for the comment! Could you please provide some more information? That would help me fix this issue:

1. Does this also apply to line breaks and/or paragraphs? (How do sentences and paragraphs get separated?)
2. Afaik there are two Chinese languages for Tesseract: `chi_sim` and `chi_tra`. Does the issue apply to both of them?
3. Please provide an example screenshot of some medium-length text (where NormCap added spaces). The expected output would be super useful, too.

With that information, it should be pretty trivial for me to fix :-)

PS: It seems quite strange to me that Tesseract doesn't respect that by default. If anyone has insights about that, I'd appreciate it.

Dec 15 '21 18:12 dynobo

傅平安看到母亲骂大街的时候哭了，我也泪目了，那是一种少年终于发现父母无力保护自己时的悲伤，少年踏上社会，才知道这个社会有多残酷无情，曾经像天一样的父母其实只是普通人，在滚滚红尘中他们自顾不暇，幼鹰出巢的那一刻起，能依靠的就只有自己的翅膀。

信平安看到母亲骂大街的时候哭了 , 英那是一种少年终于发现父母无力保护自己时的悲伤 , 少年踏上社会 , 才知道这个残经像天一样的父母其实只是普通人 , 在滚滚红尘中他们自顾不醒 , 幼筑出巢的那一刻起 , 能依靠的就习有自己的翅脂

Dec 17 '21 11:12 inuxor

This indeed seems to be caused by an open issue in tesseract: https://github.com/tesseract-ocr/tesseract/issues/2702

I'll try adding a simple heuristic to remove the superfluous spaces in the next version, as a stopgap until it's fixed upstream.
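For illustration, such a heuristic could look roughly like the sketch below. This is not NormCap's actual code: the function name `strip_cjk_spaces` and the chosen Unicode ranges are assumptions made for this example. It drops any run of spaces or tabs that directly touches a CJK character and leaves line breaks alone.

```python
import re

# Assumption: a rough, non-exhaustive set of ranges covering CJK punctuation,
# kana, common CJK ideographs and fullwidth forms.
_CJK = r"\u3000-\u303f\u3040-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Match spaces/tabs that directly follow or directly precede a CJK character.
_SPACE_NEXT_TO_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+|[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces adjacent to CJK characters (hypothetical helper)."""
    return _SPACE_NEXT_TO_CJK.sub("", text)


print(strip_cjk_spaces("哭了 , 英那是"))
```

Note that this also removes the space between a CJK character and an embedded Latin word, so it only makes sense when the text is expected to be purely "space-less".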

Dec 18 '21 11:12 dynobo

@inuxor, mind taking v0.2.10 for a test drive?

The spaces should now get removed, but only if all languages selected in the settings are Chinese: it works if you have checked "chi_sim" + "chi_tra", but it won't work if you have also checked a non-Chinese language, e.g. "chi_sim" + "eng".

Hopefully this mitigates the issue until it gets fixed upstream :-)

Dec 27 '21 19:12 dynobo

What about adding a toggle option for this? It also happens with other languages, e.g. Japanese, but there the spaces are not automatically removed.

Sep 03 '22 14:09 kik4444

@kik4444, I'd like to avoid an additional toggle and stick to the current strategy:

1. All activated languages in NormCap are "space-less" -> Use a heuristic to remove the spaces.
2. At least one "space-less" language and at least one "western" language are activated -> Don't touch the spaces.

I'm going to add jpn.traineddata to the "space-less" languages and try to improve the removal heuristic.
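As a rough sketch of that strategy (not the actual NormCap implementation; `SPACELESS_LANGUAGES` and `should_strip_spaces` are names made up for this example), the decision boils down to a subset check on the activated language codes:

```python
# Assumption: the set of Tesseract language codes treated as "space-less".
SPACELESS_LANGUAGES = {"chi_sim", "chi_tra", "jpn"}


def should_strip_spaces(active_languages) -> bool:
    """Return True only if every activated language is "space-less"."""
    langs = set(active_languages)
    # An empty selection never triggers the heuristic.
    return bool(langs) and langs <= SPACELESS_LANGUAGES


print(should_strip_spaces(["chi_sim", "chi_tra"]))  # all space-less
print(should_strip_spaces(["chi_sim", "eng"]))      # mixed with a western language
```

The upside of this approach is that no extra setting is needed; the downside is that mixed selections such as "jpn" + "eng" keep the superfluous spaces.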

Stay tuned!

Sep 03 '22 16:09 dynobo
