Some Japanese typographic symbols are counted while others are not

Open pinntokuru opened this issue 8 months ago • 1 comments

In calculations.js, the ignore variable contains a list of typographic symbols to ignore for character counting purposes. I've found two relatively common characters that are not in this list: the fullwidth full stop . (U+FF0E) and the katakana middle dot ・ (U+30FB). The middle dot also appears on the Wikipedia page that the code refers to (https://en.wikipedia.org/wiki/List_of_Japanese_typographic_symbols).

I'm sure there are many other characters that can show up that this list does not cover, and adding them all one by one is not very feasible. So I think going the other way and using a regex allow list is a better idea.

I went through the Unicode blocks for Japanese and Roman characters and came up with a set of ranges. The blocks contain special marks as well as characters that should be counted, so instead of using entire blocks I took only the parts that should count as characters.

Hiragana U+3041 to U+3096

Katakana U+30A1 to U+30FA

Fullwidth numbers U+FF10 to U+FF19

Fullwidth Roman uppercase letters U+FF21 to U+FF3A

Fullwidth Roman lowercase letters U+FF41 to U+FF5A

Half-width Katakana (not sure if it should be included) U+FF66 to U+FF9D

CJK unified ideographs - Common and uncommon kanji: U+4E00 to U+9FAF

CJK unified ideographs Extension A - Rare kanji: U+3400 to U+4DBF

Or, another idea might be to make it so the user can provide their own characters or regex to match in the settings page?
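
A minimal sketch of that settings idea, assuming the user's regex is stored as a plain source string: makeCounter and DEFAULT_SOURCE are hypothetical names, and the default shown here covers only a few of the ranges for brevity.

```javascript
// Hypothetical default, used when the user setting is missing or invalid.
const DEFAULT_SOURCE = "[\\u3041-\\u3096\\u30A1-\\u30FA\\u4E00-\\u9FAF]";

// Build a counting function from a user-supplied regex source string.
function makeCounter(userSource) {
  let regex;
  try {
    // Treat an empty setting the same as an invalid one.
    if (!userSource) throw new SyntaxError("empty pattern");
    regex = new RegExp(userSource, "gu");
  } catch (err) {
    // Fall back to the default allow list if the pattern does not compile.
    regex = new RegExp(DEFAULT_SOURCE, "gu");
  }
  // Count every match of the pattern in the given text.
  return (text) => (text.match(regex) || []).length;
}
```

Falling back to a built-in pattern means a typo in the settings page would not break character counting entirely.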

pinntokuru, Jun 08 '24 07:06