exSTATic
Some Japanese typographic symbols are counted while others are not
In calculations.js, the ignore variable contains a list of typographic symbols to skip when counting characters. I've found two relatively common characters that are not in this list: the fullwidth full stop ． (U+FF0E) and the katakana middle dot ・ (U+30FB). The middle dot is also listed on the Wikipedia page that the code refers to (https://en.wikipedia.org/wiki/List_of_Japanese_typographic_symbols).
I'm sure many other characters can show up that this list does not cover, and adding them all one by one is not feasible. Going the other way and using a regex allow list seems like a better idea.
I went through the Japanese and Roman Unicode blocks and came up with a set of ranges. The blocks contain special marks as well as characters that should be counted, so instead of using each entire block I took only the parts that should count as characters.
Hiragana U+3041 to U+3096
Katakana U+30A1 to U+30FA
Fullwidth Numbers U+FF10 to U+FF19
Fullwidth Roman Uppercase Letters U+FF21 to U+FF3A
Fullwidth Roman Lowercase Letters U+FF41 to U+FF5A
Half-width Katakana (not sure if these should be included) U+FF66 to U+FF9D
CJK unified ideographs - Common and uncommon kanji: U+4E00 to U+9FAF
CJK unified ideographs Extension A - Rare kanji: U+3400 to U+4DBF
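To make the proposal concrete, here is a minimal sketch of what an allow list built from the ranges above could look like. The names COUNTED_CHARACTERS and countCharacters are hypothetical and are not taken from calculations.js. The sketch only shows the idea of counting what matches rather than ignoring what doesn't.

```javascript
// Hypothetical allow list: only characters inside these ranges are counted.
// The ranges are the ones listed above; the names are illustrative only.
const COUNTED_CHARACTERS = new RegExp(
  "[" +
    "\\u3041-\\u3096" + // hiragana
    "\\u30A1-\\u30FA" + // katakana
    "\\uFF10-\\uFF19" + // fullwidth digits
    "\\uFF21-\\uFF3A" + // fullwidth Roman uppercase letters
    "\\uFF41-\\uFF5A" + // fullwidth Roman lowercase letters
    "\\uFF66-\\uFF9D" + // half-width katakana (optional, see above)
    "\\u4E00-\\u9FAF" + // CJK unified ideographs
    "\\u3400-\\u4DBF" + // CJK unified ideographs Extension A
  "]",
  "g"
);

// Count only the characters that fall inside the allow list.
function countCharacters(text) {
  const matches = text.match(COUNTED_CHARACTERS);
  return matches === null ? 0 : matches.length;
}

console.log(countCharacters("こんにちは。")); // 5, the ideographic full stop is skipped
console.log(countCharacters("日本語．"));     // 3, the fullwidth full stop is skipped
```

One side effect of these exact ranges: the prolonged sound mark ー (U+30FC) sits just outside the katakana range, so a word like コーヒー would count as 2 characters. Whether that is desirable is a separate decision.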
Alternatively, could the settings page let users provide their own characters or regex to match?
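A rough sketch of how a user-supplied pattern might be handled, assuming the setting is stored as a plain string. makeCharacterCounter and both of its parameters are hypothetical names, and nothing here reflects how the extension actually stores its settings. An invalid or empty pattern falls back to a default instead of breaking the count.

```javascript
// Hypothetical helper: build a counting function from a user-supplied pattern.
// Falls back to fallbackPattern when the user pattern is empty or invalid.
function makeCharacterCounter(userPattern, fallbackPattern) {
  let regex;
  try {
    if (typeof userPattern !== "string" || userPattern.length === 0) {
      throw new SyntaxError("empty pattern");
    }
    regex = new RegExp(userPattern, "gu");
  } catch (error) {
    regex = new RegExp(fallbackPattern, "gu");
  }
  // Sum the code points of every match, so a pattern such as "[ぁ-ゖ]+"
  // still counts individual characters and empty matches add nothing.
  return (text) =>
    (text.match(regex) || []).reduce((total, m) => total + [...m].length, 0);
}

const countHiragana = makeCharacterCounter("[\\u3041-\\u3096]", "[\\u4E00-\\u9FAF]");
console.log(countHiragana("ひらがな漢字")); // 4
```

The "u" flag is an assumption here. It makes the regex stricter about escapes, so some patterns a user might type would be rejected and replaced by the fallback.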