exSTATic Some Japanese typographic symbols are counted while others are not

Open • pinntokuru opened this issue 8 months ago • 1 comment

In calculations.js, the ignore variable holds a list of typographic symbols that are excluded from character counting. I've found two relatively common characters that are missing from this list: the fullwidth full stop ． (U+FF0E) and the katakana middle dot ・ (U+30FB). The middle dot even appears on the Wikipedia page that the code refers to (https://en.wikipedia.org/wiki/List_of_Japanese_typographic_symbols).

I'm sure many other characters can show up that this list does not cover, and adding them one by one is not feasible. Given that, I think going the other way and using a regex allow list is a better idea.

I went through the Unicode blocks for Japanese and Roman characters and came up with a set of ranges. The blocks contain special marks as well as characters that should be counted, so instead of using entire blocks I took only the parts that should count as characters.

Hiragana U+3041 to U+3096

Katakana U+30A1 to U+30FA

Fullwidth Numbers U+FF10 to U+FF19

Fullwidth Roman Uppercase Letters U+FF21 to U+FF3A

Fullwidth Roman Lowercase Letters U+FF41 to U+FF5A

Half-width Katakana (not sure if should be included) U+FF66 to U+FF9D

CJK unified ideographs - Common and uncommon kanji: U+4E00 to U+9FAF

CJK unified ideographs Extension A - Rare kanji: U+3400 to U+4DBF
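To make the allow-list idea concrete, here is a minimal sketch of how the ranges above could be combined into a single regex. The names COUNTED_CHARS and countAllowedChars are hypothetical and do not come from calculations.js; this only illustrates the approach, not how exSTATic actually counts characters.

```javascript
// Hypothetical allow list built from the ranges listed above.
// Not taken from calculations.js; names are for illustration only.
const COUNTED_CHARS = new RegExp(
  "[" +
    "\\u3041-\\u3096" + // hiragana
    "\\u30A1-\\u30FA" + // katakana
    "\\uFF10-\\uFF19" + // fullwidth digits
    "\\uFF21-\\uFF3A" + // fullwidth uppercase letters
    "\\uFF41-\\uFF5A" + // fullwidth lowercase letters
    "\\uFF66-\\uFF9D" + // half-width katakana (optional, per the list above)
    "\\u4E00-\\u9FAF" + // CJK unified ideographs
    "\\u3400-\\u4DBF" + // CJK unified ideographs Extension A
    "]",
  "g"
);

// Count only the characters that match the allow list.
function countAllowedChars(text) {
  const matches = text.match(COUNTED_CHARS);
  return matches === null ? 0 : matches.length;
}

// Neither ・ (U+30FB) nor ． (U+FF0E) falls in any range, so both are skipped.
console.log(countAllowedChars("これは、テスト・です．"));
```

One thing to be aware of with these exact ranges: the long vowel mark ー (U+30FC) sits just outside the katakana range, so it would not be counted unless it is added separately.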

Or, another idea might be to let the user provide their own characters or regex to match in the settings page?
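A rough sketch of what the user-provided option could look like, assuming the setting arrives as a plain pattern string. The names DEFAULT_PATTERN and makeCounter are made up for illustration, and how the setting would be stored or read in exSTATic is left out entirely. An invalid pattern falls back to a built-in default instead of breaking the count.

```javascript
// Hypothetical default pattern (shortened here to kana and common kanji).
const DEFAULT_PATTERN = "[\\u3041-\\u3096\\u30A1-\\u30FA\\u4E00-\\u9FAF]";

// Build a counting function from a user-supplied pattern string.
// Assumption: the settings page hands over the pattern as a string.
function makeCounter(userPattern) {
  let regex;
  try {
    regex = new RegExp(userPattern || DEFAULT_PATTERN, "gu");
  } catch (err) {
    // new RegExp throws a SyntaxError on an invalid pattern; use the default.
    regex = new RegExp(DEFAULT_PATTERN, "gu");
  }
  return (text) => {
    const matches = text.match(regex);
    return matches === null ? 0 : matches.length;
  };
}

const countDefault = makeCounter();
const countHiraganaOnly = makeCounter("[\\u3041-\\u3096]");
console.log(countDefault("ひらがな・カタカナ"), countHiraganaOnly("ひらがな・カタカナ"));
```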

Jun 08 '24 07:06 pinntokuru
