kanji-frequency
Absence of non-BMP characters
While studying the database, I noticed that it does not contain a single non-BMP character. Is this a consequence of the method used and, if so, would it be possible to ascertain whether any U+2XXXX characters occurred in the source data? (Similarly, there are no Compatibility characters in the lists, which leads me to suspect that the data were fully Unicode-normalized before analysis, which irretrievably destroys some information, specifically in the Japanese case.)
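To illustrate the normalization concern, here is a minimal sketch using the standard String.prototype.normalize. It is only an illustration of what normalization would do to a compatibility ideograph, not a claim about how the datasets were actually processed; the variable names are made up for the example.

```javascript
// U+F900 is a CJK Compatibility Ideograph whose canonical decomposition
// is the unified ideograph U+8C48.
const compat = '\uF900';
const normalized = compat.normalize('NFC');

// After normalization the compatibility code point is gone; only the
// unified ideograph remains, and the original distinction cannot be recovered.
const wasFolded = normalized === '\u8C48';
const hex = normalized.codePointAt(0).toString(16);
console.log(wasFolded, hex);
```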
You can see an example of a processing method I used here: https://github.com/scriptin/twitter-kanji-frequency/blob/master/collect-data.js
Basically, I did text.replace(/[^\u4e00-\u9fff]+/g, '') to get rid of everything except for desired characters. (Note the RegExp is negated, but you can see the range.)
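A small sketch of what that replace call does to characters outside U+4E00..U+9FFF. The helper name and the sample string are made up for illustration; only the regular expression comes from the script above.

```javascript
// Same filter as quoted above: drop everything outside U+4E00..U+9FFF.
// Without the `u` flag the pattern works on UTF-16 code units, so a non-BMP
// character is seen as two surrogates, both outside the range.
const keepBasicKanji = (text) => text.replace(/[^\u4e00-\u9fff]+/g, '');

// Sample contains U+20BB7 (Extension B, non-BMP) and U+F900 (compatibility).
const sample = '漢字\u{20BB7}野家\uF900';
const filtered = keepBasicKanji(sample);

console.log(filtered); // 漢字野家
```

So the absence of non-BMP and compatibility characters follows from the range itself, with no normalization step involved.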
I don't have a quick way to fix this. Obviously, for the Twitter dataset, I'd have to run the data collection bot again, which would take a lot of time. For the other datasets, I used some one-time scripts, which I'd have to find (I believe I still have them somewhere). I can dig them up if you need them, in case you're willing to do the processing yourself.
For the news dataset that won't work, because the sites I used to gather data from most likely have changed their page structure/markup, and crawling scripts for them are no longer valid.
Do you think it's an issue? Can you explain why, if that's the case?
I will include the following Unicode blocks in the next version:
Basic datasets versions:
- CJK Unified Ideographs - in the current version, that is the only Unicode block included
Extended datasets versions:
- Everything from basic
- CJK Unified Ideographs Extension A
- CJK Unified Ideographs Extension B
- CJK Unified Ideographs Extension C
- CJK Unified Ideographs Extension D
- CJK Unified Ideographs Extension E
- CJK Unified Ideographs Extension F
- CJK Unified Ideographs Extension G
- CJK Unified Ideographs Extension H
Full datasets versions:
- Everything from extended
- CJK Radicals Supplement
- Kangxi Radicals
- CJK Strokes
- Enclosed CJK Letters and Months - partially, only those containing kanji
- CJK Compatibility Ideographs
- Enclosed Ideographic Supplement - partially, only those containing kanji
- CJK Compatibility Ideographs Supplement
The last list may change; I need to review those blocks.
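A rough sketch of how the basic and extended filters could look, assuming the code point ranges from the Unicode block definitions. The names BLOCKS and buildFilter are hypothetical and not taken from the repository; the "full" set is left out because its partial blocks still need review.

```javascript
// Hypothetical table of block ranges (inclusive), per the Unicode block list.
const BLOCKS = {
  basic: [[0x4e00, 0x9fff]],   // CJK Unified Ideographs
  extended: [
    [0x4e00, 0x9fff],          // CJK Unified Ideographs
    [0x3400, 0x4dbf],          // Extension A
    [0x20000, 0x2a6df],        // Extension B
    [0x2a700, 0x2b73f],        // Extension C
    [0x2b740, 0x2b81f],        // Extension D
    [0x2b820, 0x2ceaf],        // Extension E
    [0x2ceb0, 0x2ebef],        // Extension F
    [0x30000, 0x3134f],        // Extension G
    [0x31350, 0x323af],        // Extension H
  ],
};

// Build a negated character class. The `u` flag makes the pattern match
// whole code points, so ranges above U+FFFF work as expected.
function buildFilter(ranges) {
  const cls = ranges
    .map(([lo, hi]) => `\\u{${lo.toString(16)}}-\\u{${hi.toString(16)}}`)
    .join('');
  return new RegExp(`[^${cls}]+`, 'gu');
}

const text = '漢\u{20BB7}\u3400a\uF900';
const basicOnly = text.replace(buildFilter(BLOCKS.basic), '');
const extendedOnly = text.replace(buildFilter(BLOCKS.extended), '');

console.log(basicOnly);    // 漢
console.log(extendedOnly); // 漢, U+20BB7 and U+3400 are kept
```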
This issue is addressed in the new version. Files now include non-BMP characters. Additionally, there are "extended" files (with ext in their names) which contain some obscure kanji characters - parenthesised and circled kanji, telegraph symbols, etc. - see han.js for details.