kanji-frequency

Absence of non-BMP characters

Open JPRidgeway opened this issue 5 years ago • 2 comments

While studying the database, I noticed that it does not contain a single non-BMP character. Is this a consequence of the method used and, if so, would it be possible to check whether any U+2XXXX characters occur in the source data? (Similarly, there are no Compatibility characters in the lists, which leads me to suspect that the data were fully Unicode-normalized before analysis; that irretrievably discards some information, which matters in the Japanese case specifically.)

JPRidgeway, May 16 '20 13:05

You can see an example of a processing method I used here: https://github.com/scriptin/twitter-kanji-frequency/blob/master/collect-data.js

Basically, I did text.replace(/[^\u4e00-\u9fff]+/g, '') to get rid of everything except for desired characters. (Note the RegExp is negated, but you can see the range.)
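To illustrate why that expression drops both non-BMP and Compatibility characters, here is a minimal sketch. The sample string and the helper names (keepBasicBlock, keepHanScript, toCodePoints) are made up for this example and are not taken from collect-data.js. The Script=Han filter is only one possible wider alternative, not necessarily what the project uses. The last line shows what normalization would do to a compatibility ideograph, which is the concern raised in the question.

```javascript
// The filter quoted above: strips everything outside U+4E00..U+9FFF.
// Without the `u` flag the pattern is matched against UTF-16 code units,
// so both surrogate halves of a non-BMP character fall outside the range.
const keepBasicBlock = (text) => text.replace(/[^\u4e00-\u9fff]+/g, '');

// Assumed alternative (hypothetical, not the project's code):
// keep every character whose Unicode script is Han.
const keepHanScript = (text) => text.replace(/[^\p{Script=Han}]+/gu, '');

// Hypothetical sample: U+6F22 (basic block), U+20BB7 (Extension B, non-BMP),
// U+FA19 (compatibility ideograph), U+3042 (hiragana).
const sample = '\u6F22\u{20BB7}\uFA19\u3042';

// Helper for readable output: iterates by code point, not by code unit.
const toCodePoints = (text) =>
  Array.from(text, (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase());

console.log(toCodePoints(keepBasicBlock(sample)));
// [ 'U+6F22' ]
console.log(toCodePoints(keepHanScript(sample)));
// [ 'U+6F22', 'U+20BB7', 'U+FA19' ]
console.log(toCodePoints(keepHanScript(sample.normalize('NFC'))));
// [ 'U+6F22', 'U+20BB7', 'U+795E' ]  (U+FA19 is folded into U+795E)
```

So the missing characters are explained by the range filter alone. Normalization would have a separate effect: it keeps non-BMP characters but replaces compatibility ideographs with their unified counterparts.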

I don't have a quick way to fix this. For the Twitter dataset, I'd have to run the data collection bot again, which would take a lot of time. For the other datasets, I used some one-time scripts, which I believe I still have somewhere. I can dig them up if you're willing to do the processing yourself.

For the news dataset that won't work, because the sites I gathered data from have most likely changed their page structure/markup, so the crawling scripts for them are no longer valid.

Do you think it's an issue? Can you explain why, if that's the case?

scriptin, May 22 '20 17:05

I will include the following Unicode blocks in the next version:

Basic datasets versions:

Extended datasets versions:

Full datasets versions:

The last list may change; I still need to review those blocks.

scriptin, Nov 08 '20 20:11

This issue is addressed in the new version. Files now include non-BMP characters. Additionally, there are "extended" files (with ext in their names) which contain some obscure kanji characters (parenthesised and circled kanji, telegraph symbols, etc.); see han.js for details.

scriptin, Feb 25 '23 16:02