
Bias due to kanji repeating within a document

Open GustavoQueipo opened this issue 3 years ago • 4 comments

The kanji frequency lists here are skewed due to the counting methodology, as explained in the cited text below. For more accurate results, the frequency formula should be:

f = (number of documents in which the kanji appears at least once) / (total number of documents)

I suggest either changing the current lists to the new formula, or adding alternate lists that use this formula.
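
For concreteness, here is a minimal Python sketch of both counting methods (my illustration, not code from this repository; the is_kanji check is a crude approximation covering only the main CJK block):

```python
from collections import Counter

def is_kanji(ch):
    # Rough check: CJK Unified Ideographs block only
    # (ignores the rarer extension blocks).
    return '\u4e00' <= ch <= '\u9fff'

def raw_frequency(documents):
    """Current method: every occurrence of every kanji counts."""
    counts = Counter(ch for doc in documents for ch in doc if is_kanji(ch))
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

def document_frequency(documents):
    """Proposed method: each kanji counts at most once per document."""
    counts = Counter()
    for doc in documents:
        counts.update({ch for ch in doc if is_kanji(ch)})
    return {k: n / len(documents) for k, n in counts.items()}
```

With document_frequency, a rare kanji repeated thirty times in one book out of a thousand scores 1/1000 no matter how often it repeats within that book.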

"The methodology for counting the characters is quite not right and tends to favor some kanji. Every table of kanji usage frequency I’ve found online, by Shpika or by others, is made by simply counting the number of times a given kanji is found in a whole text corpus and computing its frequency of occurrence using the total number of kanji in the corpus. However, the resulting data is biased and not really representative of the usage of each kanji, especially for less common ones. The reason for this is that if some uncommon kanji appears in a given book, chances are it appears several times in this book. This is especially the case for character names and place names. Let’s stretch this reasoning to an extreme and consider a book in which a character’s name has a very rare kanji. Let’s say this kanji is so rare that it doesn’t appear in any the other several thousands books in the collection. The character’s name may appear, say, a few dozen times in the whole book. Thus the rare kanji will be counted several dozen times even though it’s never been used by any other author in the collection."

Source: VTRM

GustavoQueipo avatar Nov 05 '20 19:11 GustavoQueipo

Thank you for the link, it's very interesting.

There are biases in both methods, and probably in every method. Specifically, the proposed formula ignores the size of documents. For example, tweets are 140 characters at most, while some books from Aozora Bunko are hundreds of pages long; if some rare kanji appears once in each, it's negligible for the book but probably very significant for the tweet.

That being said, I would rather use both methods in parallel.

  • for Aozora, it would be a simple modification of the script
  • for Twitter, it's also simple, but the data has to be collected again
  • for news, I'd have to find completely new sources of data because the scripts I used in 2015 would not work anymore due to changes in the markup of the websites I scraped
  • for Wikipedia, I'd have to parse the XML; I don't know how difficult that is, but I guess there may be some libraries which do that already (see the sketch below for one possible approach)
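
A rough sketch of that last point using only Python's standard library, assuming a local pages-articles dump in the MediaWiki export schema (the function name is hypothetical):

```python
import xml.etree.ElementTree as ET

def iter_article_texts(dump_path):
    """Stream article texts from a MediaWiki XML dump
    without loading the whole file into memory."""
    for _, elem in ET.iterparse(dump_path, events=('end',)):
        # Tags carry the export-schema namespace, so match by suffix.
        if elem.tag.endswith('}page'):
            text = elem.find('.//{*}text')  # '{*}' wildcard needs Python 3.8+
            if text is not None and text.text:
                yield text.text
            elem.clear()  # free finished pages as we go
```

The yielded text is still raw wiki markup, so templates and links would need to be stripped before counting.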

In short, it's a lot of work. I will look into that, but I don't promise anything.

I'll update the readme to include the link to this awesome article and its dataset.

scriptin avatar Nov 06 '20 19:11 scriptin

This will turn into an interesting conversation about statistics and linguistics!

You are right that both methods are biased in their own way; this is inevitable whatever method is used. The corpus of text introduces its own bias, and the counting formula only adds to it. The definition of "frequency" is subjective to an extent, and so it necessarily introduces a bias.

An example: in Chinese, the character 哈 (an onomatopoeia for the sound of laughing, equivalent to English "ha") is used in casual written contexts. It can appear repeated any number of times, as in "哈哈哈哈", much like we would write "hahahaha" in English. Check the Twitter link below; I am sure you can find other tweets consisting of 140 repetitions of that same character. We could argue that counting the character once would make more sense, but this is of course subjective. https://twitter.com/xianduguaitan/status/1156238676613054464
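
If one did want to normalize such repetition before counting, a single regex pass could collapse runs of the same character; whether to do so is exactly the subjective choice described above (a sketch, assuming per-document preprocessing):

```python
import re

def collapse_runs(text):
    # "哈哈哈哈" -> "哈"; "hahahaha" is untouched, since no single
    # character repeats consecutively there.
    return re.sub(r'(.)\1+', r'\1', text)

collapse_runs('哈哈哈哈！')  # -> '哈！'
```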

Rather than saying that the method you used is skewed, I should have said that I expect the counting method I proposed to produce results more representative of an "ideal frequency" in normal, natural language usage. My reasoning is that, in statistics, when considering an average frequency of occurrence, it is assumed that the probability of occurrence is independent of past occurrences. Natural language text does not obey this principle, as it shows strong locality: a topic, concept, or quality under discussion will often be mentioned repeatedly. Once a word has occurred once, it is more likely to appear again shortly afterwards, with its probability of occurrence diminishing as the distance from its last appearance increases.

Regarding the issue of ignoring the size of documents in the new formula, I think it is not a problem. The size is still being taken into account implicitly:

  1. Very short documents will only contribute to the frequency count of a few characters.
  2. Very long documents will contribute to the frequency count of a large proportion of the characters.
  3. For each data set, all documents can be considered of a similar magnitude of size (tweets and hundreds-of-pages long texts are in different data sets).

Note we need to assume that every document in each corpus analysed has the same significance (otherwise, we would have to assign weights to each document). Also note that, in the case of Twitter, a "document" should consist of a whole conversation including the original tweet and all replies, due to the principle of locality discussed earlier. Care should be taken not to count each reply twice, both as part of a conversation and as a standalone post.

This is all just a hypothesis without any proof. It would thus be interesting to have both counting methods in parallel, to be able to compare the results and use whichever is appropriate for each situation.

The more we think about it, the more complicated it is going to get. This is an extremely complex topic; otherwise, we would by now have official frequency lists for languages that are over a thousand years old. The good thing is that even a naive approach produces fairly good results. Whatever the method, with a few exceptions, common characters will always be at the top and rare characters at the bottom :)

About the modifications, I would say go for the easy Twitter one, and then consider looking at the others if and when you find the time. For Aozora we already have the result set, so it's not a big loss if you don't do it. XML is a pain to parse, so there are a million libraries to handle that.

GustavoQueipo avatar Nov 07 '20 04:11 GustavoQueipo

Your example with the 哈 character makes a lot of sense. And actually, it reminds me of one more known issue: in the Twitter dataset we see characters that are used primarily in kaomoji (text emoticons):

  • ( ^ω^)个 (umbrella/flower?)
  • U^皿^U (grin/teeth, mustache?)
  • ( ’ω’)旦~~ (cup)
  • (╯°益°)╯ (rage face)
  • (oT-T)尸 (flag)
  • (ノ><)ノ (arms)

There are more, and it's a nightmare to filter. I excluded some of those while working on the topokanji project.
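
One rough heuristic for the filtering (my assumption, not what the topokanji filtering actually did): only count a kanji when it has at least one kana or kanji neighbour, since kaomoji typically surround their kanji with symbols. This over-filters genuinely isolated kanji, so it is only a sketch:

```python
import re

JP = re.compile(r'[\u3040-\u30ff\u4e00-\u9fff]')  # kana + common kanji

def kanji_in_running_text(text):
    """Yield kanji that have at least one Japanese neighbour.
    旦 inside ( ’ω’)旦~~ fails the test; kanji inside an
    ordinary Japanese sentence pass."""
    for i, ch in enumerate(text):
        if not '\u4e00' <= ch <= '\u9fff':
            continue
        neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
        if JP.search(neighbours):
            yield ch
```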


For each data set, all documents can be considered of a similar magnitude of size (tweets and hundreds-of-pages long texts are in different data sets).

This is actually only true for Twitter and news (news articles are usually about the same size within each particular website). Aozora has some very short stories as well as very long ones, and article length on Wikipedia varies a lot too.


Also note that, in the case of Twitter, a "document" should consist of a whole conversation including the original tweet and all replies, due to the principle of locality discussed earlier. Care should be taken not to count each reply twice, both as part of a conversation and as a standalone post.

I would rather just count replies as separate tweets. I understand that tweets with replies are basically dialogues on the same topic, but some "origin" tweets are themselves replies to something outside Twitter, such as the external articles/videos/events they comment on. And by the same argument we would have to include the comment sections of news sites, since those are replies too.

So, any attempt to group tweets into dialogues will have some serious logical flaws anyway, and not attempting it is probably the best option.

Or, another way would be to ignore replies completely. What do you think about this idea?

I would say go for the easy Twitter one

If you count replies to a tweet and the tweet itself as a single document, it's actually really complicated. In the streaming API, tweets come in as they appear, not in any reply-based order. I think there is a way to tell whether a tweet is a reply, but matching replies to their corresponding original tweets is a much harder problem that requires storing tweet data - at least an ID with a set of unique kanji for each tweet, in case replies to it arrive later - so that we could merge in the data from the replies. We would need a whole new beast, a Twitter bot, to do that. The bookkeeping would look something like the sketch below.
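
A sketch of that bookkeeping, with hypothetical names, assuming the stream exposes each tweet's ID and, for replies, the ID of the tweet being answered:

```python
conversations = {}  # root tweet ID -> set of kanji seen in the thread
tweet_root = {}     # tweet ID -> ID of its conversation root

def ingest(tweet_id, reply_to_id, kanji):
    """Merge a tweet's unique kanji into its conversation's set."""
    if reply_to_id is None:
        root = tweet_id
    else:
        # The parent may not have passed through the stream yet,
        # so an orphaned reply falls back to using the parent's ID.
        root = tweet_root.get(reply_to_id, reply_to_id)
    tweet_root[tweet_id] = root
    conversations.setdefault(root, set()).update(kanji)
```

Both dictionaries grow without bound, which is exactly why this needs a long-running bot with persistent storage rather than a one-off script.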

That's another argument for counting each tweet separately. Unless you have some technical insight that could make it simpler.

scriptin avatar Nov 07 '20 14:11 scriptin

another way would be to ignore replies completely. What do you think about this idea?

That sounds like a good compromise.

Aozora has some very short stories as well as very long ones, and article length on Wikipedia varies a lot too.

I think that is fine. If each kanji is only counted once per document, very long documents will not skew the results by having some rare kanji appear repeatedly, and very short documents can only contribute to the counts of a few characters. I think it is a good self-balancing method.

GustavoQueipo avatar Nov 17 '20 10:11 GustavoQueipo