Hiragana conversion does not take context into account (e.g. numbers)
Hi, I'm not sure whether this is a Kawazu or a LibNMeCab issue, but exceptional readings are ignored when converting kanji. For example, converting 三百 (300) outputs さんひゃく (sanhyaku), but the correct reading is さんびゃく (sanbyaku). The same happens with 600 and 900.
I'm currently working on a workaround in my fork: https://github.com/lasyan3/Kawazu/commit/156bf7eb4b587467ee1f1b993da6523a79a71604
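The workaround boils down to a small override table for the readings that kanji-by-kanji conversion gets wrong. A minimal sketch (hypothetical names, not the actual Kawazu code):

```python
# Hypothetical override table: hundreds whose combined readings undergo
# sound changes that a kanji-by-kanji conversion misses.
HUNDREDS_EXCEPTIONS = {
    "三百": "さんびゃく",    # 300: ひゃく -> びゃく (rendaku)
    "六百": "ろっぴゃく",    # 600: gemination + ひゃく -> ぴゃく
    "九百": "きゅうひゃく",  # 900: 九 must read きゅう here, not く
}

def read_number_kanji(word: str, fallback: str) -> str:
    """Return the exceptional reading if one is known, otherwise the
    reading produced by the morphological analyzer."""
    return HUNDREDS_EXCEPTIONS.get(word, fallback)

print(read_number_kanji("三百", "さんひゃく"))  # さんびゃく
print(read_number_kanji("五百", "ごひゃく"))    # no exception: ごひゃく
```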
I see. I think the problem is that MeCab doesn't treat 三百 as a single word. Handling special cases is effective, but it might be hard to cover all of them. Still, it is currently the best way to solve this problem.
I agree, handling special cases one by one might be hard. I'm doing it right now for numbers and time expressions, because I need it for a web app that helps me learn Japanese, and I'll see whether I keep going this way. Anyway, feel free to use the code from my fork if you think it's OK (I had to change JapaneseElement from a struct to a class to be able to set some properties).
Sure. If it works well in your project, please let me know and send me a pull request if possible. Handling cases one by one is better than handling nothing; thanks for the advice.
I think you could use JMDICT: go through each entry in that dictionary and check whether the reading generated by NMeCab matches the one in the dictionary; if not, add it to a list of special cases. That would be faster than finding them manually one by one.
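The proposed audit could be sketched as below. Both inputs are stand-ins: `analyzer_reading` represents whatever function wraps NMeCab, and the dictionary is just (word, expected reading) pairs:

```python
def find_special_cases(dictionary, analyzer_reading):
    """Collect every entry where the analyzer's reading disagrees
    with the dictionary's reading.

    dictionary: iterable of (word, expected_reading) pairs.
    analyzer_reading: function word -> reading (stand-in for NMeCab).
    Returns a list of (word, expected, got) mismatches.
    """
    mismatches = []
    for word, expected in dictionary:
        got = analyzer_reading(word)
        if got != expected:
            mismatches.append((word, expected, got))
    return mismatches

# Toy example with a fake analyzer that reads 三百 kanji-by-kanji:
fake_dict = [("三百", "さんびゃく"), ("日本語", "にほんご")]
naive_readings = {"三百": "さんひゃく", "日本語": "にほんご"}
print(find_special_cases(fake_dict, naive_readings.get))
# -> [('三百', 'さんびゃく', 'さんひゃく')]
```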
Well, I downloaded the Wacton.Desu NuGet package (a .NET port of JMDICT) to run some tests, and I just realized that the issue only seems to appear with kanji representing numbers. With an ordinary sentence, words are divided correctly from each other. For example: 日本語を勉強します --> [日本語] [を] [勉強] [し] [ます], and in that case, as far as I can tell, the reading is correct. But with numbers, every kanji is split off on its own, so the program cannot detect the special readings. If I'm right, we only have to handle the number cases, which is a pretty small set.
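Since the analyzer splits every numeral kanji into its own token, one possible pre-processing step is to merge runs of numeral kanji back into a single token before looking up a reading. A rough sketch (the token list and numeral set are assumptions, not Kawazu's actual data structures):

```python
# Kanji that can form part of a number; runs of these are merged so a
# combined reading can be looked up afterwards.
NUM_KANJI = set("一二三四五六七八九十百千万億")

def group_numbers(tokens):
    """Merge consecutive single numeral kanji into one token,
    e.g. ['三', '百', '円'] -> ['三百', '円']."""
    out, run = [], ""
    for t in tokens:
        if t in NUM_KANJI:
            run += t
        else:
            if run:
                out.append(run)
                run = ""
            out.append(t)
    if run:
        out.append(run)
    return out

print(group_numbers(["三", "百", "円"]))   # ['三百', '円']
print(group_numbers(["日本語", "を"]))     # unchanged: ['日本語', 'を']
```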
Yeah, I think just fixing numbers would be enough. Also, in case you didn't know, counters are treated the same way as numbers. For example, 一人 (ひとり) is divided into two elements, 一 (いち) and 人 (にん), when it should be one word. Likewise, 一回 (いっかい) is divided into 一 (いち) and 回 (かい), and the same goes for a lot of other counters.
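The counter cases fit the same exception-table pattern: a number+counter compound either has an irregular reading that must be looked up whole, or the per-kanji readings can simply be joined. A hedged sketch (the table below covers only the examples mentioned here; a real table would be much larger):

```python
# Hypothetical counter-exception table: compounds whose readings
# cannot be derived kanji-by-kanji.
COUNTER_EXCEPTIONS = {
    "一人": "ひとり",    # not いちにん
    "一回": "いっかい",  # gemination: いち + かい -> いっかい
}

def counter_reading(compound, per_kanji_readings):
    """Return the exceptional reading if known, otherwise join the
    per-kanji readings produced by the analyzer."""
    return COUNTER_EXCEPTIONS.get(compound, "".join(per_kanji_readings))

print(counter_reading("一人", ["いち", "にん"]))  # ひとり
print(counter_reading("三冊", ["さん", "さつ"]))  # regular: さんさつ
```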
Hmm, so maybe the root cause is the way NMeCab parses sentences? Maybe there is a way to make it group counters and numbers correctly; I'll investigate in that direction.
I think the actual problem is in IpaDic, the dictionary NMeCab uses to parse sentences. NMeCab also supports UniDic, which I've heard is better than IpaDic and more up to date, but it's a lot bigger (around 2 GB), and I don't know whether it has this problem with numbers or not.
I think I found a solution for this issue. Maybe not the best one, but it seems to work. I let Kawazu split the sentence and keep only the kanji that end up alone (in the other cases, Kawazu identified the compound and thus the proper reading). Then I use Wacton.Desu to analyze the remaining kanji and compare the readings. You can view the detailed implementation in my repo, in the "desu" branch: https://github.com/lasyan3/Kawazu/commit/93bb51f74e3e7e2c40cd4f8c48d7ada548ef0d0e
I don't think depending on the Wacton library is a good idea, because it uses a lot of RAM (about 460 MB for the Japanese entries). It would be better to run this check in a separate project, collect all the cases where the reading is wrong, save them to a JSON or XML file, and then use that file to catch wrong readings in Kawazu.
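The suggested offline workflow could look like this: the expensive dictionary audit runs once in a separate project and dumps its mismatches to JSON, and the library only loads that small file at runtime. A sketch with sample data standing in for the real audit output:

```python
import json

# Offline step (separate project): sample mismatches standing in for
# the output of the dictionary audit.
mismatches = {"三百": "さんびゃく", "一人": "ひとり"}

with open("reading_overrides.json", "w", encoding="utf-8") as f:
    json.dump(mismatches, f, ensure_ascii=False)

# Runtime step (in the converter): load the table once, then consult
# it for each token, falling back to the analyzer's reading.
with open("reading_overrides.json", encoding="utf-8") as f:
    overrides = json.load(f)

print(overrides.get("三百", "さんひゃく"))  # さんびゃく
print(overrides.get("五百", "ごひゃく"))    # not in the table: ごひゃく
```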
I tried running that test to collect all the cases, but I ended up with 58,109 wrong readings, which seems abnormally large to me. So for now I'll stick with my first idea: handling counters and adding exceptions each time I see a new kind.