rikaikun
Recognize colloquial corruptions such as あい/おい → ええ
This might prove useful to those of us who do much of our immersion with anime: colloquial speech contains many common patterns of corruption, such as the vowel pairs ーあい and ーおい being rendered ーええ (e.g., in stereotypically masculine/tough-guy speech).
It occurs to me that this could be detected in much the same way that conjugations of verbs and adjectives are currently detected, perhaps using a tag like "< masc" or something along those lines. There are probably many more patterns than just the two, but those are the two I can remember off the top of my head.
Thanks!
This isn't a bad idea, though the two words I hear the most (すげえ やべえ) are already in the dictionary as separate entries. What are some other examples we can use for testing?
The double え isn't that common otherwise, so I don't think false positives would be a problem (though they happen with regular verb conjugations as well).
I will say that this would probably be lower priority than some other stuff in the queue, though I am trying to actually make consistent improvements to rikaikun these days.
In my experience, characters that do it will tend to do it with a wide variety of words, and seemingly at random. Examples I can think of off the top of my head are しつけえ, おせえ, うるせえ, しらねえ, かっけえ. By far the most common transformation is -ない to -ねえ, so that one detection alone would take care of a lot, but it probably wouldn't be feasible to handle other words on a word-by-word basis, since there doesn't appear to be any consistent pattern to which words get the treatment; you'd just have to take words ending in ええ, change those endings to あい/おい, and see if they then match existing dictionary entries.
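To make the idea concrete, here is a rough sketch of that lookup in TypeScript. It is not rikaikun's actual deinflection code: `E_ROW_SOURCES`, `restoreCandidates`, and `lookupColloquial` are hypothetical names, and the dictionary is assumed to be a plain `Set` of kana readings. It treats "ending in ええ" as "an e-row kana followed by え (or ぇ/ー)", so げえ maps back to がい and ごい.

```typescript
// Hypothetical table: each e-row kana mapped to the a-row and o-row kana
// it may have been contracted from (e.g. げ <- が or ご).
const E_ROW_SOURCES: Record<string, [string, string]> = {
  "え": ["あ", "お"],
  "け": ["か", "こ"],
  "せ": ["さ", "そ"],
  "て": ["た", "と"],
  "ね": ["な", "の"],
  "へ": ["は", "ほ"],
  "め": ["ま", "も"],
  "れ": ["ら", "ろ"],
  "げ": ["が", "ご"],
  "ぜ": ["ざ", "ぞ"],
  "で": ["だ", "ど"],
  "べ": ["ば", "ぼ"],
  "ぺ": ["ぱ", "ぽ"],
};

// Given a word such as すげえ, return the possible uncontracted forms
// (すがい, すごい). Returns an empty list if the word does not end in
// an e-row kana followed by え, ぇ, or ー.
function restoreCandidates(word: string): string[] {
  const chars = Array.from(word);
  if (chars.length < 2) return [];
  const last = chars[chars.length - 1];
  const prev = chars[chars.length - 2];
  if (last !== "え" && last !== "ぇ" && last !== "ー") return [];
  const sources = E_ROW_SOURCES[prev];
  if (!sources) return [];
  const stem = chars.slice(0, -2).join("");
  return sources.map((kana) => stem + kana + "い");
}

// Keep only the candidates that exist in the (assumed) dictionary set.
function lookupColloquial(word: string, dictionary: Set<string>): string[] {
  return restoreCandidates(word).filter((c) => dictionary.has(c));
}

// Usage with a toy dictionary:
const toyDictionary = new Set(["すごい", "うるさい", "しらない"]);
console.log(lookupColloquial("うるせえ", toyDictionary));
```

Note that this simple rule would not recover かっけえ, since that comes from かっこいい (こいい rather than こい), so a case like that would need its own rule.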
Two characters that make good exemplars of this are Inuyasha and Son Goku. It certainly seems to be a 少年 thing.
On an unrelated note, I see that you added that bit of code to prevent that font issue from happening again. That's awesome, and I really admire your continuing dedication to the project.
Thanks for the extra context. You're right that it needs to be generic; I was spacing on more examples but definitely can think of some.
Adding these directly isn't too bad, but since you actually have to add a mapping for each hiragana ending in あ or お, the more sustainable approach would be to set up a script that generates deinflect.dat from higher-level rules. That will ensure fewer mistakes when updating as well.
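As a rough illustration of what such a generator could look like, here is a TypeScript sketch that expands one high-level rule ("a-row or o-row kana + い becomes e-row kana + え") into one mapping per kana. Everything here is an assumption for illustration: `KANA_ROWS`, `ColloquialRule`, `generateRules`, and `formatRules` are made-up names, the "masc" tag is just the label floated earlier in this thread, and the tab-separated output is a placeholder rather than the real deinflect.dat layout.

```typescript
// One generated mapping: a colloquial ending, the ending it came from,
// and a reason tag to show in the popup.
interface ColloquialRule {
  from: string;
  to: string;
  reason: string;
}

// Each entry is [a-row kana, o-row kana, e-row kana] for one consonant.
const KANA_ROWS: [string, string, string][] = [
  ["あ", "お", "え"],
  ["か", "こ", "け"],
  ["さ", "そ", "せ"],
  ["た", "と", "て"],
  ["な", "の", "ね"],
  ["は", "ほ", "へ"],
  ["ま", "も", "め"],
  ["ら", "ろ", "れ"],
  ["が", "ご", "げ"],
  ["ざ", "ぞ", "ぜ"],
  ["だ", "ど", "で"],
  ["ば", "ぼ", "べ"],
  ["ぱ", "ぽ", "ぺ"],
];

// Expand the single high-level rule into two mappings per kana row:
// e-row + え -> a-row + い, and e-row + え -> o-row + い.
function generateRules(reason: string = "masc"): ColloquialRule[] {
  const rules: ColloquialRule[] = [];
  for (const [a, o, e] of KANA_ROWS) {
    rules.push({ from: e + "え", to: a + "い", reason });
    rules.push({ from: e + "え", to: o + "い", reason });
  }
  return rules;
}

// Serialize as tab-separated lines. This layout is a placeholder only;
// it would need to be adapted to whatever deinflect.dat actually expects.
function formatRules(rules: ColloquialRule[]): string {
  return rules.map((r) => [r.from, r.to, r.reason].join("\t")).join("\n");
}

console.log(formatRules(generateRules()));
```

The appeal of generating the list is that adding a new pattern means adding one rule function instead of hand-editing a couple of dozen lines.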
I'll make a separate issue for that. Thanks again for the feedback.