
Recognize colloquial corruptions such as あい/おい → ええ

Open ChocoChopin opened this issue 4 years ago • 3 comments

This might prove useful to those of us who do much of our immersion with anime: colloquial speech contains many common patterns of corruption, such as the vowel pairs ーあい and ーおい being rendered ーええ (e.g., in stereotypically masculine/tough-guy speech).

It occurs to me that this could be detected in much the same way that conjugations of verbs and adjectives are currently detected, perhaps using a tag like "< masc" or something along those lines. There are probably many more patterns than just the two, but those are the two I can remember off the top of my head.

ChocoChopin • Sep 02 '20 15:09

Thanks!

This isn't a bad idea, though the two words I hear the most (すげえ and やべえ) are already in the dictionary as separate entries. What are some other examples we can use for testing?

The double え isn't that common otherwise, so I don't think false positives would be a problem (though they happen with regular verb conjugations as well).

I will say that this would probably be lower priority than some other stuff in the queue, though I am trying to actually make consistent improvements to rikaikun these days.

melink14 • Sep 03 '20 00:09

In my experience, characters that do it will tend to do it with a wide variety of words, and seemingly at random--examples I can think of off the top of my head are しつけえ, おせえ, うるせえ, しらねえ, かっけえ. By far the most common transformation is -ない to -ねえ, so that one detection alone would take care of a lot, but it probably wouldn't be feasible to do other words on a word-by-word basis since there doesn't appear to be any consistent pattern to which words get the treatment; you'd just have to take words ending in ええ, change those endings to あい/おい, and see if they then match existing dictionary entries.
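A rough sketch of that reverse lookup, in Python for illustration only (this is not rikaikun's code; the function names and the kana table are assumptions, and the table is not exhaustive). It treats the "ええ" ending as an e-row kana followed by え, swaps in the a-row and o-row kana of the same consonant followed by い, and keeps only the candidates found in a dictionary:

```python
# Hypothetical helper tables: each e-row kana lines up with the a-row and
# o-row kana of the same consonant. Not exhaustive (e.g. no わ/や/よ cases).
E_ROW = "えけげせぜてでねへべぺめれ"
A_ROW = "あかがさざただなはばぱまら"
O_ROW = "おこごそぞとどのほぼぽもろ"

E_TO_SOURCES = {e: (a, o) for e, a, o in zip(E_ROW, A_ROW, O_ROW)}


def candidates(word):
    """Return possible standard forms for a word ending in <e-row kana> + え."""
    if len(word) < 2 or word[-1] != "え":
        return []
    sources = E_TO_SOURCES.get(word[-2])
    if sources is None:
        return []
    stem = word[:-2]
    # Try the あい reading first, then the おい reading.
    return [stem + kana + "い" for kana in sources]


def lookup(word, dictionary):
    """Keep only the candidates that exist in the given set of headwords."""
    return [c for c in candidates(word) if c in dictionary]
```

For example, `lookup("すげえ", {"すごい", "うるさい"})` returns `["すごい"]`. Forms such as しらねえ map back to しらない, which is itself a conjugation, so in practice the result would still need to pass through the usual deinflection step before the dictionary check.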

Two characters that make good exemplars of this are Inuyasha and Son Goku. It certainly seems to be a 少年 thing.

On an unrelated note, I see that you added that bit of code to prevent that font issue from happening again. That's awesome, and I really admire your continuing dedication to the project.

ChocoChopin • Sep 03 '20 12:09

Thanks for the extra context. You're right that it needs to be generic; I was blanking on more examples, but I can definitely think of some.

Adding these directly isn't too bad, but since you actually have to add a mapping for each hiragana ending in あ or お, the more sustainable approach would be to set up a script that generates deinflect.dat from higher-level rules. That will ensure fewer mistakes when updating as well.
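As a sketch of what such a generator could look like (Python here purely for illustration; the tab-separated layout, the "colloquial" label, and the function names are assumptions, since the real deinflect.dat format isn't described in this thread):

```python
# Hypothetical kana tables: e-row kana paired with the a-row and o-row kana
# of the same consonant. Extend these to cover more endings.
E_ROW = "えけげせぜてでねへべぺめれ"
A_ROW = "あかがさざただなはばぱまら"
O_ROW = "おこごそぞとどのほぼぽもろ"


def generate_rules(label="colloquial"):
    """Expand the one high-level rule into explicit (from, to, label) rows."""
    rules = []
    for e, a, o in zip(E_ROW, A_ROW, O_ROW):
        for source in (a, o):
            # e.g. ("ねえ", "ない", "colloquial")
            rules.append((e + "え", source + "い", label))
    return rules


def format_rules(rules):
    """Render the rows as tab-separated lines (assumed layout, not the real one)."""
    return "\n".join("\t".join(rule) for rule in rules)


if __name__ == "__main__":
    print(format_rules(generate_rules()))
```

With the 13 kana listed, this produces 26 rows, two per e-row kana, so a new pattern means editing one table rather than adding each mapping by hand.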

I'll make a separate issue for that. Thanks again for the feedback.

melink14 • Sep 04 '20 00:09