
Use some phonetic reference to detect vowels

Open ForNeVeR opened this issue 10 years ago • 7 comments

There's a vowel detector in MucMessageHandler used for the nick replacement. I've hardcoded lists of vowels from the languages I know, but that's not enough. I think we should improve it to use some generic vowel dictionary or something like this.

ForNeVeR, Sep 21 '15 15:09

You can find this list here: https://github.com/codingteam/horta-hell/blob/2cc2a0347fb2c31cf004e86b62cfc2ef324a35aa/src/main/scala/ru/org/codingteam/horta/protocol/jabber/MucMessageHandler.scala#L171

rexim, Sep 21 '15 15:09

So, I was, uh, browsing issues in horta and simply could not walk by this one. Sadly, one cannot simply create a comprehensive vowel set without making tough decisions (some of them political), because the very definition of what is a vowel and what is a consonant depends on pronunciation norms rather than strict rules. For example, in English y is a consonant -- but not in Spanish (most of the time). Й is a vowel for me, but it is a consonant for all of the people not ~~blessed~~ cursed with a Uralic accent. W can be a vowel depending on its position in a word and the language the word is from. So there's really no 'proper' way to detect vowels short of letting a corps of linguists loose on your corpus. Whatever works, works.

tl;dr: you can scrounge http://unicode.org/Public/UNIDATA/NamesList.txt for the more obscure vowels, but it will break anyway when I use Futhark or Glagolitic script in my nickname again.
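A rough sketch of that Unicode-names idea, in Python rather than horta's Scala. It assumes Python's standard `unicodedata` module, which exposes the same character names as the Unicode data files; `VOWEL_LETTER_NAMES`, `letter_name` and `is_vowel` are hypothetical names, and the set of letter names is hand-picked for illustration, not a standard. It folds accented variants onto their base letter name, and it still knows nothing about scripts whose names are not in the set.

```python
import unicodedata

# Hypothetical, hand-picked letter names treated as vowels.
# This is an assumption for illustration, not a linguistic standard.
VOWEL_LETTER_NAMES = {
    "A", "E", "I", "O", "U",           # Latin, plus Cyrillic а, э, и, о, у
    "IE", "IO", "YU", "YA", "YERU",    # Cyrillic е, ё, ю, я, ы
    "ALPHA", "EPSILON", "ETA", "IOTA",
    "OMICRON", "UPSILON", "OMEGA",     # Greek
}


def letter_name(ch):
    """Return the bare letter name from the Unicode character name, or None.

    "LATIN SMALL LETTER E WITH ACUTE" -> "E"
    "CYRILLIC SMALL LETTER SHORT I"   -> "SHORT I"
    """
    name = unicodedata.name(ch, "")
    if " LETTER " not in name:
        return None
    tail = name.split(" LETTER ", 1)[1]
    # Drop diacritic descriptions such as "WITH ACUTE" or "WITH TONOS".
    return tail.split(" WITH ", 1)[0]


def is_vowel(ch):
    """True if the character's letter name is in the hand-picked set."""
    return letter_name(ch) in VOWEL_LETTER_NAMES
```

With this set, `is_vowel("é")` and `is_vowel("я")` are true, while `is_vowel("й")` and any runic letter come out false, which is exactly the kind of decision the comment above calls political.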

hagane, Mar 06 '16 03:03

If only there were a phonetic translator (from a Unicode string to IPA, for example) for more or less common languages - that'd be enough for me. I've tried to find one, but found only some scripts for obscure languages like this, and no general solution. We need to look for other ideas.

Also, I wouldn't care if horta's decisions contradicted some accented speakers' conceptions of what's a vowel and what isn't. That is horta's accent and that's horta's decision.

ForNeVeR, Mar 06 '16 04:03

I've come up with another plan. We could use something like Unidecode (I hope there is a Java port; I've seen Python and .NET variants in the wild, so we can wrap them into some Grasshopper / Jython sheet if nothing else is possible). It's documented as "quick and dirty", and that's exactly what we need. We could then check whether its output for every letter contains any English vowels, and that's it.

ForNeVeR, Mar 06 '16 04:03

That's a pretty nice idea, except that it tries to preserve the graphical representation instead of the phonetic one, which, sadly, makes it unfeasible. Anyway, the point I tried to communicate is that the very nature of the vowel/consonant dichotomy depends on the pronunciation of the phoneme, not on the shape or class of the glyph representing it, which is what most encodings (Unicode notwithstanding) are concerned with. So, as I said before, whatever works, works well enough. If you are concerned about disturbing the peace of users with obscure scripts in their nicknames (and I'm not saying you shouldn't be), then maybe horta should replace the first known vowel or the third (alternatively, second-to-last) letter in a name, whichever comes first.

hagane, Mar 06 '16 04:03

That's the whole concept of "known vowels" I was trying to avoid (or rather, I've tried to get it solved by some third party), but now I am not sure. Thanks for the discussion that has given me that doubt.

We have already implemented something like the algorithm you mentioned, although we replace the second character if no known vowels are found, without too much fuss about "third or second to last". That's because if we haven't found anything of interest, we can safely say that the user's nickname is pretty much fucked up and we probably won't be able to make it up properly anyway.
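A minimal sketch of that fallback, in Python rather than horta's Scala, and not horta's actual code. The names `KNOWN_VOWELS`, `replacement_index` and `replace_char` are hypothetical, the vowel list is a short stand-in for the hardcoded lists, and picking the first known vowel and leaving one-character nicknames untouched are assumptions.

```python
# Hypothetical short stand-in for the hardcoded vowel lists.
KNOWN_VOWELS = set("aeiouаеёиоуыэюя")


def replacement_index(nick):
    """Index of the first known vowel, else 1 (the second character).

    Returns None for nicknames shorter than two characters with no known
    vowel; how such nicknames should be handled is an assumption here.
    """
    for i, ch in enumerate(nick):
        if ch.lower() in KNOWN_VOWELS:
            return i
    return 1 if len(nick) > 1 else None


def replace_char(nick, replacement):
    """Replace the chosen character of the nickname with `replacement`."""
    i = replacement_index(nick)
    if i is None:
        return nick
    return nick[:i] + replacement + nick[i + 1:]
```

For example, `replace_char("Christ", "*")` hits the first known vowel and gives `"Chr*st"`, while a runic nickname with no known vowels gets its second character replaced.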

ForNeVeR, Mar 06 '16 05:03

Great minds think alike then. I do not see any necessity in further inquiry into this matter beyond a sudden urge to play with linguistics and heuristics (which is a perfectly valid reason, mind you).

hagane, Mar 06 '16 05:03