lingua-py

Single word greeting detection issue

Open AlexUmnov opened this issue 1 year ago • 3 comments

Based on the reported graphs, I was expecting high single-word detection accuracy. However, when I tested some simple greetings, the results were quite poor. I might have done something wrong, so let me know if that's the case; otherwise this may indeed be a bug.

>>> from lingua import LanguageDetectorBuilder
>>> lingua_detector_with_high_accuracy = (
...     LanguageDetectorBuilder.from_all_languages()
...     .with_preloaded_language_models()
...     .build()
... )
>>> lingua_detector_with_high_accuracy.detect_languages_in_parallel_of(["Hi", "Hello", "Hoi", "Bonjour", "Hola"])
[Language.MAORI, Language.SOTHO, Language.IRISH, Language.FRENCH, Language.SOTHO]

I was expecting [English, English, Dutch (although questionable), French, Spanish].

And if I look at the list of confidence values, the correct answer is not even close to the top:

>>> lingua_detector_with_high_accuracy.compute_language_confidence_values("Hi")
[ConfidenceValue(language=Language.MAORI, value=0.06075102321391407), ConfidenceValue(language=Language.TSONGA, value=0.05757593734895124), ConfidenceValue(language=Language.SWAHILI, value=0.0540818929944233), ConfidenceValue(language=Language.ZULU, value=0.049523395411588976), ConfidenceValue(language=Language.SHONA, value=0.04702462087367202), ConfidenceValue(language=Language.XHOSA, value=0.036701774614104545), ConfidenceValue(language=Language.VIETNAMESE, value=0.036454756688253924), ConfidenceValue(language=Language.TAGALOG, value=0.03404539900279216), ConfidenceValue(language=Language.SOMALI, value=0.03317725625173807), ConfidenceValue(language=Language.ENGLISH, value=0.030830400551457332), ConfidenceValue(language=Language.BASQUE, value=0.026533171189533463), ConfidenceValue(language=Language.LATIN, value=0.026244508481785906), ConfidenceValue(language=Language.ALBANIAN, value=0.02511723636400765), ConfidenceValue(language=Language.ITALIAN, value=0.024455484640574215), ConfidenceValue(language=Language.IRISH, value=0.023256241641364344), ConfidenceValue(language=Language.ESTONIAN, value=0.022259562180598012), ConfidenceValue(language=Language.ROMANIAN, value=0.020909217657987457), ConfidenceValue(language=Language.FINNISH, value=0.020367779172509735), ConfidenceValue(language=Language.WELSH, value=0.02029625722550339), ConfidenceValue(language=Language.GERMAN, value=0.019098590424106467), ConfidenceValue(language=Language.MALAY, value=0.01907087564683017), ConfidenceValue(language=Language.DUTCH, value=0.018458932009078492), ConfidenceValue(language=Language.TURKISH, value=0.018417733731387786), ConfidenceValue(language=Language.INDONESIAN, value=0.017490145400531204), ConfidenceValue(language=Language.SOTHO, value=0.016289776285211752), ConfidenceValue(language=Language.CATALAN, value=0.015694285948878523), ConfidenceValue(language=Language.AZERBAIJANI, value=0.015637273956743414), ConfidenceValue(language=Language.AFRIKAANS, value=0.013931939972393599), 
ConfidenceValue(language=Language.ESPERANTO, value=0.013415395979981521), ConfidenceValue(language=Language.YORUBA, value=0.013067266302647642), ConfidenceValue(language=Language.FRENCH, value=0.012468853625883245), ConfidenceValue(language=Language.TSWANA, value=0.012453459137669659), ConfidenceValue(language=Language.ICELANDIC, value=0.011470953898403368), ConfidenceValue(language=Language.SPANISH, value=0.011448091245211359), ConfidenceValue(language=Language.BOSNIAN, value=0.011265460852164393), ConfidenceValue(language=Language.SLOVENE, value=0.010800718904958878), ConfidenceValue(language=Language.HUNGARIAN, value=0.009424118815082561), ConfidenceValue(language=Language.POLISH, value=0.009285432919004301), ConfidenceValue(language=Language.DANISH, value=0.009279214552941634), ConfidenceValue(language=Language.CROATIAN, value=0.008879011852204329), ConfidenceValue(language=Language.PORTUGUESE, value=0.008803596143923034), ConfidenceValue(language=Language.SWEDISH, value=0.008439835711570882), ConfidenceValue(language=Language.NYNORSK, value=0.00801218793637737), ConfidenceValue(language=Language.BOKMAL, value=0.00760286119449106), ConfidenceValue(language=Language.LITHUANIAN, value=0.006895045298797671), ConfidenceValue(language=Language.SLOVAK, value=0.0063550870869929655), ConfidenceValue(language=Language.CZECH, value=0.0061009394153884525), ConfidenceValue(language=Language.GANDA, value=0.005910373270141214), ConfidenceValue(language=Language.LATVIAN, value=0.004926626976243253), ConfidenceValue(language=Language.ARABIC, value=0), ConfidenceValue(language=Language.ARMENIAN, value=0), ConfidenceValue(language=Language.BELARUSIAN, value=0), ConfidenceValue(language=Language.BENGALI, value=0), ConfidenceValue(language=Language.BULGARIAN, value=0), ConfidenceValue(language=Language.CHINESE, value=0), ConfidenceValue(language=Language.GEORGIAN, value=0), ConfidenceValue(language=Language.GREEK, value=0), ConfidenceValue(language=Language.GUJARATI, value=0), 
ConfidenceValue(language=Language.HEBREW, value=0), ConfidenceValue(language=Language.HINDI, value=0), ConfidenceValue(language=Language.JAPANESE, value=0), ConfidenceValue(language=Language.KAZAKH, value=0), ConfidenceValue(language=Language.KOREAN, value=0), ConfidenceValue(language=Language.MACEDONIAN, value=0), ConfidenceValue(language=Language.MARATHI, value=0), ConfidenceValue(language=Language.MONGOLIAN, value=0), ConfidenceValue(language=Language.PERSIAN, value=0), ConfidenceValue(language=Language.PUNJABI, value=0), ConfidenceValue(language=Language.RUSSIAN, value=0), ConfidenceValue(language=Language.SERBIAN, value=0), ConfidenceValue(language=Language.TAMIL, value=0), ConfidenceValue(language=Language.TELUGU, value=0), ConfidenceValue(language=Language.THAI, value=0), ConfidenceValue(language=Language.UKRAINIAN, value=0), ConfidenceValue(language=Language.URDU, value=0)]
>>> lingua_detector_with_high_accuracy.compute_language_confidence_values("Hello")
[ConfidenceValue(language=Language.SOTHO, value=0.173325243584432), ConfidenceValue(language=Language.ITALIAN, value=0.09272219162074988), ConfidenceValue(language=Language.WELSH, value=0.06825833486393992), ConfidenceValue(language=Language.SPANISH, value=0.06677864210729946), ConfidenceValue(language=Language.ALBANIAN, value=0.04990207735827293), ConfidenceValue(language=Language.ENGLISH, value=0.041839813524991075), ConfidenceValue(language=Language.TAGALOG, value=0.03870323702818158), ConfidenceValue(language=Language.NYNORSK, value=0.034029542384151165), ConfidenceValue(language=Language.LATIN, value=0.03129121426048623), ConfidenceValue(language=Language.FINNISH, value=0.029898688702490947), ConfidenceValue(language=Language.BOKMAL, value=0.02726473404767691), ConfidenceValue(language=Language.YORUBA, value=0.02707114072749414), ConfidenceValue(language=Language.TSWANA, value=0.02466907769307024), ConfidenceValue(language=Language.ESPERANTO, value=0.021221681760239662), ConfidenceValue(language=Language.CATALAN, value=0.017699117373921606), ConfidenceValue(language=Language.FRENCH, value=0.017589289608133957), ConfidenceValue(language=Language.PORTUGUESE, value=0.015427409893015501), ConfidenceValue(language=Language.SOMALI, value=0.013673047038079464), ConfidenceValue(language=Language.SLOVAK, value=0.012188085792055086), ConfidenceValue(language=Language.SLOVENE, value=0.011811045889648599), ConfidenceValue(language=Language.HUNGARIAN, value=0.011520976616457797), ConfidenceValue(language=Language.DUTCH, value=0.011398474543927499), ConfidenceValue(language=Language.TSONGA, value=0.011278650365814014), ConfidenceValue(language=Language.CROATIAN, value=0.010052136489811837), ConfidenceValue(language=Language.GERMAN, value=0.009947842382616414), ConfidenceValue(language=Language.POLISH, value=0.009683463831234633), ConfidenceValue(language=Language.CZECH, value=0.00966648103002121), ConfidenceValue(language=Language.SWEDISH, value=0.009333296758361522), 
ConfidenceValue(language=Language.BASQUE, value=0.008657878718761784), ConfidenceValue(language=Language.ZULU, value=0.007878128985814107), ConfidenceValue(language=Language.DANISH, value=0.007832903262813228), ConfidenceValue(language=Language.AFRIKAANS, value=0.007717663405852523), ConfidenceValue(language=Language.ESTONIAN, value=0.006273352623560222), ConfidenceValue(language=Language.MAORI, value=0.006016957945887504), ConfidenceValue(language=Language.ROMANIAN, value=0.005807591485948007), ConfidenceValue(language=Language.VIETNAMESE, value=0.005382639283154292), ConfidenceValue(language=Language.GANDA, value=0.004951897453654046), ConfidenceValue(language=Language.IRISH, value=0.00493023627782643), ConfidenceValue(language=Language.INDONESIAN, value=0.004615785769766443), ConfidenceValue(language=Language.BOSNIAN, value=0.004381433147220411), ConfidenceValue(language=Language.SHONA, value=0.0040097769074464665), ConfidenceValue(language=Language.ICELANDIC, value=0.003966981464812075), ConfidenceValue(language=Language.LITHUANIAN, value=0.0037890363336309974), ConfidenceValue(language=Language.MALAY, value=0.0036367685783141434), ConfidenceValue(language=Language.LATVIAN, value=0.0028535014062182856), ConfidenceValue(language=Language.XHOSA, value=0.002828662384197751), ConfidenceValue(language=Language.TURKISH, value=0.0027082980514884845), ConfidenceValue(language=Language.SWAHILI, value=0.0025511928029045322), ConfidenceValue(language=Language.AZERBAIJANI, value=0.0009643764341532832), ConfidenceValue(language=Language.ARABIC, value=0), ConfidenceValue(language=Language.ARMENIAN, value=0), ConfidenceValue(language=Language.BELARUSIAN, value=0), ConfidenceValue(language=Language.BENGALI, value=0), ConfidenceValue(language=Language.BULGARIAN, value=0), ConfidenceValue(language=Language.CHINESE, value=0), ConfidenceValue(language=Language.GEORGIAN, value=0), ConfidenceValue(language=Language.GREEK, value=0), ConfidenceValue(language=Language.GUJARATI, 
value=0), ConfidenceValue(language=Language.HEBREW, value=0), ConfidenceValue(language=Language.HINDI, value=0), ConfidenceValue(language=Language.JAPANESE, value=0), ConfidenceValue(language=Language.KAZAKH, value=0), ConfidenceValue(language=Language.KOREAN, value=0), ConfidenceValue(language=Language.MACEDONIAN, value=0), ConfidenceValue(language=Language.MARATHI, value=0), ConfidenceValue(language=Language.MONGOLIAN, value=0), ConfidenceValue(language=Language.PERSIAN, value=0), ConfidenceValue(language=Language.PUNJABI, value=0), ConfidenceValue(language=Language.RUSSIAN, value=0), ConfidenceValue(language=Language.SERBIAN, value=0), ConfidenceValue(language=Language.TAMIL, value=0), ConfidenceValue(language=Language.TELUGU, value=0), ConfidenceValue(language=Language.THAI, value=0), ConfidenceValue(language=Language.UKRAINIAN, value=0), ConfidenceValue(language=Language.URDU, value=0)]

I think this is important because it makes the detector impossible to use in a multilingual chatbot scenario, where you have to determine the language at the beginning of the chat and change behaviour depending on that detection (e.g. report the language as supported or unsupported, change the available intents, etc.).
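A possible stopgap for the chatbot case, sketched under assumptions: treat a short first message as undecided unless the top confidence value clears a threshold and leads the runner-up by a margin. `choose_language`, `min_confidence`, and `min_margin` are hypothetical names and values, not part of lingua; the input is a list of `(language, value)` pairs such as one built from `compute_language_confidence_values`.

```python
from typing import Optional, Sequence, Tuple


def choose_language(
    confidences: Sequence[Tuple[str, float]],
    min_confidence: float = 0.5,  # hypothetical threshold, tune for your data
    min_margin: float = 0.2,      # hypothetical lead required over the runner-up
) -> Optional[str]:
    """Return the top language only if it is a clear winner, else None."""
    ranked = sorted(confidences, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    top_language, top_value = ranked[0]
    runner_up_value = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_value < min_confidence or top_value - runner_up_value < min_margin:
        return None  # undecided: wait for a longer message or ask the user
    return top_language


# Top three values for "Hi" from the output above, rounded.
hi = [("MAORI", 0.0608), ("TSONGA", 0.0576), ("SWAHILI", 0.0541)]
print(choose_language(hi))
```

With the values above the function returns `None`, so the bot would wait for more text instead of committing to a language. lingua's builder also offers `with_minimum_relative_distance`, which makes `detect_language_of` return `None` when the top candidates are too close.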

AlexUmnov avatar Mar 22 '24 10:03 AlexUmnov

I know about this problem. The current purely statistical approach does not produce good results for such short words. My plan is to include word lists for each language which contain greetings, among other things. Greetings such as "Hi", however, are surely used in a lot of languages. Even if the library classifies it as English, that will not necessarily indicate an English-speaking customer in your chat. Please keep this in mind.
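A minimal sketch of the word-list idea, to make it concrete. This is not the planned implementation: the `GREETINGS` table, its entries, and `detect_with_word_list` are hypothetical and hand-written for illustration. Unambiguous entries are answered from the table; everything else, including greetings shared by several languages, falls through to the statistical detector passed in as `fallback`.

```python
from typing import Callable, Dict, FrozenSet, Optional

# Hypothetical, illustrative table; nothing like this ships with lingua today.
GREETINGS: Dict[str, FrozenSet[str]] = {
    "hello": frozenset({"ENGLISH"}),
    "bonjour": frozenset({"FRENCH"}),
    "hola": frozenset({"SPANISH"}),
    "hoi": frozenset({"DUTCH"}),
    # Marked as shared between languages, so it is deliberately left ambiguous.
    "hi": frozenset({"ENGLISH", "DUTCH"}),
}


def detect_with_word_list(
    text: str,
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Look the text up in the word list; defer to `fallback` when unsure."""
    key = text.strip().strip("!.,?").lower()
    candidates = GREETINGS.get(key)
    if candidates is not None and len(candidates) == 1:
        return next(iter(candidates))
    # Unknown or ambiguous word: use the statistical detector instead.
    return fallback(text)
```

In practice `fallback` could wrap `detect_language_of` from an existing detector and convert its result to the same label type as the table.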

pemistahl avatar Mar 25 '24 09:03 pemistahl

Thanks @pemistahl. If you have a more or less concrete plan, perhaps I can contribute to this change? Also, I think that in the case of "Hi" it's important to take into account a prior distribution of languages, i.e. how many people speak English in general or, in my particular case, the ratio of customers. I don't know if you would want to put this in the library, but perhaps I can renormalize the confidence values myself, on my end.
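Renormalizing on the caller's side could look roughly like this. It is a sketch under assumptions: it treats the confidence values as if they were likelihoods under a uniform prior, which may not match how the library computes them, and `apply_prior` and the customer-mix numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def apply_prior(
    confidences: Dict[str, float],
    prior: Dict[str, float],
    default_prior: float = 0.0,  # weight for languages missing from `prior`
) -> List[Tuple[str, float]]:
    """Multiply each confidence by its prior weight and renormalize to sum to 1."""
    weighted = {
        language: value * prior.get(language, default_prior)
        for language, value in confidences.items()
    }
    total = sum(weighted.values())
    if total == 0.0:
        # The prior excludes every candidate: fall back to the original ranking.
        return sorted(confidences.items(), key=lambda pair: pair[1], reverse=True)
    return sorted(
        ((language, weight / total) for language, weight in weighted.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Selected values for "Hi" from the output above, rounded.
hi = {"MAORI": 0.0608, "ENGLISH": 0.0308, "DUTCH": 0.0185,
      "FRENCH": 0.0125, "SPANISH": 0.0114}
# Hypothetical customer mix, not real data.
customer_mix = {"ENGLISH": 0.6, "FRENCH": 0.15, "SPANISH": 0.15, "DUTCH": 0.1}

ranked = apply_prior(hi, customer_mix)
print(ranked[0][0])
```

With this particular mix English moves to the top because Maori receives zero prior weight. A softer `default_prior` keeps unsupported languages in play instead of ruling them out.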

AlexUmnov avatar Mar 25 '24 10:03 AlexUmnov

@pemistahl I don't know if this is related, but we process both short and long texts, and with short texts we observed weird behaviour with some language combinations. I don't recall the exact sentence we stumbled upon, but it behaved like this:

  • the input is "Bonjour mesdames et messieurs" or a similarly short, pretty obvious sentence
  • LanguageDetectorBuilder.from_all_languages() and from_all_spoken_languages() identify French correctly
  • if the choice is restricted to (Italian, French, English), French is identified correctly
  • in all the above cases the probabilities are correct, and for most sets of language options they don't change meaningfully
  • BUT if a particular language is added to the set, for example German, either the detection is steered towards a plainly wrong choice (German or English in this case) or all four probabilities become very close to zero
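A reproduction along these lines might help narrow it down. This is only a sketch: `confidence_table` and `build_detector` are hypothetical helpers, and the language sets in the commented usage mirror the description above, with German as the added fourth language, rather than the exact case we hit.

```python
from typing import Any, Dict


def confidence_table(detector: Any, text: str) -> Dict[Any, float]:
    """Map each language to its confidence value for `text`."""
    return {
        cv.language: cv.value
        for cv in detector.compute_language_confidence_values(text)
    }


def build_detector(*language_names: str) -> Any:
    """Build a detector restricted to the named languages, e.g. "FRENCH"."""
    # Imported lazily so the helper above can be used without lingua installed.
    from lingua import Language, LanguageDetectorBuilder

    languages = [getattr(Language, name) for name in language_names]
    return LanguageDetectorBuilder.from_languages(*languages).build()


# Usage (requires lingua to be installed):
#
#   text = "Bonjour mesdames et messieurs"
#   base = ("ITALIAN", "FRENCH", "ENGLISH")
#   print(confidence_table(build_detector(*base), text))
#   print(confidence_table(build_detector(*base, "GERMAN"), text))
```

Printing both tables side by side shows whether adding the fourth language only rescales the values or actually changes the ranking.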

x86Gr avatar Apr 18 '24 15:04 x86Gr