lingua-py

Single word greeting detection issue

Open AlexUmnov opened this issue 3 months ago • 3 comments

Based on the reported graphs, I was expecting high single-word detection accuracy. However, when I tested some simple greetings, the results were quite poor. I may have done something wrong, so let me know if that's the case; otherwise this may indeed be a bug.

>>> from lingua import LanguageDetectorBuilder
>>> lingua_detector_with_high_accuracy = (
...     LanguageDetectorBuilder.from_all_languages()
...     .with_preloaded_language_models()
...     .build()
... )
>>> lingua_detector_with_high_accuracy.detect_languages_in_parallel_of(["Hi", "Hello", "Hoi", "Bonjour", "Hola"])
[Language.MAORI, Language.SOTHO, Language.IRISH, Language.FRENCH, Language.SOTHO]

I was expecting [English, English, Dutch (although questionable), French, Spanish].

And if I look at the list of confidence values, the correct answer is not even close to the top: English ranks 10th for "Hi" and 6th for "Hello".

>>> lingua_detector_with_high_accuracy.compute_language_confidence_values("Hi")
[ConfidenceValue(language=Language.MAORI, value=0.06075102321391407), ConfidenceValue(language=Language.TSONGA, value=0.05757593734895124), ConfidenceValue(language=Language.SWAHILI, value=0.0540818929944233), ConfidenceValue(language=Language.ZULU, value=0.049523395411588976), ConfidenceValue(language=Language.SHONA, value=0.04702462087367202), ConfidenceValue(language=Language.XHOSA, value=0.036701774614104545), ConfidenceValue(language=Language.VIETNAMESE, value=0.036454756688253924), ConfidenceValue(language=Language.TAGALOG, value=0.03404539900279216), ConfidenceValue(language=Language.SOMALI, value=0.03317725625173807), ConfidenceValue(language=Language.ENGLISH, value=0.030830400551457332), ConfidenceValue(language=Language.BASQUE, value=0.026533171189533463), ConfidenceValue(language=Language.LATIN, value=0.026244508481785906), ConfidenceValue(language=Language.ALBANIAN, value=0.02511723636400765), ConfidenceValue(language=Language.ITALIAN, value=0.024455484640574215), ConfidenceValue(language=Language.IRISH, value=0.023256241641364344), ConfidenceValue(language=Language.ESTONIAN, value=0.022259562180598012), ConfidenceValue(language=Language.ROMANIAN, value=0.020909217657987457), ConfidenceValue(language=Language.FINNISH, value=0.020367779172509735), ConfidenceValue(language=Language.WELSH, value=0.02029625722550339), ConfidenceValue(language=Language.GERMAN, value=0.019098590424106467), ConfidenceValue(language=Language.MALAY, value=0.01907087564683017), ConfidenceValue(language=Language.DUTCH, value=0.018458932009078492), ConfidenceValue(language=Language.TURKISH, value=0.018417733731387786), ConfidenceValue(language=Language.INDONESIAN, value=0.017490145400531204), ConfidenceValue(language=Language.SOTHO, value=0.016289776285211752), ConfidenceValue(language=Language.CATALAN, value=0.015694285948878523), ConfidenceValue(language=Language.AZERBAIJANI, value=0.015637273956743414), ConfidenceValue(language=Language.AFRIKAANS, value=0.013931939972393599), 
ConfidenceValue(language=Language.ESPERANTO, value=0.013415395979981521), ConfidenceValue(language=Language.YORUBA, value=0.013067266302647642), ConfidenceValue(language=Language.FRENCH, value=0.012468853625883245), ConfidenceValue(language=Language.TSWANA, value=0.012453459137669659), ConfidenceValue(language=Language.ICELANDIC, value=0.011470953898403368), ConfidenceValue(language=Language.SPANISH, value=0.011448091245211359), ConfidenceValue(language=Language.BOSNIAN, value=0.011265460852164393), ConfidenceValue(language=Language.SLOVENE, value=0.010800718904958878), ConfidenceValue(language=Language.HUNGARIAN, value=0.009424118815082561), ConfidenceValue(language=Language.POLISH, value=0.009285432919004301), ConfidenceValue(language=Language.DANISH, value=0.009279214552941634), ConfidenceValue(language=Language.CROATIAN, value=0.008879011852204329), ConfidenceValue(language=Language.PORTUGUESE, value=0.008803596143923034), ConfidenceValue(language=Language.SWEDISH, value=0.008439835711570882), ConfidenceValue(language=Language.NYNORSK, value=0.00801218793637737), ConfidenceValue(language=Language.BOKMAL, value=0.00760286119449106), ConfidenceValue(language=Language.LITHUANIAN, value=0.006895045298797671), ConfidenceValue(language=Language.SLOVAK, value=0.0063550870869929655), ConfidenceValue(language=Language.CZECH, value=0.0061009394153884525), ConfidenceValue(language=Language.GANDA, value=0.005910373270141214), ConfidenceValue(language=Language.LATVIAN, value=0.004926626976243253), ConfidenceValue(language=Language.ARABIC, value=0), ConfidenceValue(language=Language.ARMENIAN, value=0), ConfidenceValue(language=Language.BELARUSIAN, value=0), ConfidenceValue(language=Language.BENGALI, value=0), ConfidenceValue(language=Language.BULGARIAN, value=0), ConfidenceValue(language=Language.CHINESE, value=0), ConfidenceValue(language=Language.GEORGIAN, value=0), ConfidenceValue(language=Language.GREEK, value=0), ConfidenceValue(language=Language.GUJARATI, value=0), 
ConfidenceValue(language=Language.HEBREW, value=0), ConfidenceValue(language=Language.HINDI, value=0), ConfidenceValue(language=Language.JAPANESE, value=0), ConfidenceValue(language=Language.KAZAKH, value=0), ConfidenceValue(language=Language.KOREAN, value=0), ConfidenceValue(language=Language.MACEDONIAN, value=0), ConfidenceValue(language=Language.MARATHI, value=0), ConfidenceValue(language=Language.MONGOLIAN, value=0), ConfidenceValue(language=Language.PERSIAN, value=0), ConfidenceValue(language=Language.PUNJABI, value=0), ConfidenceValue(language=Language.RUSSIAN, value=0), ConfidenceValue(language=Language.SERBIAN, value=0), ConfidenceValue(language=Language.TAMIL, value=0), ConfidenceValue(language=Language.TELUGU, value=0), ConfidenceValue(language=Language.THAI, value=0), ConfidenceValue(language=Language.UKRAINIAN, value=0), ConfidenceValue(language=Language.URDU, value=0)]
>>> lingua_detector_with_high_accuracy.compute_language_confidence_values("Hello")
[ConfidenceValue(language=Language.SOTHO, value=0.173325243584432), ConfidenceValue(language=Language.ITALIAN, value=0.09272219162074988), ConfidenceValue(language=Language.WELSH, value=0.06825833486393992), ConfidenceValue(language=Language.SPANISH, value=0.06677864210729946), ConfidenceValue(language=Language.ALBANIAN, value=0.04990207735827293), ConfidenceValue(language=Language.ENGLISH, value=0.041839813524991075), ConfidenceValue(language=Language.TAGALOG, value=0.03870323702818158), ConfidenceValue(language=Language.NYNORSK, value=0.034029542384151165), ConfidenceValue(language=Language.LATIN, value=0.03129121426048623), ConfidenceValue(language=Language.FINNISH, value=0.029898688702490947), ConfidenceValue(language=Language.BOKMAL, value=0.02726473404767691), ConfidenceValue(language=Language.YORUBA, value=0.02707114072749414), ConfidenceValue(language=Language.TSWANA, value=0.02466907769307024), ConfidenceValue(language=Language.ESPERANTO, value=0.021221681760239662), ConfidenceValue(language=Language.CATALAN, value=0.017699117373921606), ConfidenceValue(language=Language.FRENCH, value=0.017589289608133957), ConfidenceValue(language=Language.PORTUGUESE, value=0.015427409893015501), ConfidenceValue(language=Language.SOMALI, value=0.013673047038079464), ConfidenceValue(language=Language.SLOVAK, value=0.012188085792055086), ConfidenceValue(language=Language.SLOVENE, value=0.011811045889648599), ConfidenceValue(language=Language.HUNGARIAN, value=0.011520976616457797), ConfidenceValue(language=Language.DUTCH, value=0.011398474543927499), ConfidenceValue(language=Language.TSONGA, value=0.011278650365814014), ConfidenceValue(language=Language.CROATIAN, value=0.010052136489811837), ConfidenceValue(language=Language.GERMAN, value=0.009947842382616414), ConfidenceValue(language=Language.POLISH, value=0.009683463831234633), ConfidenceValue(language=Language.CZECH, value=0.00966648103002121), ConfidenceValue(language=Language.SWEDISH, value=0.009333296758361522), 
ConfidenceValue(language=Language.BASQUE, value=0.008657878718761784), ConfidenceValue(language=Language.ZULU, value=0.007878128985814107), ConfidenceValue(language=Language.DANISH, value=0.007832903262813228), ConfidenceValue(language=Language.AFRIKAANS, value=0.007717663405852523), ConfidenceValue(language=Language.ESTONIAN, value=0.006273352623560222), ConfidenceValue(language=Language.MAORI, value=0.006016957945887504), ConfidenceValue(language=Language.ROMANIAN, value=0.005807591485948007), ConfidenceValue(language=Language.VIETNAMESE, value=0.005382639283154292), ConfidenceValue(language=Language.GANDA, value=0.004951897453654046), ConfidenceValue(language=Language.IRISH, value=0.00493023627782643), ConfidenceValue(language=Language.INDONESIAN, value=0.004615785769766443), ConfidenceValue(language=Language.BOSNIAN, value=0.004381433147220411), ConfidenceValue(language=Language.SHONA, value=0.0040097769074464665), ConfidenceValue(language=Language.ICELANDIC, value=0.003966981464812075), ConfidenceValue(language=Language.LITHUANIAN, value=0.0037890363336309974), ConfidenceValue(language=Language.MALAY, value=0.0036367685783141434), ConfidenceValue(language=Language.LATVIAN, value=0.0028535014062182856), ConfidenceValue(language=Language.XHOSA, value=0.002828662384197751), ConfidenceValue(language=Language.TURKISH, value=0.0027082980514884845), ConfidenceValue(language=Language.SWAHILI, value=0.0025511928029045322), ConfidenceValue(language=Language.AZERBAIJANI, value=0.0009643764341532832), ConfidenceValue(language=Language.ARABIC, value=0), ConfidenceValue(language=Language.ARMENIAN, value=0), ConfidenceValue(language=Language.BELARUSIAN, value=0), ConfidenceValue(language=Language.BENGALI, value=0), ConfidenceValue(language=Language.BULGARIAN, value=0), ConfidenceValue(language=Language.CHINESE, value=0), ConfidenceValue(language=Language.GEORGIAN, value=0), ConfidenceValue(language=Language.GREEK, value=0), ConfidenceValue(language=Language.GUJARATI, 
value=0), ConfidenceValue(language=Language.HEBREW, value=0), ConfidenceValue(language=Language.HINDI, value=0), ConfidenceValue(language=Language.JAPANESE, value=0), ConfidenceValue(language=Language.KAZAKH, value=0), ConfidenceValue(language=Language.KOREAN, value=0), ConfidenceValue(language=Language.MACEDONIAN, value=0), ConfidenceValue(language=Language.MARATHI, value=0), ConfidenceValue(language=Language.MONGOLIAN, value=0), ConfidenceValue(language=Language.PERSIAN, value=0), ConfidenceValue(language=Language.PUNJABI, value=0), ConfidenceValue(language=Language.RUSSIAN, value=0), ConfidenceValue(language=Language.SERBIAN, value=0), ConfidenceValue(language=Language.TAMIL, value=0), ConfidenceValue(language=Language.TELUGU, value=0), ConfidenceValue(language=Language.THAI, value=0), ConfidenceValue(language=Language.UKRAINIAN, value=0), ConfidenceValue(language=Language.URDU, value=0)]
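The position of the expected language in these lists can be checked programmatically. The following is a minimal sketch, not part of lingua-py: it uses plain `(language, value)` tuples copied from the top of the "Hi" output above instead of lingua's `ConfidenceValue` objects, and `rank_of` is a hypothetical helper name.

```python
# Top ten entries of the "Hi" output above, copied as plain tuples.
# (Assumption: plain tuples stand in for lingua's ConfidenceValue objects.)
hi_confidences = [
    ("MAORI", 0.06075102321391407),
    ("TSONGA", 0.05757593734895124),
    ("SWAHILI", 0.0540818929944233),
    ("ZULU", 0.049523395411588976),
    ("SHONA", 0.04702462087367202),
    ("XHOSA", 0.036701774614104545),
    ("VIETNAMESE", 0.036454756688253924),
    ("TAGALOG", 0.03404539900279216),
    ("SOMALI", 0.03317725625173807),
    ("ENGLISH", 0.030830400551457332),
]


def rank_of(language, confidences):
    """Return the 1-based rank of `language` by descending value, or None."""
    ordered = sorted(confidences, key=lambda pair: pair[1], reverse=True)
    for position, (name, _value) in enumerate(ordered, start=1):
        if name == language:
            return position
    return None


print(rank_of("ENGLISH", hi_confidences))  # prints 10
```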

I think this matters because it makes the detector impossible to use in a multilingual chatbot scenario, where you have to determine the language at the beginning of the chat and change behaviour depending on that detection (e.g. report whether the language is supported, change the available intents, etc.).
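To make the chatbot scenario concrete, here is a rough sketch of the kind of gating logic involved, under stated assumptions: the confidence values are the ones reported above for "Hi", reduced to a plain dict for four languages a bot might support, and `pick_supported` with its `min_ratio` threshold is a hypothetical helper, not a lingua-py API. Filtering the all-languages output this way is not equivalent to building a detector from a restricted set of languages.

```python
# Confidence values for four candidate languages, copied from the
# all-languages output for "Hi" above (plain dict, not lingua objects).
hi_confidences = {
    "ENGLISH": 0.030830400551457332,
    "DUTCH": 0.018458932009078492,
    "FRENCH": 0.012468853625883245,
    "SPANISH": 0.011448091245211359,
}


def pick_supported(confidences, supported, min_ratio=1.5):
    """Hypothetical helper: return the best supported language, or None.

    None means "no usable answer": either no supported language has a
    positive value, or the best one does not beat the runner-up by at
    least `min_ratio`, in which case the bot could ask the user instead.
    """
    candidates = sorted(
        (
            (value, name)
            for name, value in confidences.items()
            if name in supported and value > 0
        ),
        reverse=True,
    )
    if not candidates:
        return None
    if len(candidates) > 1 and candidates[0][0] < min_ratio * candidates[1][0]:
        return None
    return candidates[0][1]


supported_languages = {"ENGLISH", "DUTCH", "FRENCH", "SPANISH"}
print(pick_supported(hi_confidences, supported_languages))
```

The ratio threshold is an arbitrary illustration; a stricter value makes the bot fall back to asking the user more often.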

AlexUmnov, Mar 22 '24 10:03