elasticsearch-langdetect

Accuracy problem

Open Nelrohd opened this issue 10 years ago • 7 comments

Hi,

I get some strange results when I use it on French text:

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
    "ok" : true,
    "languages" : [ {
        "language" : "nl",
        "probability" : 0.9999951375010268
    } ]
}

The text is French, but I get "nl". Is something wrong?

Nelrohd avatar Mar 11 '14 01:03 Nelrohd

Detecting the language of short text is pretty hard. For instance, Google Translate also detects your text as Dutch:

http://translate.google.com/#auto/en/je%20vend%20ma%20chemise%20verte

In general, I've found that detection on anything shorter than about 300 bytes (roughly 60 words) is not very reliable. Unfortunately, I haven't gathered any statistical data to find a good cutoff.
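
The rule of thumb above could be applied on the client side before trusting a detection result. This is only a minimal Python sketch, not part of the plugin: the function name `is_long_enough` and the `min_bytes` parameter are hypothetical, and the 300-byte default is the rough figure mentioned above, not a measured cutoff.

```python
def is_long_enough(text: str, min_bytes: int = 300) -> bool:
    """Return True if the UTF-8 encoding of `text` has at least `min_bytes` bytes.

    The default of 300 is an assumed rule of thumb, not a validated threshold.
    """
    return len(text.encode("utf-8")) >= min_bytes


sample = "je vend ma chemise verte"
print(len(sample.encode("utf-8")))  # 24 bytes, far below the assumed cutoff
print(is_long_enough(sample))       # False
```

Measuring bytes rather than characters matches the figure quoted above, but it also means that text with many multi-byte characters reaches the cutoff with fewer characters.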

gibrown avatar Apr 12 '14 16:04 gibrown

Hi,

Do you intend to support the short-text profiles for this purpose? (Distributed since 03/03/2014: https://code.google.com/p/language-detection/)

adrienschuler-zz avatar Nov 20 '14 12:11 adrienschuler-zz

That looks promising. The training data is based on a Twitter corpus.

gibrown avatar Nov 20 '14 14:11 gibrown

1.4.0.1 released, with the setting "profile": "/langdetect/short-text/"

jprante avatar Nov 20 '14 20:11 jprante

Thanks for the quick answer and patch! It would be awesome if the "short-text" profile setting could be selected through the REST API as well :)

adrienschuler-zz avatar Nov 25 '14 15:11 adrienschuler-zz

1.4.0.2 was just released. It has another REST API command for switching profiles.

jprante avatar Nov 25 '14 23:11 jprante

Thanks a lot! A quick review already shows good improvements, such as:

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
    "profile" : "/langdetect/short-text/",
    "languages" : [ {
        "language" : "fr",
        "probability" : 0.5714283159213042
    }, {
        "language" : "nl",
        "probability" : 0.42857000187571836
    } ]
}
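
Since the two candidates above are fairly close, a client might want to check the gap between them before accepting the top result. The following Python sketch parses the response shown above and reports the best candidate together with a confidence flag. The function name `top_language` and the `min_margin` threshold of 0.2 are assumptions made for illustration; they are not part of the plugin.

```python
import json

# Response body copied from the short-text run shown above.
RESPONSE = """
{
    "profile" : "/langdetect/short-text/",
    "languages" : [ {
        "language" : "fr",
        "probability" : 0.5714283159213042
    }, {
        "language" : "nl",
        "probability" : 0.42857000187571836
    } ]
}
"""


def top_language(body: str, min_margin: float = 0.2):
    """Return (language, probability, is_confident) for the best candidate.

    `min_margin` is a hypothetical threshold: the result counts as confident
    only if the best candidate leads the runner-up by at least this much.
    """
    candidates = sorted(
        json.loads(body)["languages"],
        key=lambda c: c["probability"],
        reverse=True,
    )
    if not candidates:
        return None, 0.0, False
    best = candidates[0]
    runner_up = candidates[1]["probability"] if len(candidates) > 1 else 0.0
    margin = best["probability"] - runner_up
    return best["language"], best["probability"], margin >= min_margin


language, probability, confident = top_language(RESPONSE)
print(language, confident)  # fr False (the lead over "nl" is only about 0.14)
```

With the earlier default-profile response, which contains a single "nl" candidate at about 0.99999, the same check would report a confident result, so a margin test alone does not catch a confidently wrong detection on short text.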

adrienschuler-zz avatar Nov 26 '14 11:11 adrienschuler-zz