elasticsearch-ingest-langdetect
Fail to detect Chinese language
I just tried the following example with Chinese text:
PUT my-index/_doc/2?pipeline=langdetect-pipeline { "my_field": "我爱北京天安门" }
Then I retrieved the document with:
GET my-index/_doc/2
The returned result was:
{ "_index" : "my-index", "_type" : "_doc", "_id" : "2", "_version" : 1, "_seq_no" : 2, "_primary_term" : 1, "found" : true, "_source" : { "my_field" : "我爱北京天安门", "language" : "ko" } }
The detected language was "ko" (Korean) instead of Chinese. I also tried an English word:
PUT my-index/_doc/3?pipeline=langdetect-pipeline { "my_field": "hello" }
Retrieving that document returns:
{ "_index" : "my-index", "_type" : "_doc", "_id" : "3", "_version" : 4, "_seq_no" : 16, "_primary_term" : 1, "found" : true, "_source" : { "my_field" : "hello", "language" : "fi" } }
Hi,
I've been doing a bit of digging in this plugin. It seems to be a wrapper around this library:
com.youcruit.com.cybozu.labs:langdetect https://github.com/YouCruit/language-detection/
That library is a fork of a fork of this project: https://github.com/shuyo/language-detection. The same project is also used in this plugin: https://github.com/jprante/elasticsearch-langdetect. The jprante plugin is only compatible with Elasticsearch 5.x, so the spinscale plugin brings compatibility with Elasticsearch 6 and 7. Under the hood, though, the language detection mechanism is basically identical.
Your problem comes down to the detection accuracy of shuyo/language-detection. Similar issues have been posted in that project's issue queue: https://github.com/shuyo/language-detection/issues
Your second example outputs what I'd expect. "Hello" is used in many languages. Without any context, the plugin will just pick a language that matches "hello". If you give it a more meaningful string, e.g. "Hello, my name is liu-xiao-guo", it should be more accurate.
As for your first example, it looks like Chinese is a difficult language to detect. The shuyo library seems to be based on building a crude language model for 50 languages that detects the features of each language. There's a nice presentation in their documentation: https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
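To illustrate why very short inputs are unreliable, here is a minimal sketch of a character trigram detector with naive Bayes scoring. It is not the library's actual code. It assumes an approach in the spirit of the one described in that presentation, and the two training sentences, `CORPORA`, `trigrams` and `detect` are all made up for this example.

```python
import math
from collections import Counter

# Tiny made-up training texts (an assumption for illustration only);
# a real detector is trained on far larger corpora.
CORPORA = {
    "en": "the book is on the table and my name is anna",
    "fi": "kirja on pöydällä ja minun nimeni on anna",
}


def trigrams(text):
    """Return the character trigrams of text, padded with one space per side."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


# One trigram frequency profile per language.
PROFILES = {lang: Counter(trigrams(text)) for lang, text in CORPORA.items()}
VOCAB_SIZE = len(set().union(*PROFILES.values()))


def detect(text):
    """Return {language: posterior probability}, assuming a uniform prior."""
    log_scores = {}
    for lang, profile in PROFILES.items():
        total = sum(profile.values())
        # Add-one smoothing so an unseen trigram does not zero out a language.
        log_scores[lang] = sum(
            math.log((profile[g] + 1) / (total + VOCAB_SIZE))
            for g in trigrams(text)
        )
    # Normalise in log space to avoid underflow on long inputs.
    peak = max(log_scores.values())
    weights = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


print(detect("on"))
print(detect("my name is anna and the book is on the table"))
```

With these toy profiles, the two-letter input "on" (a word in both training sentences) leans towards Finnish with only modest confidence, while the full English sentence is assigned to English almost with certainty. A single short word gives the model very few n-grams to work with, so the winner is decided by small differences in the profiles.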
Nowadays you can try the lingua engine as an alternative. Closing this, as it is an upstream issue.