elasticsearch-ingest-langdetect

Fails to detect Chinese language

Open liu-xiao-guo opened this issue 5 years ago • 2 comments

I just tried the following example with Chinese text:

PUT my-index/_doc/2?pipeline=langdetect-pipeline
{ "my_field": "我爱北京天安门" }

Then I retrieved the document with:

GET my-index/_doc/2

The returned result was:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "my_field" : "我爱北京天安门",
    "language" : "ko"
  }
}

It reported "ko" (Korean) instead of Chinese.

liu-xiao-guo, Feb 10 '20 09:02

PUT my-index/_doc/3?pipeline=langdetect-pipeline
{ "my_field": "hello" }

It returns:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "3",
  "_version" : 4,
  "_seq_no" : 16,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "my_field" : "hello",
    "language" : "fi"
  }
}

liu-xiao-guo, Feb 10 '20 11:02

Hi,

I've been doing a bit of digging into this plugin. It seems to be a wrapper around this library:

com.youcruit.com.cybozu.labs:langdetect https://github.com/YouCruit/language-detection/

That library is a fork of a fork of https://github.com/shuyo/language-detection, which is also used by this plugin: https://github.com/jprante/elasticsearch-langdetect. That one is only compatible with Elasticsearch 5.x, so the spinscale plugin brings compatibility with Elasticsearch 6 and 7. Under the hood, though, the language detection mechanism is basically identical.

Your problem comes down to the detection accuracy of shuyo/language-detection. Similar reports have been posted in that project's issue queue: https://github.com/shuyo/language-detection/issues

Your second example outputs what I'd expect. "Hello" is used in many languages. Without any context, the plugin will just pick a language that matches "hello". If you give it a more meaningful string, e.g. "Hello, my name is liu-xiao-guo", it should be able to detect the language more accurately.

As for your first example, it looks like Chinese is a difficult language to detect. The shuyo library seems to be based on building a crude language model for around 50 languages that captures the characteristic features of each language. There's a nice presentation linked from their documentation: https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
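To make the point about short inputs concrete, here is a minimal sketch of a character n-gram detector with naive Bayes scoring. It is a toy, not the library's code: the two one-sentence "profiles", the function names (`ngrams`, `train`, `detect`) and the add-one smoothing are all assumptions made for illustration, whereas the real library ships profiles built from far larger corpora. It only shows the general effect that a short string gives the model few n-grams to work with, so the result is less decisive than for a full sentence.

```python
import math
from collections import Counter

def ngrams(text, n_max=3):
    """Yield character n-grams (n = 1..n_max) of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def train(samples):
    """Build one n-gram count table per language from {lang: sample_text}."""
    return {lang: Counter(ngrams(text)) for lang, text in samples.items()}

def detect(text, profiles):
    """Return {lang: probability} using naive Bayes with a uniform prior."""
    vocab = set().union(*profiles.values())  # every n-gram seen in any profile
    log_scores = {}
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing so n-grams missing from a profile do not zero the score.
        log_scores[lang] = sum(
            math.log((counts[gram] + 1) / (total + len(vocab)))
            for gram in ngrams(text)
        )
    # Turn log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    weights = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    norm = sum(weights.values())
    return {lang: weight / norm for lang, weight in weights.items()}

# Hypothetical, deliberately tiny training data (one sentence per language).
profiles = train({
    "en": "hello my name is anna and i live in a small house near the river",
    "fi": "hei minun nimeni on anna ja asun pienessä talossa joen lähellä",
})

short_result = detect("hello", profiles)
long_result = detect("Hello, my name is liu-xiao-guo", profiles)
print("short:", short_result)
print("long: ", long_result)
```

With these toy profiles, the longer sentence should come out as "en" with a probability very close to 1, while the single word yields a noticeably less certain answer. A real detector with dozens of languages faces the same issue on a larger scale.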

netsensei, Feb 10 '20 13:02

Nowadays you can try the lingua engine as an alternative. Closing this, as it is an upstream issue.

spinscale, Aug 18 '22 10:08