elasticsearch-langdetect

Accuracy problem with attachment

Open richardwilly98 opened this issue 11 years ago • 3 comments

Here is a set of documents [1] that have been detected as non-English (folder name = language detected).

Is there a way to improve accuracy?

[1] - https://dl.dropboxusercontent.com/u/64847502/langdetect-sample.zip

richardwilly98 avatar Jan 14 '14 13:01 richardwilly98

I have just tested all of the documents with the Tika command-line utility, and they all return the correct language: en.

richardwilly98 avatar Jan 14 '14 13:01 richardwilly98

Thanks!

I will try to reproduce the issue.

jprante avatar Jan 14 '14 14:01 jprante

The langdetect plugin also returns the language en in my tests.

Additionally, the languages pl and so are detected, but with much lower probability.

I think I should add a strict parameter to the plugin so that langdetect returns only the language with the highest probability.

jprante avatar Jan 16 '14 22:01 jprante
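
For illustration only, the "strict" behaviour proposed in the last comment could look roughly like the sketch below. This is not the plugin's code: the function name `select_languages`, the `strict` flag as modelled here, and the probability values are all hypothetical. It only shows the idea of keeping the single most probable language from a list of candidates.

```python
from typing import List, Tuple

# Hypothetical type: (language code, probability) as a detector might report it.
Candidate = Tuple[str, float]


def select_languages(candidates: List[Candidate], strict: bool = False) -> List[Candidate]:
    """Rank candidates by probability; in strict mode keep only the top one.

    Assumption: this mirrors the idea in the thread, not the plugin's real API.
    """
    # Sort by probability, highest first. The sort is stable, so candidates
    # with equal probability keep their original relative order.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:1] if strict else ranked


if __name__ == "__main__":
    # Made-up probabilities, chosen only to resemble the case described above.
    detected = [("pl", 0.1), ("en", 0.8), ("so", 0.1)]
    print(select_languages(detected, strict=True))
    print(select_languages(detected))
```

With `strict=True` only the top-ranked candidate is returned, so low-probability extras such as pl and so would no longer show up in the result.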