elasticsearch-langdetect
Accuracy problem with attachment
Here is a set of documents [1] that were detected as non-English (each folder name is the language that was detected).
Is there a way to improve accuracy?
[1] - https://dl.dropboxusercontent.com/u/64847502/langdetect-sample.zip
I have just tested all documents using the Tika command-line utility and they all return the correct language: `en`.
Thanks!
I will try to reproduce the issue.
The langdetect plugin also returns the language `en` in my tests.
Additionally, the languages `pl` and `so` are detected, but with much lower probability.
I think I should add a `strict` parameter to the plugin so that langdetect returns only the language with the highest probability.
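To make the idea concrete, here is a minimal sketch of what such a `strict` switch could do. It is not the plugin's code: the function name `select_languages`, the candidate list format, and the probabilities are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# A candidate is a (language code, probability) pair.
Candidate = Tuple[str, float]


def select_languages(candidates: List[Candidate], strict: bool = False) -> List[str]:
    """Return language codes ordered by descending probability.

    With strict=True, only the single most probable language is kept.
    Hypothetical helper, not part of the langdetect plugin.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if strict:
        ranked = ranked[:1]  # keep only the top candidate (or nothing if empty)
    return [lang for lang, _ in ranked]


# Illustrative probabilities only, not measured output from the plugin.
sample = [("pl", 0.06), ("en", 0.90), ("so", 0.04)]

print(select_languages(sample))               # all candidates, best first
print(select_languages(sample, strict=True))  # only the best candidate
```

Without `strict`, low-probability candidates such as `pl` and `so` would still be reported alongside `en`; with it, only `en` would remain.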