elasticsearch-plugin-bundle

Settings definition differs between original langdetect and bundle

Open marbleman opened this issue 8 years ago • 11 comments

Had a hard time today figuring out why my application slowed down by a factor of around 20. After a lot of profiling I found langdetect to be the cause. I finally compared the original langdetect plugin with the plugin-bundle and wrote a unit test to measure execution time.

The reason is quite simple: the original langdetect plugin expects settings in the form langdetect.languages = en,de,fr while the plugin-bundle expects languages = en,de,fr in elasticsearch.yml

This applies to all settings (compare src\main\java\org\xbib\elasticsearch\module\langdetect\LangdetectService.java for details)
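To illustrate how a mismatched key could go unnoticed and cost time, here is a minimal sketch. It is not the plugin's code: the class name, the `resolveLanguages` helper, the fallback list, and the assumption that a missing key silently falls back to all languages are hypothetical and only model the behaviour described above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class SettingsKeyDemo {
    // Hypothetical stand-in for "every language the detector ships with".
    static final List<String> ALL_LANGUAGES =
            Arrays.asList("de", "en", "fr", "it", "ru", "ja");

    // Reads the language list stored under the given key.
    // Assumption: an absent or empty key falls back to all languages without any warning.
    static List<String> resolveLanguages(Map<String, String> settings, String key) {
        String value = settings.get(key);
        if (value == null || value.trim().isEmpty()) {
            return ALL_LANGUAGES;
        }
        return Arrays.asList(value.trim().split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        // Settings written with the prefixed key.
        Map<String, String> yml =
                Collections.singletonMap("langdetect.languages", "en,de,fr");

        // A reader that expects the prefixed key sees the restricted list.
        System.out.println(resolveLanguages(yml, "langdetect.languages"));

        // A reader that expects the bare key finds nothing and uses every language.
        System.out.println(resolveLanguages(yml, "languages"));
    }
}
```

Under these assumptions the second lookup returns the full list, so detection would run against every language profile even though the configuration looks restricted.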

Is this intended? If yes, I will push an update to the docs...

BTW: I also tried the parameter ?profile=/langdetect/short-text/ since it appeared to me it could speed up detection (probably at cost of accuracy). But in all my tries I always got "profile": "/langdetect/" returned.

marbleman avatar Mar 08 '16 23:03 marbleman

You're working so hard to find the differences between those two incarnations of the plugin... this helps a lot in aligning them!

Surely differences were never intended; the codebases should be the same. They diverged because I focused on the "bundle" for a more comprehensive installation in my production environment, leaving the "langdetect-plugin" a bit behind. I got some internal feedback for the "bundle" that never made it back to the other version. Sorry for the mess.

BTW there are also some junit tests missing in the "bundle" which are present in "langdetect-plugin".

jprante avatar Mar 08 '16 23:03 jprante

Would have saved a lot of time if I had the idea to compare the two code folders earlier... ;)

Which codebase is intended to be the Master? I guess the single plugins since they carry the most detailed documentation, right?

I am not a Java developer by nature so it will take me a lot of effort and time to set up a functional development environment for all this stuff. I promise I will get to it at some point ;) Maybe you have a good tip for a starting point/howto. The unit test I mentioned was written against the PHP implementation, though.

So for now all I can offer is to help with the docs and testing. Is there a way to get notifications on changes similar to code reviews? This would help to check immediately when implementation and documentation go out of sync. I would rather invest the time here, where everyone benefits, than spend hours reverse engineering issues like the one above... ;)

marbleman avatar Mar 09 '16 09:03 marbleman

I see you are investing a lot of your time into langdetect right now, so I will align both codebases in the next few hours, in the hope that I can clear up the mess a bit. Both contain parts that belong to the current state.

Watching a GitHub project should give you notifications about commits, but I'm not sure :(

jprante avatar Mar 09 '16 09:03 jprante

I'll give it a try. Let me know if I can be of any help.

marbleman avatar Mar 09 '16 11:03 marbleman

BTW: I figured out that reducing the languages to test as described above will lead to wrong results instead of no result or at least a low probability:

E.g. I limit detection to de,en and send in a French text. The result gives me "en" with a probability of 0.99!

marbleman avatar Mar 09 '16 11:03 marbleman

First commit is here in my alignment quest.

https://github.com/jprante/elasticsearch-langdetect/commit/ba72272857e17599d8191b88efae5cb3d8e45246

Plugin bundle will follow.

jprante avatar Mar 09 '16 15:03 jprante

So here is the second commit to align both langdetect codebases

https://github.com/jprante/elasticsearch-plugin-bundle/commit/48b27ba2b36c2cbb5caf3fbaabe792366a8ce5fb

jprante avatar Mar 09 '16 21:03 jprante

And another one

https://github.com/jprante/elasticsearch-langdetect/commit/4eaff493a2201de83a69598a51d58924e29edee4

jprante avatar Mar 09 '16 21:03 jprante

Just came across something that confuses me: I thought you had mentioned that you wanted to go for ISO-639-1 codes in langdetect (de, en... instead of ger, eng)?

Current bundle 2.2.0.3 returns ger, eng...

marbleman avatar Mar 23 '16 23:03 marbleman

Oh, and I stumbled over some details regarding limiting the detected languages in yml that could use some extra documentation, but the intention is still a bit unclear to me: I limited detection to de, en because detecting all languages takes too much time. Now I send in a Russian text and get a probability of 0.999xxx for either de or en. I would expect a much lower probability or even an empty result instead. Am I wrong?
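One possible explanation, offered as an assumption rather than something confirmed in this thread: if the detector normalizes its scores over the configured candidate languages only, the probabilities must sum to 1 across just those candidates, so the least bad candidate ends up close to 1 even for text in an excluded language. The sketch below uses made-up raw scores and a hypothetical `normalize` helper; it does not reproduce the plugin's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RestrictedCandidatesDemo {
    // Converts raw scores into probabilities that sum to 1 over the given candidates only.
    static Map<String, Double> normalize(Map<String, Double> rawScores, String... candidates) {
        double total = 0.0;
        for (String lang : candidates) {
            total += rawScores.get(lang);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (String lang : candidates) {
            result.put(lang, rawScores.get(lang) / total);
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented raw scores for a French text (illustrative values only).
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("fr", 0.9);
        raw.put("en", 0.0099);
        raw.put("de", 0.0001);

        // With fr among the candidates, en receives only a small share.
        System.out.println(normalize(raw, "fr", "en", "de"));

        // With fr excluded, the same raw scores give en roughly 0.99.
        System.out.println(normalize(raw, "en", "de"));
    }
}
```

If this assumption holds, the reported probability only ranks the allowed languages against each other and says nothing about whether the text belongs to any of them.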

marbleman avatar Mar 23 '16 23:03 marbleman

I drilled down on the ISO issue and figured out that the repo already contains a language.json with de/en... while elasticsearch-plugin-bundle-2.2.0.3-plugin.zip still contains the old one with ger/eng...

marbleman avatar Mar 24 '16 18:03 marbleman