cqengine
cqengine copied to clipboard
Levenstein distance support
Hi The Levenstein distance index is quite simple to implement (see ruslansennov/cqengine@18b8546c38f9c7e2d4ee28e408c63f46fe2643b5). To make it fast I suggest using third-party library that constructs immutable precalculated automata. If it is interesting for you and can be merged to master, I will prepare a more well-formed PR
Hi ruslansennov,
Yes this would definitely be a useful addition.
I'd be happy to accept PRs to add a feature like this - however the question is whether to add this as part of CQEngine's core, or as an optional module (e.g. cqengine-levenstein).
I'd definitely like this feature to be easily accessible to all users, but my main concern with that particular automata library is the number of transitive dependencies it has: Guava, Google protobuf, commons-lang, SLF4J etc.
I've made an effort to keep CQEngine's own dependencies as small as possible to keep the requirements of CQEngine to a minimum and not to conflict with a particular versions of popular libraries (such as Guava etc.) that applications might be using.
Is that the only (or simply the best) library for levenstein automata? Or are there reasonable alternatives with fewer dependencies?
If you really want to use that one, we could maybe create an optional module cqengine-levenstein, perhaps similar to the existing cqengine-streamsupport module.
Is that the only (or simply the best) library for levenstein automata? Or are there reasonable alternatives with fewer dependencies?
Unfortunately I know nothing about others java-libraries which uses automata for calculating levenstein distance. We use the specified library in the production code, and it's really fast.
I understand your opinion about the dependencies and share it :)
It seems that moving LevensteinIndex
to own module is the simplest way and it can be done in a few hours. Also we can re-implement the automata (without a huge list of dependencies), but this is a long story.
@ruslansennov @npgall - I think this request aims to bring cqengine on little new level. I would instead on supporting levenshtein algo as single stuff (I like more damerau evoluted version anyway :-) ), to support some kind of fuzzy logic in general as kind of fuzzy extension integrated within query language.
There are broad libraries to discover, I also agree that main aim is to get pur algo without much or none dependencies on 3rd party libs which will just affect performance, I found these: https://www.fuzzylite.com/java/ http://jfuzzylogic.sourceforge.net/html/index.html https://github.com/sorend/fuzzy4j
I don't have time now to discover which one is best to use, we can also cut only specific algo implementation out of these projects in necessary.
@npgall @archenroot At the moment I removed from liblevenshtein's dependencies list most of third-party libs except it.unimi.dsi:fastutil
, because it should be benchmarked. See liblevenshtein-java pure branch
@npgall, hi! I'm speaking on behalf of @ruslansennov as we've recently become colleagues :)
We've done some measurements to find out if there is a significant performance gap in liblevenshtein
when using fastutil
and standard Java collections. It seems that there is a small difference, with the solution based on standard java.util collections being just slightly slower than the original version (in artificial benchmarks it works at around 80-90% of original's rate).
This led us to the conclusion that:
- in most realistic scenarios a 10-20% performance penalty is going to be tolerable
- we can safely remove all usages of specialized collections from
liblevenshtein
codebase and remove the dependency onfastutil
We don't yet have a PR for you to review, as some additional effort will be required to cleanup the remaining usages of fastutil
classes (i.e. not the collections, but other library-specific artifacts). Before investing more time into this task we would like to receive some sort of confirmation from you, that a dependency-free liblevenshtein
-based index is really what you want (i.e. you don't mind the performance penalty).
Another option would be to keep fastutil
for slightly better performance and place the levenshtein-based index into an optional module like was discussed above.
Aside from fastutil
there is also slf4j
, but we are ready to remove it as well (there are only a few usages).
Please let us know your thoughts :)
Hi!
I am open to both options :)
If you would like to have your new index merged into CQEngine itself and then shipped as part of the core CQEngine library, then I would totally support that, and this would be my preferred option as it would be easiest to manage and keep in sync with changes in CQEngine. But this option means we need a solution which keeps the number of dependencies required by the index to a reasonable minimum (for the reasons I mentioned above).
On the other hand, if you prefer to have the better performance by using fastutil and other libraries, then I'd try to give as much support as I can for that, but it would be maintained in a separate project. End users who want to use this index could then declare a dependency on it in their project (and would get all of the dependencies also). I'd be happy to link to the other project from the CQEngine documentation of course.
So whichever option you want to go with, I'll try to help out!
Best regards, Niall
@npgall , thank you for the answer! Good news is that we've been able to get rid of all dependencies without incurring any penalty:)
We'd like to know if you are comfortable with using jitpack.io for retrieving dependencies from Github? Or you'd rather prefer us to release our fork of liblevenshtein
to Maven Central?
I think it's best if you could deploy the dependencies to Maven Central. I think this will be important to some organizations who use CQEngine that CQEngine's dependencies will all be in Maven Central, because of the immutability and reproducible builds etc.