
FuzzyWuzzy MIT?

Open • omkar-tenkale opened this issue 5 years ago • 11 comments

There's an MIT-licensed version in Python.

Can we have the same for Java?

The license is the biggest issue that I and 90% of other developers are facing.

And the worst thing is that there is no alternative library in Java with at least the bare minimum performance this library offers.

I've searched everywhere.

A Levenshtein distance port for Java is available, but it performs very poorly for the use case where you match a user's input (2-3 chars) against a list of strings, e.g. matching "sai" against school names.
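To make the short-input problem concrete, here is a minimal sketch. It compares a plain full-string Levenshtein distance with a hypothetical `bestWindowDistance` helper that scores the query against every same-length window of the candidate. The class name, the helper, and the school names are assumptions made up for illustration; this is not how FuzzyWuzzy or any existing Java port actually scores partial matches.

```java
import java.util.Locale;

public class ShortQueryDemo {
    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: lowest distance between the query and any
    // window of the candidate that has the same length as the query.
    public static int bestWindowDistance(String query, String candidate) {
        String q = query.toLowerCase(Locale.ROOT);
        String c = candidate.toLowerCase(Locale.ROOT);
        if (c.length() <= q.length()) {
            return levenshtein(q, c);
        }
        int best = Integer.MAX_VALUE;
        for (int start = 0; start + q.length() <= c.length(); start++) {
            best = Math.min(best, levenshtein(q, c.substring(start, start + q.length())));
        }
        return best;
    }

    public static void main(String[] args) {
        String query = "sai";
        // Made-up candidate names, for illustration only.
        String[] names = {"Sai Vidya High School", "Sun"};
        for (String name : names) {
            int full = levenshtein(query, name.toLowerCase(Locale.ROOT));
            int window = bestWindowDistance(query, name);
            System.out.println(name + ": full=" + full + ", window=" + window);
        }
    }
}
```

With the full-string distance, the long name that actually starts with "sai" scores 18 (almost all of it is the length difference), while the unrelated "Sun" scores 2 and would rank first. The window-based score gives 0 and 2 respectively, which is closer to what a user typing a few characters would expect.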

omkar-tenkale, Aug 31 '20 11:08

It is not possible to use python-Levenshtein in an MIT-licensed library, which is the reason FuzzyWuzzy has the GPL license. The MIT-licensed version only uses difflib (slow), but is still a bit questionable, since it is based on a FuzzyWuzzy version that was already GPL-licensed (I do not think SeatGeek cares, since they originally released FuzzyWuzzy under the MIT license). I wrote a faster alternative implementation for C++/Python (https://github.com/maxbachmann/RapidFuzz). This implementation could be ported to Java. However, since I am not very familiar with Java, this would require someone else to maintain the implementation (I am willing to help with questions regarding the algorithms).

maxbachmann, Apr 11 '21 11:04

@maxbachmann One could also look into using the C code directly through the Java Native Interface (JNI) without porting it. I don't have a lot of experience with what that means performance-wise, though.
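As a rough idea of what the Java side of such a binding could look like, here is a minimal sketch. The library name `rapidfuzz_jni`, the class name, and the `nativeDistance` method are all hypothetical; as far as this thread goes, no such binding exists. The sketch falls back to a pure-Java implementation when the native library cannot be loaded, so it still runs without any compiled code.

```java
public class NativeDistance {
    private static final boolean NATIVE_AVAILABLE = tryLoad();

    private static boolean tryLoad() {
        try {
            // Hypothetical library name; it would have to be built per platform.
            System.loadLibrary("rapidfuzz_jni");
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    // Hypothetical native entry point that would be implemented in C/C++.
    // Passing Strings across JNI may involve copying their contents.
    private static native int nativeDistance(String a, String b);

    public static boolean isNativeAvailable() {
        return NATIVE_AVAILABLE;
    }

    // Uses the native implementation when present, the Java fallback otherwise.
    public static int distance(String a, String b) {
        return NATIVE_AVAILABLE ? nativeDistance(a, b) : javaDistance(a, b);
    }

    // Pure-Java fallback: classic dynamic-programming Levenshtein distance.
    static int javaDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println("native available: " + isNativeAvailable());
        System.out.println("distance: " + distance("kitten", "sitting"));
    }
}
```

The fallback is a design choice for this sketch only: it keeps callers working on platforms where no native build is shipped, at the cost of maintaining two code paths.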

Chase22, Jun 11 '21 06:06

@Chase22 I already do something similar for Python. When using small strings with a fast similarity metric, there is a noticeable performance impact. However, the main reason for this is that Python calls functions with a list of arguments and a hashmap of named arguments, which have to be parsed on each call.

I can think of the following advantages/disadvantages of JNI:

+ Probably less maintenance, since it reuses a big part of the code. Note, however, that in Python the wrapper used to call the C++ code is actually much bigger than the library itself (partially because much of the wrapper is generated). The C++ library has around 5k lines of code, while the wrapper has over 50k lines of code.

- I guess it would have to be compiled for each platform, which can be a pain

-/+ Performance-wise I am unsure as well. JNI might add noticeable overhead (e.g. if all strings have to be copied). However, the algorithms make heavy use of bitwise operations, which might be slower in pure Java. So this might go either way.
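To illustrate the kind of bitwise work mentioned above, here is a minimal pure-Java sketch of a bit-parallel longest common subsequence (LCS) length, using the well-known update `V = (V + (V & M)) | (V & ~M)` on a single 64-bit `long`. The class and method names are hypothetical, the sketch is limited to patterns of at most 64 characters, and it is not taken from RapidFuzz. It only shows that the operations map directly onto Java's `long` arithmetic; it says nothing about how fast this is compared to C++.

```java
import java.util.HashMap;
import java.util.Map;

public class BitParallelLcs {
    // Bit-parallel LCS length; pattern must fit into one 64-bit word.
    public static int lcsLength(String pattern, String text) {
        int m = pattern.length();
        if (m > 64) {
            throw new IllegalArgumentException("sketch handles at most 64 pattern chars");
        }
        // For each character, a bitmask of the positions where it occurs in the pattern.
        Map<Character, Long> masks = new HashMap<>();
        for (int i = 0; i < m; i++) {
            masks.merge(pattern.charAt(i), 1L << i, (x, y) -> x | y);
        }
        long v = ~0L;
        for (int j = 0; j < text.length(); j++) {
            long match = masks.getOrDefault(text.charAt(j), 0L);
            long u = v & match;
            // v - u equals v & ~match because u is a subset of v's set bits.
            v = (v + u) | (v - u);
        }
        // Each zero bit among the low m bits contributes one to the LCS length.
        long used = (m == 64) ? ~0L : (1L << m) - 1;
        return Long.bitCount(~v & used);
    }

    // Straightforward dynamic-programming version, kept for cross-checking.
    public static int lcsLengthDp(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(lcsLength("kitten", "sitting"));
        System.out.println(lcsLengthDp("kitten", "sitting"));
    }
}
```

The inner loop of `lcsLength` does a constant number of word operations per text character, whereas the DP version does work proportional to the pattern length per text character. Whether the JIT-compiled Java version gets close to native code would have to be measured.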

maxbachmann, Jun 11 '21 07:06