fuzzywuzzy
Unicode/UTF-8 support
Hi, I want to contribute Unicode support for the library. Do you know how much work that would be, or what has to be done for the feature?
None, I'm afraid. I'm unsure if the underlying Levenshtein implementation supports it.
Hmm. You said that you use the same Levenshtein library as the original fuzzywuzzy does. How does fuzzywuzzy handle Unicode?
I recall seatgeek/fuzzywuzzy supporting two implementations; it will use python-Levenshtein if installed. From their past issues it seems that the library supports Unicode.
From https://github.com/Tmplt/fuzzywuzzy/blob/a4f8b717b3f30208436f82054413660a8d2f7613/include/levenshtein.h#L23-L32, my reading is that the Levenshtein implementation could work with any integral type.
So I figure the Python interop maps Unicode characters to integral values? I cannot say for sure. I would personally start there: see how Unicode is treated when Python calls C code, and figure out whether the Levenshtein implementation has to be changed.
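To illustrate what "mapping Unicode characters to integral values" could look like on the C++ side, here is a minimal sketch that decodes UTF-8 bytes into code points. `decode_utf8` is a hypothetical helper, not part of this library or of python-Levenshtein, and it assumes the input is UTF-8; it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: turn UTF-8 bytes into one integral value per code point.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (extra > bytes.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With something like this, "naïve" becomes five integral values instead of six bytes, which is the form an integral-type Levenshtein routine would want.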
I guess there's no easy way to get Unicode support (without switching std::string to a Unicode-aware one for the entire library). Thanks.
On 20-10-2019 00:09, Martin Chang wrote:
I guess there's no easy way to get Unicode support (without switching std::string to a Unicode-aware one for the entire library). Thanks.
std::string can still be used (at least on Linux-based distros), but there will be a lot of intermediate std::wstring instances along with std::mbstowcs calls. Likely some hacks to interface with levenshtein.c unless reimplemented, too.
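A rough sketch of the kind of intermediate conversion meant here, assuming the caller has already selected a UTF-8 locale with std::setlocale (std::mbstowcs interprets the bytes according to the current C locale). `widen` is a hypothetical name, and a locale name such as "C.UTF-8" is an assumption that does not hold on every system.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a multibyte std::string to std::wstring.
// Conversion stops at the first embedded NUL, as with any C string API.
std::wstring widen(const std::string& s) {
    // The result never has more wide characters than the input has bytes.
    std::wstring out(s.size(), L'\0');
    const std::size_t n = std::mbstowcs(&out[0], s.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for current locale");
    }
    out.resize(n);  // drop the unused tail
    return out;
}
```

Every comparison would need both inputs widened like this first, and the result is still platform-dependent: wchar_t is 32 bits on Linux but 16 bits on Windows.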
I think the proper solution would be to use something like Qt's QString and re-implement levenshtein.c to support it. That needs no hacks, and Qt is popular enough on *nix systems. (Windows will be a problem.)
On 20-10-2019 21:38, Martin Chang wrote:
I think the proper solution would be to use something like Qt's QString and re-implement levenshtein.c to support it.
I don't want to pull in Qt just for string comparison; Levenshtein could instead be reimplemented via templates where we let the user decide what string type they want to use.
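As a rough sketch of the template idea (not the library's actual code; `levenshtein_distance` is a hypothetical name), the distance can be written against any sequence type whose elements compare with `==`, so the caller picks std::string, std::u32string, std::vector<int>, or anything similar:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical sketch: edit distance over any iterable sequence type.
// Uses a single rolling row of the classic dynamic-programming table.
template <typename Seq>
std::size_t levenshtein_distance(const Seq& a, const Seq& b) {
    const std::size_t m =
        static_cast<std::size_t>(std::distance(std::begin(b), std::end(b)));
    std::vector<std::size_t> row(m + 1);
    std::iota(row.begin(), row.end(), std::size_t{0});  // distances from empty prefix of a

    std::size_t i = 0;
    for (auto ai = std::begin(a); ai != std::end(a); ++ai, ++i) {
        std::size_t diag = row[0];  // value from the previous row, previous column
        row[0] = i + 1;
        std::size_t j = 0;
        for (auto bj = std::begin(b); bj != std::end(b); ++bj, ++j) {
            const std::size_t subst = diag + (*ai == *bj ? 0 : 1);
            diag = row[j + 1];  // save previous-row value before overwriting it
            row[j + 1] = std::min({subst, row[j] + 1, diag + 1});
        }
    }
    return row[m];
}
```

The choice of element type then decides what counts as one edit: on UTF-8 bytes in a std::string, "naïve" vs "naive" is 2 edits, while on code points in a std::u32string it is 1.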
Levenshtein could instead be reimplemented
I'll look into that.