fuzzywuzzy

Unicode/UTF-8 support

Open marty1885 opened this issue 5 years ago • 8 comments

Hi, I want to contribute Unicode support to the library. Do you know how much work, and what exactly, has to be done for this feature?

marty1885, Oct 18 '19 14:10

None, I'm afraid. I'm unsure if the underlying Levenshtein implementation supports it.

tmplt, Oct 18 '19 16:10

Hmm. You said that you use the same Levenshtein library as the original fuzzywuzzy does. How does fuzzywuzzy handle Unicode?

marty1885, Oct 19 '19 03:10

I recall seatgeek/fuzzywuzzy supporting two implementations; it will use python-Levenshtein if installed. From their past issues, it seems that the library supports Unicode.

From https://github.com/Tmplt/fuzzywuzzy/blob/a4f8b717b3f30208436f82054413660a8d2f7613/include/levenshtein.h#L23-L32 I gather that the Levenshtein implementation could work with any integral type.

So I figure the Python interop maps Unicode characters to integral values? I cannot say for sure. I would personally start there: see how Unicode is treated when Python calls C code, and figure out whether the Levenshtein implementation has to be changed.
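To illustrate the "characters as integral values" idea, here is a minimal sketch of decoding UTF-8 into code points. `decode_utf8` is a hypothetical helper, not something from this library or from python-Levenshtein, and it assumes well-formed input (no validation is done).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the library): decode a UTF-8 byte string
// into Unicode code points. Assumes well-formed input; malformed sequences
// are not detected.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 1-byte (ASCII)
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 2-byte sequence
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 3-byte sequence
        else                  { cp = lead & 0x07; extra = 3; }  // 4-byte sequence
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Each element of the result is a plain integer, so an edit-distance routine that accepts an integral element type could, in principle, compare these instead of raw bytes.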

tmplt, Oct 19 '19 12:10

I guess there's no easy way to get Unicode support (without switching std::string to a Unicode-aware type for the entire library). Thanks.

marty1885, Oct 20 '19 07:10

On 20-10-2019 00:09, Martin Chang wrote:

> I guess there's no easy way to get Unicode support (without switching std::string to a Unicode-aware type for the entire library). Thanks.

std::string can still be used (at least on Linux-based distros), but there will be a lot of intermediate std::wstring instances along with std::mbstowcs calls. It will likely take some hacks to interface with levenshtein.c too, unless that is reimplemented.
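For reference, a rough sketch of what one such std::string to std::wstring conversion could look like. `widen` is a hypothetical helper name, and the result depends on the active C locale: non-ASCII input only converts correctly after something like `std::setlocale(LC_ALL, "")` has selected a UTF-8 locale.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of the library): convert a multibyte string
// to std::wstring using the current C locale. Returns an empty string if the
// input is not valid in that locale.
std::wstring widen(const std::string& s) {
    // One wide character per input byte is a safe upper bound.
    std::wstring out(s.size(), L'\0');
    std::size_t n = std::mbstowcs(&out[0], s.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1)) {
        return std::wstring();  // invalid multibyte sequence
    }
    out.resize(n);  // shrink to the number of wide characters written
    return out;
}
```

Note that wchar_t is 16 bits wide on some platforms (Windows in particular), so characters outside the Basic Multilingual Plane would not fit in a single element there.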

tmplt, Oct 20 '19 12:10

I think the proper solution would be to use something like Qt's QString and re-implement levenshtein.c to support it. No hacks, and Qt is popular enough on *nix systems. (Windows will be a problem.)

marty1885, Oct 21 '19 04:10

On 20-10-2019 21:38, Martin Chang wrote:

> I think the proper solution would be to use something like Qt's QString and re-implement levenshtein.c to support it.

I don't want to pull in Qt just for string comparison; Levenshtein could instead be reimplemented via templates, letting the user decide what string type they want to use.
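Something along these lines, as a sketch only: a textbook two-row dynamic-programming edit distance, templated on the sequence type. `levenshtein_distance` is a hypothetical name and this is not the existing levenshtein.c algorithm; it only assumes the type offers size() and operator[] with ==-comparable elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical sketch, not the library's implementation: Levenshtein distance
// over any sequence type with size() and operator[].
template <typename Seq>
std::size_t levenshtein_distance(const Seq& a, const Seq& b) {
    // prev holds the distances for the previous row, curr for the current one.
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    std::iota(prev.begin(), prev.end(), std::size_t{0});  // row 0: 0, 1, 2, ...
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // deleting the first i elements of a
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

With std::u32string the comparison is per code point, while with std::string it is per byte, so a single accented character stored as UTF-8 can count as more than one edit in the latter case.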

tmplt, Oct 21 '19 13:10

> Levenshtein could instead be reimplemented

I'll look into that.

marty1885, Oct 22 '19 02:10