
Fix failing unit test for market name

Open mre opened this issue 5 years ago • 3 comments

In https://github.com/mre/receipt-parser/pull/23#issuecomment-497964607, @kiwita88 found the likely reason why our unit test for the market name 'p e n ny' fails. We should fix that.

mre avatar Jun 02 '19 23:06 mre

difflib is currently used for this buggy feature. Its diffing algorithm is based on Ratcliff-Obershelp and appears to be generic with regard to data type (binary data, strings, etc.). There are algorithms better suited to fuzzy string similarity, such as Levenshtein distance. I believe switching algorithms is the best solution here.

What do you think, @mre? I could be convinced to write up a PR if no other contributor can. If you're comfortable adding a dependency, it might make sense to lean on https://github.com/seatgeek/fuzzywuzzy for this too.
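For reference, a minimal sketch of how difflib scores a spaced-out market name against a clean one. The helper name `difflib_similarity` and the sample strings are assumptions for illustration, not code from receipt-parser:

```python
from difflib import SequenceMatcher


def difflib_similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings.

    Hypothetical helper for illustration; not part of receipt-parser.
    """
    # The first argument is the "junk" predicate; None means nothing is junk.
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # An OCR-style reading with stray spaces versus the expected market name.
    print(difflib_similarity("p e n ny", "penny"))
    # Identical strings always score 1.0.
    print(difflib_similarity("penny", "penny"))
```

The ratio is computed from the matching blocks difflib finds, so extra spaces lower the score even when every letter of the market name is present.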

rayrrr avatar Sep 07 '19 12:09 rayrrr

Update: the existing packages I mentioned are GPLv2-licensed, which may not be desirable, so perhaps a direct implementation of the Levenshtein algorithm could be added for this feature instead. Plenty of inspiration is available.
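A rough sketch of what such a direct implementation might look like, using the standard dynamic-programming formulation with two rows. The function names `levenshtein` and `levenshtein_similarity` are hypothetical, and normalizing by the longer string's length is just one possible way to turn the distance into a score:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between two strings.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`.
    """
    # Keep the shorter string in the inner loop to reduce memory use.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance to a 0.0 to 1.0 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this formulation, a reading such as 'p e n ny' is three deletions away from 'penny', so a threshold on the distance or on the normalized score could decide whether the market name matches.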

rayrrr avatar Sep 07 '19 14:09 rayrrr

Hey @rayrrr, thanks for your input. Yes, switching to Levenshtein would be worth a try. Whether we use a library or not doesn't matter to me. Also, GPLv2 is fine in my book. So if you like and you find the time, please go ahead and whip up a PR for this. 👍

mre avatar Sep 08 '19 03:09 mre