receipt-parser-legacy
Fix failing unit test for market name
In https://github.com/mre/receipt-parser/pull/23#issuecomment-497964607, @kiwita88 found the likely reason why our unit test for the market name 'p e n ny' fails. We should fix that.
difflib is what is currently in use for this buggy feature. The matching algorithm used by difflib is based on Ratcliff-Obershelp and seems to be generic with regard to data type (binary data, strings, etc.). There are algorithms better suited to fuzzy string similarity, such as Levenshtein distance. I believe switching algorithms is the best solution here.
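To see how difflib scores the spaced-out OCR text, here is a minimal sketch using `difflib.SequenceMatcher` from the standard library. The helper name `difflib_similarity` and the candidate market names are made up for illustration; this is not the project's actual matching code.

```python
from difflib import SequenceMatcher


def difflib_similarity(a: str, b: str) -> float:
    """Hypothetical helper: similarity ratio in [0, 1] as computed by difflib."""
    # isjunk=None means no characters are treated as junk, so spaces count.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Text as it might come out of OCR, compared against illustrative market names.
ocr_text = "p e n ny"
for market in ("penny", "rewe", "aldi"):
    print(market, round(difflib_similarity(ocr_text, market), 3))
```

Printing the ratios next to the cutoff the parser uses should show whether the extra spaces are what pushes the correct market below the threshold.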
What do you think, @mre? I could be convinced to write up a PR if no other contributor can. If you're comfortable adding a dependency, it might make sense to lean on https://github.com/seatgeek/fuzzywuzzy for this too.
Update: the existing packages I mentioned are GPLv2-licensed, which may not be desirable, so perhaps a direct implementation of the Levenshtein algorithm could be added for this feature instead. Plenty of inspiration is available.
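A dependency-free version could look roughly like the sketch below. It is the textbook dynamic-programming form of Levenshtein distance, keeping only two rows in memory. The function names `levenshtein` and `levenshtein_similarity` are hypothetical and not part of the existing code base.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Hypothetical helper: distance normalised into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("p e n ny", "penny"))
```

Normalising by the longer string's length is only one possible choice. Stripping whitespace from the OCR text before comparing might matter more than the choice of algorithm, so that would be worth testing as well.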
Hey @rayrr, thanks for your input. Yes, switching to Levenshtein would be worth a try. Whether we use a library or not doesn't matter to me. Also GPLv2 is fine in my book. So if you like and you find the time, please go ahead and whip up a PR for this. 👍