pycountry

search_fuzzy should support typos and slight variations

Open timrichardson opened this issue 4 years ago • 6 comments

In v19.8.18, searching for 'united state of america' is a miss (LookupError), which surprised me. It should return United States of America.

timrichardson avatar May 31 '20 06:05 timrichardson

Are you referring to the missing 's' in the search, or is this about capitalization?

ctheune avatar Jul 02 '20 10:07 ctheune

missing s

timrichardson avatar Jul 02 '20 13:07 timrichardson

I was thinking of adding scoring based on difflib from the standard library, but I haven't thought much about how this would fit with the existing 'fuzzy matches'. Do you consider the possible matches we currently get to be ranked? I ask because it is hard to score a proximity match with difflib in a way that fits the current order of results. Your existing code has some heuristics for matching that make particular sense for country names, yet my bug report is a bad miss. I think scoring with difflib is genuine fuzzy matching and should be a new method, which could then be tweaked with heuristics. That would keep full backwards compatibility: the current matching wouldn't change, and callers would opt in by calling the new method.
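
A rough sketch of the kind of scoring I mean, using `difflib.SequenceMatcher` from the standard library. The `NAMES` list, the `search_scored` function name, and the 0.8 cutoff are all made up for illustration; none of this is pycountry's API or its current matching logic.

```python
import difflib

# Hypothetical stand-in for the country names the library would supply.
NAMES = [
    "United States of America",
    "United Kingdom",
    "United Arab Emirates",
    "Tanzania, United Republic of",
]


def search_scored(query, names=NAMES, cutoff=0.8):
    """Return (score, name) pairs at or above cutoff, best match first.

    The score is SequenceMatcher.ratio(), a similarity in [0, 1].
    The cutoff value is an arbitrary assumption, not a tuned threshold.
    """
    q = query.casefold()
    scored = []
    for name in names:
        score = difflib.SequenceMatcher(None, q, name.casefold()).ratio()
        if score >= cutoff:
            scored.append((score, name))
    # Sort by score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


results = search_scored("united state of america")
```

Because this would be a separate method with its own ranking, it would not have to agree with the order of results the existing search returns.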

What do you think?

timrichardson avatar Jul 02 '20 13:07 timrichardson

Actually, Levenshtein or a similar distance would be helpful, but it is likely much more expensive to compute. We could take a look at https://stackoverflow.com/questions/20162894/alternative-to-levenshtein-and-trigram for example.
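
For reference, a plain-Python sketch of the Levenshtein distance using the usual two-row dynamic programming table. The function name is just for illustration. Each comparison costs O(len(a) * len(b)) time, and that would be paid once per candidate name, which is where the compute concern comes from.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string as b so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


distance = levenshtein("united state of america", "united states of america")
```

The query from the original report is a single edit away from the expected name, so a small distance threshold would catch it.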

ctheune avatar Jul 02 '20 16:07 ctheune

Or the 'any partial substring' search (as in Sublime Text, for example) might be useful. But that only compensates for missing characters, not for extra characters or characters in the wrong order.
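
A minimal sketch of that idea, assuming it means "the query's characters appear in the candidate in the same order, with gaps allowed". This is an approximation for illustration, not Sublime Text's actual algorithm, and the function name is hypothetical.

```python
def is_subsequence(query, candidate):
    """Return True if every character of query appears in candidate
    in the same order, possibly with other characters in between."""
    remaining = iter(candidate.casefold())
    # `ch in remaining` consumes the iterator up to the first match,
    # so each query character must be found after the previous one.
    return all(ch in remaining for ch in query.casefold())


missing_char = is_subsequence("united state of america", "United States of America")
extra_char = is_subsequence("united statess of america", "United States of America")
wrong_order = is_subsequence("untied states", "United States")
```

The first query only drops a character, so it matches; the other two add a character or swap two, and they do not.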

ctheune avatar Jul 02 '20 16:07 ctheune

It would be good to copy from https://github.com/life4/textdistance and https://github.com/jamesturk/jellyfish

BradKML avatar Oct 28 '21 15:10 BradKML