
Best approach when choices contain millions of strings

Open ElianoMarques opened this issue 7 years ago • 3 comments

Hi,

In a scenario where I'm doing:

process.extract("AJUNT", choices, limit=5)

but choices contains millions of records, would this still be the best approach?

Thank you, Eliano

ElianoMarques avatar Jul 06 '17 08:07 ElianoMarques

I would also be interested in finding the best approach to this. It seems like a good candidate for GPU parallelization via CUDA, and it could also be parallelized in Python with Dask, among many other approaches. With Dask, for example, the list of a million choices could be split into smaller lists, the extract computation run on each chunk in parallel, the per-chunk results aggregated into a new list, and a final ranking performed on that list. A rough sketch of this idea is below.
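Something along these lines, assuming dask and fuzzywuzzy are installed; the parallel_extract helper and the partition count of 8 are illustrative assumptions, not part of either library:

import dask.bag as db
from fuzzywuzzy import process

def parallel_extract(query, choices, limit=5, npartitions=8):
    # Split the choices into partitions that Dask can process in parallel.
    bag = db.from_sequence(choices, npartitions=npartitions)
    # Run process.extract over each partition, yielding the top `limit`
    # (choice, score) pairs per partition...
    per_chunk = bag.map_partitions(
        lambda part: process.extract(query, list(part), limit=limit)
    ).compute()
    # ...then merge the per-partition winners and keep the overall best.
    return sorted(per_chunk, key=lambda pair: pair[1], reverse=True)[:limit]

# Usage (millions_of_choices is a stand-in name):
# matches = parallel_extract("AJUNT", millions_of_choices, limit=5)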

Shellcat-Zero avatar Apr 24 '18 22:04 Shellcat-Zero

Any progress on this?

wielrenner avatar Oct 20 '18 09:10 wielrenner

@wielrenner I ended up using https://github.com/nullnull/simstring. Searches over a million strings take milliseconds. Standard NLP pre-processing is still appropriate, especially removing unnecessary tokens from the search string, since the longer the search string, the longer the search takes. The simstring results can then be refined with fuzzywuzzy's extractOne function, which performs well on a small candidate set. A sketch of that two-stage approach is below.
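A minimal sketch of the two-stage idea, based on the simstring package's documented API (DictDatabase, CharacterNgramFeatureExtractor, CosineMeasure, Searcher); the 2-gram feature size, the 0.6 threshold, and the two_stage_match helper are illustrative choices, not recommendations:

from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
from fuzzywuzzy import process

# Stand-in for a list of millions of strings.
choices = ["ajuntament", "ayuntamiento", "adjunct", "adjutant"]

# Index all choices once up front; simstring then searches the index quickly.
db = DictDatabase(CharacterNgramFeatureExtractor(2))  # 2-gram features (illustrative)
for choice in choices:
    db.add(choice)
searcher = Searcher(db, CosineMeasure())

def two_stage_match(query, threshold=0.6):
    # Stage 1: fast approximate search narrows millions down to a handful.
    candidates = searcher.search(query, threshold)
    if not candidates:
        return None
    # Stage 2: fuzzywuzzy re-ranks the small candidate set precisely.
    return process.extractOne(query, candidates)

print(two_stage_match("ajunt"))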

Otherwise, there is a highly scalable Levenshtein implementation discussed here: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Shellcat-Zero avatar Oct 20 '18 11:10 Shellcat-Zero