fuzzywuzzy
Best approach when choices contains millions of strings
Hi,
In a scenario where I'm doing:
process.extract("AJUNT", choices, limit=5)
If choices contains millions of records, would this still be the best approach?
Thank you, Eliano
I would also be interested in finding the best approach to this. It seems like a good candidate for GPU parallelization in CUDA, and it could also be parallelized in Python with Dask, among other approaches. For example, in Dask the list of a million choices could be split into smaller lists, the extractOne computation run on each chunk in parallel, the partial results aggregated into a new list, and a final extractOne run over that list.
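A rough sketch of that chunk-and-aggregate idea using dask.bag (the choices list, query, and partition count here are placeholders, and the overall structure is an assumption rather than anything fuzzywuzzy ships with):

from fuzzywuzzy import process
import dask.bag as db

def top_matches(partition, query="AJUNT", limit=5):
    # Ordinary fuzzywuzzy search restricted to one chunk of the choices
    return process.extract(query, list(partition), limit=limit)

# In practice this would be the list of ~1 million strings
choices = ["Ajuntament de Barcelona", "Ayuntamiento de Madrid", "Bordeaux"]

bag = db.from_sequence(choices, npartitions=16)

# Each partition yields its own top-5 list; compute() concatenates them
partial_results = bag.map_partitions(top_matches).compute()

# Final pass over the small aggregated list to get the overall top 5
best = sorted(partial_results, key=lambda pair: pair[1], reverse=True)[:5]
print(best)

This parallelizes the scoring but still scores every string on every query, so it only divides the runtime by the number of workers rather than changing the overall complexity.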
Any progress on this?
@wielrenner I ended up using https://github.com/nullnull/simstring. Searches over a million strings take milliseconds. Standard NLP pre-processing is still appropriate, especially removing unnecessary tokens from your search string, which will improve search times, since the longer the search string, the longer the search takes. The results from simstring can then be refined with fuzzywuzzy's extractOne, which performs well on the smaller candidate set.
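A minimal sketch of that two-stage setup, assuming the usage shown in the nullnull/simstring README (the module paths, similarity threshold, and sample data are assumptions to verify against the current docs):

from fuzzywuzzy import process
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

# In practice this would be the list of ~1 million strings
choices = ["Ajuntament de Barcelona", "Ayuntamiento de Madrid", "Bordeaux"]

# Build the character n-gram index once up front
db = DictDatabase(CharacterNgramFeatureExtractor(2))
for choice in choices:
    db.add(choice)

searcher = Searcher(db, CosineMeasure())

def fuzzy_lookup(query, threshold=0.5):
    # Stage 1: fast approximate search over the full index
    candidates = searcher.search(query, threshold)
    if not candidates:
        return None
    # Stage 2: slower but more accurate fuzzywuzzy ranking of the short candidate list
    return process.extractOne(query, candidates)

print(fuzzy_lookup("Ajuntament de Barcelon"))

The threshold trades recall for candidate-list size: lowering it passes more strings to extractOne, raising it keeps stage 2 fast but may drop the true match.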
Otherwise, there is a highly scalable Levenshtein implementation discussed here: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata