stringdist
stringdist copied to clipboard
Q-Gram Filtering
I was wondering if you've thought of including qgram filtering for edit distance in the stringdist package. Oftentimes users are only concerned with comparing strings that pass a certain similarity threshold, and qgram filtering allows them to do this significantly quicker than just calculating the levenstein distance on all the strings.
Nope, I haven't considered it yet, but I think it should be quite easy to implement by users on top of the existing functionality. Just compute the q-gram (or jaccard or cosine) distances and filter out candidates with too large a distance before computing an edit-like distance. This would linearize the code for the total distance calculation (stringdist runs multithreaded under the hood) but only for trivial operations.
I think that building it into for example the stringdist function would not make it a lot faster than that.
I'm still planning to write a vignette one day. I think this would be a nice example to add.
Yeah I was thinking more of a convenience function, because it's not an obvious concept for many users (or at least it took me a long time to grok). The only change in functionality would be if you wanted to include gapped qgrams, which are a bit better for filtering.
On Fri, Jun 12, 2015, 16:36 Mark van der Loo [email protected] wrote:
Nope, I haven't considered it yet, but I think it should be quite easy to implement by users on top of the existing functionality. Just compute the q-gram (or jaccard or cosine) distances and filter out candidates with too large a distance before computing an edit-like distance. This would linearize the code for the total distance calculation (stringdist runs multithreaded under the hood) but only for trivial operations.
I think that building it into for example the stringdist function would not make it a lot faster than that.
I'm still planning to write a vignette one day. I think this would be a nice example to add.
— Reply to this email directly or view it on GitHub https://github.com/markvanderloo/stringdist/issues/34#issuecomment-111596275 .