stringdist icon indicating copy to clipboard operation
stringdist copied to clipboard

Q-Gram Filtering

Open gshotwell opened this issue 10 years ago • 2 comments

I was wondering if you've thought of including qgram filtering for edit distance in the stringdist package. Oftentimes users are only concerned with comparing strings that pass a certain similarity threshold, and qgram filtering allows them to do this significantly quicker than just calculating the levenstein distance on all the strings.

gshotwell avatar Jun 12 '15 18:06 gshotwell

Nope, I haven't considered it yet, but I think it should be quite easy to implement by users on top of the existing functionality. Just compute the q-gram (or jaccard or cosine) distances and filter out candidates with too large a distance before computing an edit-like distance. This would linearize the code for the total distance calculation (stringdist runs multithreaded under the hood) but only for trivial operations.

I think that building it into for example the stringdist function would not make it a lot faster than that.

I'm still planning to write a vignette one day. I think this would be a nice example to add.

markvanderloo avatar Jun 12 '15 19:06 markvanderloo

Yeah I was thinking more of a convenience function, because it's not an obvious concept for many users (or at least it took me a long time to grok). The only change in functionality would be if you wanted to include gapped qgrams, which are a bit better for filtering.

On Fri, Jun 12, 2015, 16:36 Mark van der Loo [email protected] wrote:

Nope, I haven't considered it yet, but I think it should be quite easy to implement by users on top of the existing functionality. Just compute the q-gram (or jaccard or cosine) distances and filter out candidates with too large a distance before computing an edit-like distance. This would linearize the code for the total distance calculation (stringdist runs multithreaded under the hood) but only for trivial operations.

I think that building it into for example the stringdist function would not make it a lot faster than that.

I'm still planning to write a vignette one day. I think this would be a nice example to add.

— Reply to this email directly or view it on GitHub https://github.com/markvanderloo/stringdist/issues/34#issuecomment-111596275 .

gshotwell avatar Jun 12 '15 20:06 gshotwell