more documentation. package still maintained?

Open randomgambit opened this issue 8 years ago • 7 comments

Hi,

First of all, congratulations on this amazing package, which is way faster than fuzzymatch when dealing with large datasets of strings.

Do you have more documentation about the matching algorithm that is used here? In particular, I am matching whole sentences (not only words), such as "this is a sentence", and I wanted to know whether your default settings are appropriate in that case (ngrams=2, for instance).

How can I change them?

Many thanks for your help

randomgambit avatar Jul 27 '16 11:07 randomgambit

Hi @randomgambit, it seems nobody is giving feedback on this amazing package. I'm trying to use it, but there is no documentation. Can you tell me whether you found more information, or something similar to this? fuzzywuzzy has good features (although also poor documentation), but some efficiency issues have been mentioned.

iarroyof avatar May 10 '17 23:05 iarroyof

I can write up some more documentation if you care about it :)

axiak avatar Jan 24 '19 17:01 axiak

that would be great, thanks!

randomgambit avatar Feb 06 '19 14:02 randomgambit

I've been using your package and it is working very well for me. However, I'm afraid I don't completely understand how it works based on your description. Is there a primary published reference for this algorithm?

In particular, the passage

Then we create a list of any element in the set that has at least one occurrence of a trigram listed above. Note that this is just a dictionary lookup 5 times. For each of these matched elements, we compute the cosine similarity between each element and the query string. We then sort to get the most similar matched elements.

is not clear to me.

we create a list of any element in the set that has at least one occurrence of a trigram listed above

Is this a reference to the reference trigram (both the reference and the query are "listed above")?

For each of these matched elements, we compute the cosine similarity between each element and the query string.

Does "these matched elements" refer to the query or the reference? I think it only makes sense if you are talking about the cosine similarity between the reference trigram and the query string, but I could be wrong. In either case, if they match, won't the cosine similarity be perfect by definition? Additionally, you seem to be implying that you are comparing a string with 3 characters to a string with more characters. How do you calculate the cosine similarity of two strings of different lengths?

Based on the current description, I'm not seeing how you distinguish between different matches.

Thanks again for your efforts, and if these questions can be answered by a reference, please point me to it.

nodice73 avatar Mar 11 '19 23:03 nodice73

I don't have a paper, but it's inspired by fulltext search. In some circles you might see this called trigram or shingle indexing. @Glench wrote a wonderful intuitive description of how it works here: https://github.com/glench/fuzzyset

axiak avatar Mar 11 '19 23:03 axiak

Eh. I meant here: http://glench.github.io/fuzzyset.js/
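
For anyone who wants the idea in code, here is a minimal sketch of trigram indexing with cosine scoring. It is not the package's actual implementation: the names `TrigramIndex`, `grams`, and `cosine` are hypothetical, and the "-" padding and the fixed gram size of 3 are assumptions made for illustration. The trigrams are only used to look up candidate entries; the cosine similarity is then computed between the gram-count vectors of the whole query and each whole candidate, so strings of different lengths are not a problem in this sketch.

```python
import math
from collections import Counter, defaultdict


def grams(text, size=3):
    """Count overlapping character n-grams of a lowercased, padded string.

    The "-" padding is an assumption for this sketch.
    """
    padded = "-" + text.lower() + "-"
    return Counter(padded[i:i + size] for i in range(len(padded) - size + 1))


def cosine(a, b):
    """Cosine similarity between two gram-count vectors (Counters)."""
    dot = sum(count * b[g] for g, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TrigramIndex:
    """Hypothetical illustration of a gram-indexed fuzzy string set."""

    def __init__(self, entries=(), size=3):
        self.size = size
        self.vectors = {}              # entry -> gram-count vector
        self.index = defaultdict(set)  # gram -> entries containing it
        for entry in entries:
            self.add(entry)

    def add(self, entry):
        vec = grams(entry, self.size)
        self.vectors[entry] = vec
        for g in vec:
            self.index[g].add(entry)

    def get(self, query):
        qvec = grams(query, self.size)
        # One dictionary lookup per distinct gram of the query.
        candidates = set()
        for g in qvec:
            candidates |= self.index.get(g, set())
        # Score each whole candidate against the whole query, best first.
        scored = [(cosine(qvec, self.vectors[c]), c) for c in candidates]
        return sorted(scored, reverse=True)
```

With entries such as "michael axiak", "michael jordan", and "new york", calling `get("micael asiak")` only scores the two entries that share at least one trigram with the query, and the entry sharing more trigrams ranks first. Sharing one trigram makes an entry a candidate, but its score is below 1 unless the full gram vectors point in the same direction.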

axiak avatar Mar 11 '19 23:03 axiak

Thanks!

nodice73 avatar Mar 12 '19 00:03 nodice73