fuzzy_match
                                
                        Match only beginning of words needed
Hi,
We've recently used fuzzy_match and found out that it produces some pretty weird name matches. Instead of matching "art" to "Artem", it chooses "Karl".
Could you please add an option "Match only from the beginning of words", so "art" would prefer "artem" to "karl" ?
Thank you.
hey, i have some ideas for this - more soon.
hey guys,
I can't duplicate your problem, so I wrote a script that you can use to show me:
macbook:~/code/fuzzy_match (master) $ ./bin/fuzzy_match_checker check "https://docs.google.com/spreadsheet/pub?key=0AkCJNpm9Ks6JdHZURUI2S2xOa3ZFVzlZb205VVhpQnc&single=true&gid=0&output=csv" --show-success --downcase
Checking matches using fuzzy_match version 1.3.2...
  "art" => "artem"
  "carl" => "karl"
  "shamus" => "seamus"
  "art" => "artem"
  "knick" => "nick"
  "nicholas" => "nick"
  "old st nick" => "nick"
Correctly matched 7 needles.
macbook:~/code/fuzzy_match (master) $
As you can see, "art" is getting matched to "artem" not "karl".
In order to get the script, you should clone this repo - I added it in this commit and it's not available in the gem (yet).
Also, you can point the script to any CSV that resembles this one
Thanks!
Best, Seamus
PS. You could also just send me a failing test.
Here's the link to edit the CSV:
https://docs.google.com/spreadsheet/ccc?key=0AkCJNpm9Ks6JdHZURUI2S2xOa3ZFVzlZb205VVhpQnc
Not sure if you'll be able to re-publish (update), so you may have to create/publish your own doc.
Sorry, maybe I described the issue incorrectly:
Haystack: "Artyom Makarov", "Nick", "Karl"
Needle: "art"
This always returns "Karl" instead of Artyom.
OK, great, saw the problem, I'll try to fix:
$ ./bin/fuzzy_match_checker check "https://docs.google.com/spreadsheet/pub?key=0AkCJNpm9Ks6JdHZURUI2S2xOa3ZFVzlZb205VVhpQnc&single=true&gid=0&output=csv" --show-success --downcase
Checking matches using fuzzy_match version 1.3.2...
  "art" => "karl"
MISMATCH: "art" should match "artyom makarov"
hey, would you help me think through the options here?
- add :must_match_prefix option... it would take the shortest of two strings ("art") and only try a match if the longer one ("artyom") started with it
- allow fuzzy_match users to switch string similarity algorithm... in your case, comparing names, Jaro-Winkler might work better than Pair Distance...?
    > require 'amatch'
    => true
    ?> "art".pair_distance_similar("artyom makarov")
    => 0.3076923076923077
    ?> "art".pair_distance_similar("karl") # <- karl wins
    => 0.4
    ?> "art".jarowinkler_similar("artyom makarov") # <- artyom wins
    => 0.8166666666666667
    ?> "art".jarowinkler_similar("karl")
    => 0.7222222222222222
Neither option is very appealing... (1) seems like it would cause many false negatives. (2) offers users a choice that I don't think they should have to make.
Can you think of anything else?
I would suggest something like :match_words but that might be misleading though, what about :begin_with_match or :match_from_beginning?
Algorithm selection is another good suggestion, but it's a bit hard to apply, since I'd have to know how each algorithm matches words before choosing which one would work best. So while this is something that probably should be possible to change, my vote goes to a flag that favors matches at the beginning of words.
We have implemented a workaround for this issue, but I think this is still important.
How would :match_from_beginning work?
In my imaginary world it would greatly increase the weight if the needle matches characters at the beginning of the string or right after whitespace.
"like" should prefer "likeable" to "unlikeable", and since it's fuzzy match, "arma" should put "Artem Makarov" before "Karma".
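One way to picture that weighting, as a rough sketch only: score candidates with a character-pair similarity and add a fixed bonus when the needle can be read off the starts of the candidate's words in order. Everything here is an assumption made for illustration: `pair_similarity` is a plain-Ruby stand-in (a Dice coefficient over adjacent character pairs), `starts_words?`, `boosted_score`, and `best_match` are hypothetical names, and the bonus of 1.0 is arbitrary. None of it is fuzzy_match's actual scoring.

```ruby
# Stand-in similarity (assumption): Dice coefficient over adjacent
# character pairs, with pairs taken inside each whitespace-separated word.
def pair_similarity(a, b)
  pairs = lambda do |s|
    s.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
  end
  left = pairs.call(a)
  right = pairs.call(b)
  return 0.0 if left.empty? || right.empty?

  total = left.size + right.size
  shared = 0
  left.each do |pair|
    i = right.index(pair)
    next unless i
    right.delete_at(i) # each pair on the right can be matched only once
    shared += 1
  end
  2.0 * shared / total
end

# True when the needle can be consumed as prefixes of the words, in order
# (words may be skipped), e.g. "arma" => "ar" + "ma" for ["artem", "makarov"].
def starts_words?(needle, words)
  return true if needle.empty?
  return false if words.empty?

  first, *rest = words
  taken = (1..needle.length).any? do |n|
    first.start_with?(needle[0, n]) && starts_words?(needle[n..-1], rest)
  end
  taken || starts_words?(needle, rest)
end

WORD_START_BONUS = 1.0 # arbitrary illustrative weight

def boosted_score(needle, candidate)
  at_start = starts_words?(needle.downcase.delete(" "), candidate.downcase.split)
  pair_similarity(needle, candidate) + (at_start ? WORD_START_BONUS : 0.0)
end

def best_match(needle, haystack)
  haystack.max_by { |candidate| boosted_score(needle, candidate) }
end

puts best_match("art", ["Artyom Makarov", "Nick", "Karl"])
puts best_match("arma", ["Karma", "Artem Makarov"])
puts best_match("like", ["unlikeable", "likeable"])
```

The size of the bonus decides how aggressively word starts override plain similarity, so in a real option it would probably need tuning or a way to configure it.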
ok, that is officially kind of crazy. but i'm still intrigued.
what was your workaround?
Splitting user names to first and last name and matching separately. So instead of ["Artyom Makarov", "Karl", "Nick"] it searches ["Artyom Makarov", "Karl", "Nick", "Artyom", "Makarov"]. In that case "art" matches "Artyom".
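A minimal sketch of that workaround, under stated assumptions: `expand_haystack` and `find_name` are hypothetical names, and `pair_similarity` is a plain-Ruby stand-in for the gem's scoring (a Dice coefficient over adjacent character pairs), used here only so the example runs without any gems. The idea is to index each full name plus each of its words, match against all of them, and map the winner back to the full name.

```ruby
# Map every searchable variant (the full name plus each of its words)
# back to the full name it came from.
def expand_haystack(names)
  names.each_with_object({}) do |name, index|
    ([name] + name.split).each { |variant| index[variant] ||= name }
  end
end

# Stand-in similarity (assumption): Dice coefficient over adjacent
# character pairs, with pairs taken inside each whitespace-separated word.
def pair_similarity(a, b)
  pairs = lambda do |s|
    s.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
  end
  left = pairs.call(a)
  right = pairs.call(b)
  return 0.0 if left.empty? || right.empty?

  total = left.size + right.size
  shared = 0
  left.each do |pair|
    i = right.index(pair)
    next unless i
    right.delete_at(i) # each pair on the right can be matched only once
    shared += 1
  end
  2.0 * shared / total
end

# Pick the best-scoring variant, then return the full name it belongs to.
def find_name(needle, names)
  index = expand_haystack(names)
  best = index.keys.max_by { |variant| pair_similarity(needle, variant) }
  index[best]
end

names = ["Artyom Makarov", "Karl", "Nick"]
puts expand_haystack(names).keys.inspect
puts find_name("art", names)
```

Under this stand-in scoring, "art" scores higher against the short variant "Artyom" than against "Karl", because the extra character pairs from "Makarov" no longer dilute the score.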
Another solution might be a "match all letters" option. "Artyom" contains all the letters from "art", and yet it returns "Karl" instead.
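As a sketch of how a "match all letters" filter might behave, assuming it means "every letter of the needle occurs in the candidate, counting repeats and ignoring order": the name `contains_all_letters?` is hypothetical and this is only one possible reading of the idea, not an existing fuzzy_match option.

```ruby
# Hypothetical filter: true when every letter of the needle can be found in
# the candidate, counting repeated letters, ignoring case and order.
def contains_all_letters?(needle, candidate)
  available = Hash.new(0)
  candidate.downcase.each_char { |char| available[char] += 1 }

  needle.downcase.delete(" ").each_char.all? do |char|
    next false if available[char].zero?
    available[char] -= 1
    true
  end
end

haystack = ["Artyom Makarov", "Nick", "Karl"]
eligible = haystack.select { |name| contains_all_letters?("art", name) }
puts eligible.inspect

# A misspelled needle shows the cost: "shamus" has an "h" that "seamus" lacks.
puts contains_all_letters?("shamus", "seamus")
```

"Karl" drops out because it has no "t", but the same rule would also reject misspellings such as "shamus" for "seamus", so it trades away some of the fuzziness.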