fuzzy_match

Match only beginning of words needed

Open firedev opened this issue 13 years ago • 12 comments

Hi,

We've recently used fuzzy_match and found out that it produces some pretty weird name matches. Instead of matching "art" to "Artem", it chooses "Karl".

Could you please add an option "Match only from the beginning of words", so that "art" would prefer "artem" to "karl"?

Thank you.

firedev avatar Apr 01 '12 18:04 firedev

hey, i have some ideas for this - more soon.

seamusabshere avatar Apr 02 '12 14:04 seamusabshere

hey guys,

I can't duplicate your problem, so I wrote a script that you can use to show me:

macbook:~/code/fuzzy_match (master) $ ./bin/fuzzy_match_checker check "https://docs.google.com/spreadsheet/pub?key=0AkCJNpm9Ks6JdHZURUI2S2xOa3ZFVzlZb205VVhpQnc&single=true&gid=0&output=csv" --show-success --downcase
Checking matches using fuzzy_match version 1.3.2...
  "art" => "artem"
  "carl" => "karl"
  "shamus" => "seamus"
  "art" => "artem"
  "knick" => "nick"
  "nicholas" => "nick"
  "old st nick" => "nick"
Correctly matched 7 needles.
macbook:~/code/fuzzy_match (master) $

As you can see, "art" is getting matched to "artem" not "karl".

In order to get the script, you should clone this repo - I added it in this commit and it's not available in the gem (yet).

Also, you can point the script to any CSV that resembles this one.

Thanks!

Best, Seamus

PS. You could also just send me a failing test.

seamusabshere avatar Apr 03 '12 02:04 seamusabshere

Here's the link to edit the CSV:

https://docs.google.com/spreadsheet/ccc?key=0AkCJNpm9Ks6JdHZURUI2S2xOa3ZFVzlZb205VVhpQnc

Not sure if you'll be able to re-publish (update), so you may have to create/publish your own doc.

seamusabshere avatar Apr 03 '12 02:04 seamusabshere

Sorry, maybe I described the issue incorrectly:

Haystack: "Artyom Makarov", "Nick", "Karl"
Needle: "art"
Always returns "Karl" instead of Artyom.

firedev avatar Apr 04 '12 11:04 firedev

OK, great, saw the problem, I'll try to fix:

$ ./bin/fuzzy_match_checker check "https://docs.google.com/spreadsheet/pub?key=0AkCJNpm9Ks6JdHZURUI2S2xOa3ZFVzlZb205VVhpQnc&single=true&gid=0&output=csv" --show-success --downcase
Checking matches using fuzzy_match version 1.3.2...
  "art" => "karl"
MISMATCH: "art" should match "artyom makarov"

seamusabshere avatar Apr 04 '12 14:04 seamusabshere

hey, would you help me think through the options here?

  1. add :must_match_prefix option... it would take the shortest of two strings ("art") and only try a match if the longer one ("artyom") started with it

  2. allow fuzzy_match users to switch string similarity algorithm... in your case, comparing names, Jaro-Winkler might work better than Pair Distance...

    ?> require 'amatch'
    => true
    ?> "art".pair_distance_similar("artyom makarov")
    => 0.3076923076923077
    ?> "art".pair_distance_similar("karl") # <- karl wins
    => 0.4
    ?> "art".jarowinkler_similar("artyom makarov") # <- artyom wins
    => 0.8166666666666667
    ?> "art".jarowinkler_similar("karl")
    => 0.7222222222222222
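
To see why "karl" wins under pair distance, here is a plain-Ruby approximation of the score: a Dice coefficient over the character bigrams of each word. This is a sketch, not the amatch implementation, and the names `bigrams` and `pair_similarity` are made up for illustration. It does reproduce the two pair-distance scores quoted above.

```ruby
# Character bigrams of every whitespace-separated word,
# e.g. "karl" -> ["ka", "ar", "rl"]. Hypothetical helper.
def bigrams(str)
  str.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice coefficient over bigrams: 2 * shared / (pairs in a + pairs in b).
# Assumed stand-in for a pair-distance score, not the amatch code.
def pair_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?

  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    remaining.delete_at(idx) if idx # consume each matched pair only once
  end
  2.0 * shared / total
end

puts pair_similarity("art", "karl")           # 2 * 1 / (2 + 3)
puts pair_similarity("art", "artyom makarov") # 2 * 2 / (2 + 11)
```

"art" has only two bigrams ("ar", "rt"). The long record shares both of them but is penalised for its eleven bigrams, while "karl" shares one and has only three.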

Neither option is very appealing... (1) seems like it would cause many false negatives. (2) offers users a choice that I don't think they should have to make.
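
Option (1) could be sketched roughly like this. `prefix_match?` is a hypothetical helper, not something fuzzy_match provides, and the real option might behave differently. The last line shows the kind of false negative mentioned above.

```ruby
# Hypothetical check for a :must_match_prefix style option:
# take the shorter of the two strings and only allow a match
# if the longer one starts with it (case-insensitive).
def prefix_match?(a, b)
  shorter, longer = [a.downcase, b.downcase].sort_by(&:length)
  longer.start_with?(shorter)
end

haystack = ["Artyom Makarov", "Nick", "Karl"]

# Only records passing the prefix check would go on to fuzzy scoring.
candidates = haystack.select { |record| prefix_match?("art", record) }
p candidates

# False negative: a last name never matches, because it is not a prefix
# of the whole record.
p prefix_match?("makarov", "Artyom Makarov")
```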

Can you think of anything else?

seamusabshere avatar Apr 10 '12 01:04 seamusabshere

I would suggest something like :match_words, but that might be misleading. What about :begin_with_match or :match_from_beginning?

Algorithm selection is another good suggestion, but it's a bit hard to apply, since I'd have to know how each algorithm matches words before choosing the one that works best. So while this probably should be possible to change, my vote goes to a flag that favors matches at the beginning of words.

We have implemented a workaround for this issue, but I think this is important anyway.

firedev avatar Apr 10 '12 17:04 firedev

How would :match_from_beginning work?

seamusabshere avatar Apr 10 '12 18:04 seamusabshere

In my imaginary world, it would greatly increase the weight when the needle matches characters at the beginning of the string or right after whitespace.
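
One way to picture that weighting, as a rough sketch only: take some base similarity and add a fixed bonus when any word in the record starts with the needle. Everything here is an assumption for illustration: the bigram-based `pair_similarity` stands in for whatever score fuzzy_match computes, and `WORD_START_BONUS`, `score_from_beginning` and `best_match` are invented names.

```ruby
# Character bigrams per word (hypothetical helper).
def bigrams(str)
  str.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice coefficient over bigrams, used here as an assumed base score.
def pair_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  rest = pairs_b.dup
  shared = pairs_a.count { |pair| (i = rest.index(pair)) && rest.delete_at(i) }
  2.0 * shared / (pairs_a.size + pairs_b.size)
end

WORD_START_BONUS = 0.5 # arbitrary weight, chosen only for the example

# Base score plus a bonus when the needle starts the string
# or starts a word right after whitespace.
def score_from_beginning(needle, record)
  starts_word = record.downcase.split.any? { |word| word.start_with?(needle.downcase) }
  pair_similarity(needle, record) + (starts_word ? WORD_START_BONUS : 0.0)
end

def best_match(needle, haystack)
  haystack.max_by { |record| score_from_beginning(needle, record) }
end

p best_match("art", ["Artyom Makarov", "Nick", "Karl"])
p best_match("like", ["unlikeable", "likeable"])
```

This only rewards a whole needle that prefixes a single word. A needle spread across several words is not handled by the sketch.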

"like" should prefer "likeable" to "unlikeable", and since it's fuzzy match, "arma" should put "Artem Makarov" before "Karma".

firedev avatar Apr 10 '12 18:04 firedev

ok, that is officially kindof crazy. but i'm still intrigued.

what was your workaround?

seamusabshere avatar Apr 10 '12 19:04 seamusabshere

Splitting user names to first and last name and matching separately. So instead of ["Artyom Makarov", "Karl", "Nick"] it searches ["Artyom Makarov", "Karl", "Nick", "Artyom", "Makarov"]. In that case "art" matches "Artyom".
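
A minimal sketch of that workaround, assuming a bigram-based similarity in place of the real fuzzy_match scoring. `find_by_name_part` and the helpers are hypothetical names. Each record is indexed under its full name and under each separate word, and the best-scoring entry is mapped back to the original record.

```ruby
# Character bigrams per word (hypothetical helper).
def bigrams(str)
  str.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice coefficient over bigrams, an assumed stand-in for the gem's score.
def pair_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  rest = pairs_b.dup
  shared = pairs_a.count { |pair| (i = rest.index(pair)) && rest.delete_at(i) }
  2.0 * shared / (pairs_a.size + pairs_b.size)
end

# Search the full names plus every first/last name on its own,
# then return the record the winning part came from.
def find_by_name_part(needle, haystack)
  candidates = haystack.flat_map do |name|
    ([name] + name.split).uniq.map { |part| [part, name] }
  end
  _part, record = candidates.max_by { |part, _name| pair_similarity(needle, part) }
  record
end

p find_by_name_part("art", ["Artyom Makarov", "Karl", "Nick"])
```

With the names split, "art" is compared against "Artyom" alone, so the long record no longer dilutes the score.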

firedev avatar Apr 10 '12 20:04 firedev

Another solution might be a "match all letters" option. "Artyom" contains all the letters of "art", and yet the match returns "Karl" instead.
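
As a sketch of what such an option could do (the helper names are invented, and this is not an existing fuzzy_match feature): drop every record that does not contain all the letters of the needle before any fuzzy scoring happens.

```ruby
# Count how often each letter occurs, ignoring case and non-letters.
def letter_counts(str)
  str.downcase.scan(/[[:alpha:]]/).each_with_object(Hash.new(0)) do |ch, counts|
    counts[ch] += 1
  end
end

# True when the record has at least as many of each letter as the needle.
# Letter order is ignored in this simple version.
def contains_all_letters?(needle, record)
  available = letter_counts(record)
  letter_counts(needle).all? { |ch, n| available[ch] >= n }
end

haystack = ["Artyom Makarov", "Nick", "Karl"]
survivors = haystack.select { |record| contains_all_letters?("art", record) }
p survivors
```

"Karl" is filtered out because it has no "t", so it can no longer outrank "Artyom Makarov".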

firedev avatar Apr 10 '12 21:04 firedev