
Add extra tests to fuzzy match when there are equally similar titles for multiple matches

Open lizgzil opened this issue 5 years ago • 0 comments

Currently fuzzymatch returns only the first of the highest-scoring matches for each reference. So if two matches have the same cosine similarity, e.g. (made-up example):

         Reference id    uber_id         Title    title    Cosine_Similarity
    289  pub.1022910723  pub.1022910723  Typhoid  Typhoid  1.000000
    290  pub.1022910723  pub.1071482162  Typhoid  Typhoid  1.000000

we would only take the first in our output. However, it might be good at this point to check the other fields: are the authors the same (if they exist), the pub year, the journal...

This is not so easy: what if the authors are the same in one match, but the journal is the same in the other? We'll need to come up with some sort of hierarchy of which fields are most important to match.
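One way such a hierarchy could look, as a rough sketch rather than anything fuzzymatch does today: compare the tied candidates field by field in a fixed priority order and keep the one that agrees on the highest-priority field. The field names (`authors`, `year`, `journal`), their order, the helper names, and the sample records below are all assumptions made up for illustration.

```python
# Hypothetical priority order: earlier fields outrank later ones.
FIELD_PRIORITY = ("authors", "year", "journal")


def agreement_key(reference, candidate, fields=FIELD_PRIORITY):
    """Return a tuple of 0/1 flags, one per field, in priority order."""
    key = []
    for field in fields:
        ref_value = reference.get(field)
        cand_value = candidate.get(field)
        # A field missing on either side counts as "no evidence", not a match.
        key.append(int(ref_value is not None and ref_value == cand_value))
    return tuple(key)


def break_tie(reference, tied_candidates, fields=FIELD_PRIORITY):
    """Pick the tied candidate that agrees best with the reference.

    Tuples compare lexicographically, so agreeing on an earlier field
    always beats agreeing on any number of later fields. max() returns
    the first maximal item, so a complete tie falls back to the current
    behaviour of taking the first match.
    """
    return max(tied_candidates, key=lambda c: agreement_key(reference, c, fields))


# Made-up records: one candidate shares the journal, the other the authors.
reference = {"title": "Typhoid", "authors": "Smith J", "year": 2001, "journal": "Lancet"}
candidates = [
    {"uber_id": "pub.1022910723", "authors": None, "year": 1999, "journal": "Lancet"},
    {"uber_id": "pub.1071482162", "authors": "Smith J", "year": 1999, "journal": "BMJ"},
]
chosen = break_tie(reference, candidates)
print(chosen["uber_id"])
```

With authors ranked above journal, the second candidate wins here. Swapping the order in `FIELD_PRIORITY` would flip the result, which is exactly the decision we still need to make.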

To do: see how big a problem this even is! How often do duplicates come up where we have to make this call?
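A quick way to size the problem could be to count the references whose top score is shared by more than one candidate. The sketch below assumes the match output is available as a list of rows with the column names from the example above. The function name, the tolerance, and the third sample row are made up for illustration.

```python
import math
from collections import defaultdict


def tied_top_matches(rows, tol=1e-9):
    """Map each reference id to its tied top candidates (ties only)."""
    by_ref = defaultdict(list)
    for row in rows:
        by_ref[row["Reference id"]].append(row)

    tied = {}
    for ref_id, candidates in by_ref.items():
        best = max(c["Cosine_Similarity"] for c in candidates)
        # Compare with a small absolute tolerance to absorb float noise.
        top = [
            c["uber_id"]
            for c in candidates
            if math.isclose(c["Cosine_Similarity"], best, rel_tol=0.0, abs_tol=tol)
        ]
        if len(top) > 1:
            tied[ref_id] = top
    return tied


# The two rows from the example above, plus one made-up untied reference.
rows = [
    {"Reference id": "pub.1022910723", "uber_id": "pub.1022910723", "Cosine_Similarity": 1.0},
    {"Reference id": "pub.1022910723", "uber_id": "pub.1071482162", "Cosine_Similarity": 1.0},
    {"Reference id": "ref.example", "uber_id": "pub.example", "Cosine_Similarity": 0.9},
]
tied = tied_top_matches(rows)
n_refs = len({r["Reference id"] for r in rows})
print(f"{len(tied)} of {n_refs} references have a tied top match")
```

Running this over a real batch of matches would show whether ties are rare enough to ignore or common enough to justify the extra field checks.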

lizgzil, May 10 '19 14:05