Add extra tests to fuzzy match when there are equally similar titles for multiple matches
Currently fuzzymatch returns only the first of the highest-scoring matches for each reference.
So if there are two matches with the same cosine similarity, e.g. (made up example):

|     | Reference id | uber_id | Title | title | Cosine_Similarity |
| --- | --- | --- | --- | --- | --- |
| 289 | pub.1022910723 | pub.1022910723 | Typhoid | Typhoid | 1.000000 |
| 290 | pub.1022910723 | pub.1071482162 | Typhoid | Typhoid | 1.000000 |

we would only take the first in our output. However, it might be good at this point to check the other fields: are the authors the same (if they exist), the publication year, the journal?
This is not so easy: what if the authors match in one candidate, but the journal matches in the other? We'll need to come up with some sort of hierarchy of which fields are most important to match.
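
One way such a hierarchy could work is a lexicographic comparison: among the candidates tied on cosine similarity, prefer the one that agrees on the highest-priority field, and only look at lower-priority fields to break remaining ties. The sketch below is only an illustration, not what fuzzymatch does today. The field names (`authors`, `pub_year`, `journal`), their order, the dict-based record layout and the `break_tie` helper are all assumptions, and the data is made up.

```python
# Hypothetical priority order: earlier fields outrank later ones.
FIELD_PRIORITY = ("authors", "pub_year", "journal")


def _normalise(value):
    """Lower-case and strip a field value; treat None/empty as missing."""
    if value is None or value == "":
        return None
    return str(value).strip().lower()


def agreement_key(reference, candidate, priority=FIELD_PRIORITY):
    """Return a tuple of 0/1 flags, one per field, in priority order.

    Tuples compare lexicographically, so agreeing on an earlier field
    always beats agreeing on any number of later fields.
    """
    key = []
    for field in priority:
        ref_val = _normalise(reference.get(field))
        cand_val = _normalise(candidate.get(field))
        key.append(1 if ref_val is not None and ref_val == cand_val else 0)
    return tuple(key)


def break_tie(reference, candidates, priority=FIELD_PRIORITY, tol=1e-9):
    """Pick one candidate among those sharing the top cosine similarity."""
    best = max(c["cosine_similarity"] for c in candidates)
    tied = [c for c in candidates if abs(c["cosine_similarity"] - best) <= tol]
    # max() keeps the first candidate when keys are equal, so the current
    # "take the first" behaviour is preserved if no field helps.
    return max(tied, key=lambda c: agreement_key(reference, c, priority))


# Made-up example: the journal and year match in one candidate,
# the authors match in the other.
reference = {"title": "Typhoid", "authors": "Smith J",
             "pub_year": 1999, "journal": "Lancet"}
candidates = [
    {"uber_id": "pub.1022910723", "cosine_similarity": 1.0,
     "authors": None, "pub_year": 1999, "journal": "Lancet"},
    {"uber_id": "pub.1071482162", "cosine_similarity": 1.0,
     "authors": "Smith J", "pub_year": 2001, "journal": "BMJ"},
]
chosen = break_tie(reference, candidates)
print(chosen["uber_id"])
```

With authors ranked first, the second candidate wins even though the first agrees on two lower-priority fields. Whether that is the right call is exactly the open question here; a weighted score would be an alternative to a strict ordering.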
To do: see how big of a problem this even is! How often do tied matches come up where we have to make this call?
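
A rough way to size the problem would be to count, over the full list of candidate matches, how many references have more than one distinct `uber_id` at their top score. This is a minimal sketch that assumes the matches are available as `(reference_id, uber_id, cosine_similarity)` tuples; that layout and the `count_top_ties` helper are hypothetical, and the data is made up.

```python
from collections import defaultdict


def count_top_ties(matches, tol=1e-9):
    """Map each reference id to its tied top-scoring uber_ids.

    Only references with more than one distinct uber_id at the highest
    cosine similarity are included in the result.
    """
    by_ref = defaultdict(list)
    for reference_id, uber_id, score in matches:
        by_ref[reference_id].append((uber_id, score))

    tied = {}
    for reference_id, cands in by_ref.items():
        best = max(score for _, score in cands)
        top_ids = {uid for uid, score in cands if abs(score - best) <= tol}
        if len(top_ids) > 1:
            tied[reference_id] = sorted(top_ids)
    return tied


# Made-up data: ref_a has two equally good matches, ref_b has a clear winner.
matches = [
    ("ref_a", "pub.1022910723", 1.0),
    ("ref_a", "pub.1071482162", 1.0),
    ("ref_b", "pub.0000000001", 0.9),
    ("ref_b", "pub.0000000002", 0.7),
]
tied = count_top_ties(matches)
n_refs = len({reference_id for reference_id, _, _ in matches})
print(f"{len(tied)} of {n_refs} references have a tie at the top score")
```

If the share of tied references turns out to be tiny, the extra tie-breaking logic may not be worth it.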