correlate
Make it possible to work with normalized score values right from the start
It is often important to set a `minimum_score` in order to reduce the number of false positive matches. At the moment, this is not possible, since `minimum_score` works on the unnormalized score values, which are hard to guess without trying out a few match cases on your actual data.
It would be good to have a way to say "please work with normalized scores", so that minimum_score could be given as a value between 0-1, regardless of how the scores pan out. A value of e.g. 10% (=0.1) would then mean "drop matches with a confidence value of less than 10%".
I'm not sure whether this is easily possible with the logic used by the package, but it would make the package more usable with as-yet-unknown data distributions.
So, I have this implemented, and I'm staring at it deciding whether or not I like the API, and if I need to adjust the implementation. See what you think.
I added a new keyword-only parameter to the `Correlator.correlate()` method, `prenormalize=False`. If you set this to a true value, then correlate will normalize the values before applying `minimum_score`. So `minimum_score` is now a normalized value too, which is what you wanted.
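For illustration, usage would look something like this (the `prenormalize` keyword is the proposal under discussion, not a released feature, and the toy data here is made up):

```python
import correlate

c = correlate.Correlator()
# keys are hashable features; values are the objects we want matched
c.dataset_a.set_keys("the big sleep".split(), "movie: The Big Sleep")
c.dataset_b.set_keys("the big sleep 1946".split(), "file: big_sleep_1946.mkv")

# minimum_score is now interpreted on the normalized 0..1 scale:
# drop matches with less than 10% confidence.
result = c.correlate(minimum_score=0.1, prenormalize=True)
for match in result.matches:
    print(match.score, match.value_a, match.value_b)
```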
It's called `prenormalize` just to differentiate it from the `CorrelateResult.normalize()` method. I thought that having two things called `normalize` would be confusing. Since this happens earlier in the process, before `minimum_score` is applied--well! I went with that name.
There's one ugly wrinkle in the current implementation. You see, the order of operations--before making this change--looks like this (sketched in code below):

- find all matches with a nonzero score, including reusing values from datasets A and B multiple times
- apply `minimum_score` to truncate this list of all possible matches
- use the "match boiler" greedy algorithm to keep at most only one match using any value from either A or B
- return the matches
- optionally, normalize the scores on the matches
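Here's that order of operations as a self-contained toy, with made-up `Match` and `match_boiler` stand-ins (the real boiler is considerably smarter than this greedy sketch):

```python
from dataclasses import dataclass

@dataclass
class Match:
    value_a: str
    value_b: str
    score: float

def match_boiler(matches):
    # Toy greedy boiler: highest score first, each value used at most once.
    kept, seen_a, seen_b = [], set(), set()
    for m in sorted(matches, key=lambda m: m.score, reverse=True):
        if m.value_a not in seen_a and m.value_b not in seen_b:
            kept.append(m)
            seen_a.add(m.value_a)
            seen_b.add(m.value_b)
    return kept

def correlate_today(candidates, minimum_score=0.0):
    matches = [m for m in candidates if m.score > 0]            # step 1
    matches = [m for m in matches if m.score >= minimum_score]  # step 2
    return match_boiler(matches)                                # steps 3-4
```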
The problem is that, by adding a normalization step between the first and second steps above, we're normalizing based on a different set of matches. The match boiler almost always throws away lots of matches near the end of the list, which could include the lowest-scoring match--a match which, in that case, shouldn't have been used when normalizing. That won't throw the normalization off much. But if you ask for the scores to be normalized, you're expecting the top match to be 1.0 and the bottom match to be 0.0, and this would mean the bottom match often (nearly always?) ending up a couple epsilons above zero.
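A toy illustration of that wrinkle, using simple min-max normalization (I'm not claiming this is correlate's exact formula):

```python
def minmax(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

raw = [8.0, 5.0, 3.0, 1.0]       # every candidate match, pre-boiler
print(minmax(raw))               # [1.0, 0.571..., 0.285..., 0.0]

# If the boiler later discards the 1.0-raw candidate, the surviving
# bottom match sits at 0.285..., not the 0.0 the user expects.
survivors = minmax(raw)[:-1]
print(survivors)                 # [1.0, 0.571..., 0.285...]
```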
To fix that, I'd have to change the implementation when `prenormalize` was true to do something like this (again sketched below):

- find all matches with a nonzero score, including reusing values from A and B multiple times
- use the match boiler greedy algorithm to keep at most only one match using any value from either A or B
- normalize based on the remaining matches
- apply `minimum_score` to truncate the list of remaining matches
- return the matches
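Reusing the toy `Match` and `match_boiler` from the earlier sketch, the reordered pipeline would be roughly:

```python
def correlate_prenormalized(candidates, minimum_score=0.0):
    matches = [m for m in candidates if m.score > 0]   # step 1: find
    matches = match_boiler(matches)                    # step 2: boil first
    if matches:                                        # step 3: normalize
        lo = min(m.score for m in matches)
        hi = max(m.score for m in matches)
        span = (hi - lo) or 1.0                        # guard one-match case
        for m in matches:
            m.score = (m.score - lo) / span
    return [m for m in matches if m.score >= minimum_score]  # steps 4-5
```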
Two downsides. First, it gives more work to the match boiler, which can be expensive. But this extra work is identical to specifying `minimum_score=0`, so it's probably not a big deal. Second, it means a slightly different sequence of events when `prenormalize` is on; we apply the match boiler before applying `minimum_score`. I don't know how surprising that is, or if it's a big deal. I mention it simply because it is a difference. But then again, the point of the API is to change behavior. So this is a minor, perhaps unexpected change in behavior accompanying the desired deliberate change in behavior. It's probably fine.
I am not sure I understand your normalization approach. Wouldn't it be possible to use a normalized score to begin with? That way, you could probably remove the whole normalization logic from the algorithm itself.
Some background: In the tool where I'm trying to replace my matching logic with yours, I have a score function which uses edit distance values (among other things) to come up with a value describing "how far is this sample text away from the reference text". Since the edit distance is bounded by the length of the strings, I can calculate a normalized score with values between 0..1, without any additional normalization step applied in the algorithm.
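That is, something along these lines (a sketch of the idea only; the real function, as described above, mixes in other signals besides edit distance):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(sample: str, reference: str) -> float:
    # Edit distance is bounded by the longer string's length,
    # so this is already in 0..1 -- no extra normalization pass.
    longest = max(len(sample), len(reference)) or 1
    return 1.0 - levenshtein(sample, reference) / longest
```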
I then calculate the score of the sample text against all reference texts, sort them, and take the highest score to determine the potential match. If the highest score is below the `minimum_score`, I record a non-match. If above, I record a match.
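The matching step is then just an argmax plus a threshold (using the hypothetical `similarity()` from the sketch above):

```python
def best_match(sample, references, minimum_score=0.1):
    # One-to-many: score the sample against every reference text.
    scored = [(similarity(sample, ref), ref) for ref in references]
    score, ref = max(scored)
    # Below the cut-off: record a non-match.
    return (ref, score) if score >= minimum_score else (None, score)
```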
Your normalization is a little different, it seems: you determine the normalization after matching, so the mapping between the original score and the normalized value depends on the distribution of matches in your data sets. This can pose a problem if the scores in your matching set are low in general. A `minimum_score` of 0.1 would then still cut away the lowest-scoring matches, but it could no longer be regarded as a measure of the (minimum) quality of the matches in the matching set. It would just mean "cut away matches with a score lower than 1/10 of the highest match score you found", which is probably not what the user expects.
Note: The above is mostly academic in my particular use case, since matches typically have high scores (I only rarely have missing data in the reference texts, which could cause a low score) and correlate performs well even without normalization, leaving `minimum_score` set to 0.
In the above, you critiqued my approach to normalizing scores, and you described how you normalize scores in a different algorithm. You asked "is there a way to pre-normalize the scores?" but I don't see where you suggested anything specific.
At the moment I can only think of two approaches to "pre-normalize" the score, what I will call "global" and "local" approaches. I don't think either is an improvement.
In the "global" approach, correlate would remember the value in either A or B with the highest possible score--what would the score be if 100% of the keys matched?--then normalize all scores based on the lower of these two. (This would be the theoretical highest possible score for any match.) Since the purpose of correlate is to find matches between imperfectly-matched corpuses, in most use cases the highest actual score observed would get nowhere near 1.0 under this approach, making it surprising to the user that this was called a "normalized" score. "Normalized" scores using this approach would range from 0 to some random number less than 1. I don't see how this is an improvement over the existing non-normalized scores, where scores range from 0 to some unbounded random number. Both are arbitrary ranges, and my goal with the "normalize" feature was to adjust the range so it always went from 0 to 1.
(Maybe you would luck out and this would produce a genuinely normalized score distribution. You seem to be lucky, in that you suggest your data is unusually clean. For the use cases where I really need correlate I rarely get nice high scores.)
In the "local" approach, for every match, instead of considering its absolute score, we would consider its relative score. Correlate would calculate the maximum possible score for the match--again, calculate the highest possible score for each of the values from A and B, and use the lower of the two--and divide the actual score by that value to "normalize" the score. This scoring approach defeats one of the features of correlate's scoring approach, where matches between rare keys are scored higher than matches between common keys. So this approach would greatly reduce the quality of correlate's matching. Also, perfect matches in this scenario are more likely than in the "global" approach--but still not very likely. So this too would likely result in the maximum scored match being a random number less than 1, again making it surprising that this is called "normalization".
I'm happy to consider any concrete backwards-compatible proposals on how to improve correlate's scoring for your use case. But since you seem to find my existing proposal--letting `minimum_score` apply to normalized scores--unsatisfactory, I won't continue to work on it. I don't need it, and so far you're the only person making this request.
My suggestion was to consider replacing the scoring function with a normalized one, in order to avoid having to guess a `minimum_score` value and to allow for automation when running against sets of data with unknown distributions and properties.
It's entirely possible that the way your scoring works simply doesn't allow for this, which would make the proposal moot. Looks like I'll have to dive into the code to make that more concrete :-)
I genuinely wish you the best of luck! I'd love to get contributors to correlate. It's as simple as I could make it while keeping it performant and less-memory-intensive... which, sadly, is "not very".
Also, I can't imagine a normalization approach that would make it viable to have a one-size-fits-all minimum score. Consider: one correlation might be between two perfectly-matched corpuses, where there are no bad matches. Another might be matches between two non-overlapping corpuses where every match is wrong. I don't see how you could have a normalized scoring algorithm that lets you correctly apply the same minimum score to both datasets--much less while simultaneously preserving correlate's current excellent ability at finding good matches.
Finally: your original feature request--via email, not so far in a GitHub issue--was to allow minimum score to apply to normalized scores. If you're willing to accept my definition of "normalized", that's a feature I can absolutely do.
In my tool, I'm using a scoring function which takes two strings as input and returns a value between 0..1 which can be interpreted as confidence value of the match. This is completely independent of the datasets you may be looking at.
Using a cut-off value of e.g. 0.1 means that confidence in the match is too low for it to be taken into account. This doesn't imply perfection, though; the function may return 0.2 for a factual non-match. To work around such false positives, you'd have to apply more complex scoring algorithms with a lot more context knowledge, but that would at the same time limit the usefulness for the general case. It's a tradeoff.
There's another difference between correlate and my approach. correlate looks at whole datasets, while my tool takes a one-to-many approach (comparing a single input against a reference set and finding the match with the highest score). It doesn't take order of input strings into account, but it does look at the order of the input characters and words via the edit distance approach.
I'm not saying my tool is better - much to the contrary - but I believe that correlate could perhaps use a similar, possibly even exchangeable, scoring function with normalized output, to enable use with the kind of arbitrary datasets you often find in the wild.
Typical use cases from my work experience are e.g. names of companies used in trading data, names of people or street addresses in PII data, etc. Dirty data in general, which requires cleanup and normalization to be useful for statistics, reporting, as input for machine learning, etc.
Consider two possible matches, which I am making up out of whole cloth right now. Assume that these are only two matches out of six hundred.
In match alpha, value A has 8 exact keys, value B has 6 exact keys, and 4 of them match. Based on how frequently the keys are used, they yield scores of 1/9, 1/18, 1/35, and 1/66. They also each have a fuzzy key which yields a raw score of 0.6; based on the cumulative score those keys contribute to the datasets, the adjusted score is 0.6/8.77.
In match beta, value A has 10 exact keys, value B has 4 exact keys, and 4 of them match. The adjusted scores are 1/12, 1/28, 1/36, and 1/96. They also have a fuzzy key which yields a raw score of 0.35 and an adjusted score of 0.35/9.26.
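Tallied up (assuming, purely for illustration, that the per-key contributions simply sum):

```python
alpha = sum([1/9, 1/18, 1/35, 1/66]) + 0.6 / 8.77
beta  = sum([1/12, 1/28, 1/36, 1/96]) + 0.35 / 9.26
print(round(alpha, 3))   # 0.279
print(round(beta, 3))    # 0.195
```

Every denominator there depends on key frequencies across the entire datasets, which is the crux of the question below: add or remove data and the "same" match scores differently.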
How would you propose changing the scoring algorithm to normalize these scores? Can you really do it in isolation, where, if data was added or removed from the datasets, the normalized scores would not change?
I'm going to close this issue. To the extent that I understand what you're proposing, I don't think it can actually work. I'm still amenable to discussing it, but first you would have to propose a specific algorithm for pre-normalizing scores. You can reopen this issue or create a new issue as you think is best.