dedupe
Use cross-validation to choose the threshold value
Our current approach uses the predicted probability of matching to estimate a good threshold. If those predictions are poor, the threshold and everything downstream of it suffer.
We should use a cross-validation approach based on the training data. We should block and cluster all the data, and then use the labels from the training pairs (and only those labels) to evaluate a threshold.
If we depend on training data, then we can't use the threshold method for the static classes. This actually makes sense: given a trained model, the output of the threshold method should be deterministic.
One challenge here is that the training data does not have unique ids.
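A rough sketch of the evaluation step, to make the proposal concrete. It is not dedupe's API: `f_score`, `best_threshold`, and the `cluster_at` callable are hypothetical names, and it assumes the id problem above is already solved, so that each labeled pair is `(id_a, id_b, is_match)` and clustering at a threshold gives a mapping from record id to cluster id. It only scores candidate thresholds against the labeled pairs; splitting the labels into folds would be a further step.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

# Hypothetical shape for a labeled training pair: (id_a, id_b, is_match).
LabeledPair = Tuple[Hashable, Hashable, bool]


def f_score(cluster_of: Dict[Hashable, Hashable],
            labeled_pairs: Iterable[LabeledPair],
            beta: float = 1.0) -> float:
    """Score a clustering using only the labeled pairs.

    A pair counts as a predicted match when both records were
    assigned to the same cluster.
    """
    tp = fp = fn = 0
    for id_a, id_b, is_match in labeled_pairs:
        cluster_a = cluster_of.get(id_a)
        cluster_b = cluster_of.get(id_b)
        predicted = cluster_a is not None and cluster_a == cluster_b
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(cluster_at: Callable[[float], Dict[Hashable, Hashable]],
                   labeled_pairs: Iterable[LabeledPair],
                   candidates: Iterable[float]) -> float:
    """Return the candidate threshold whose clustering scores highest.

    `cluster_at` is an assumed callable that blocks and clusters all the
    data at a given threshold and returns {record_id: cluster_id}.
    """
    pairs = list(labeled_pairs)
    return max(candidates, key=lambda t: f_score(cluster_at(t), pairs))
```

Since this depends only on the trained model, the data, and the labels, the result is deterministic for a fixed set of candidate thresholds.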
/sub