dedupe

Use cross validation for choosing threshold value

Open fgregg opened this issue 9 years ago • 2 comments

Our current approach uses the predicted probability of matching to estimate a good threshold. If our predictions are not good then everything suffers.

We should use a cross-validation approach, using the training data. We should block and cluster all the data and then use the labels from the training pairs (and only those labels) to evaluate a threshold.

If we depend on training data, then we can't use the threshold method for static classes. This actually makes sense since, given a trained model, the output of the threshold method should be deterministic.
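To make the proposal concrete, here is a rough sketch of how a threshold could be scored against the labeled pairs alone. It is not dedupe's API: `score_threshold`, `choose_threshold`, and the `clusterings` mapping are hypothetical names, and the sketch assumes every labeled record can be tied to a record id in the clustered data, which may not hold for the training data as it is stored today.

```python
def score_threshold(cluster_of, match_pairs, distinct_pairs, beta=1.0):
    """F-beta score of one clustering, judged only on labeled pairs.

    cluster_of: dict of record id -> cluster id (hypothetical structure;
        records missing from the dict are treated as singletons).
    match_pairs / distinct_pairs: lists of (id_a, id_b) labeled by a person.
    """
    def same_cluster(a, b):
        ca, cb = cluster_of.get(a), cluster_of.get(b)
        return ca is not None and ca == cb

    # Labeled matches that ended up together / apart.
    tp = sum(same_cluster(a, b) for a, b in match_pairs)
    fn = len(match_pairs) - tp
    # Labeled non-matches that were wrongly clustered together.
    fp = sum(same_cluster(a, b) for a, b in distinct_pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def choose_threshold(clusterings, match_pairs, distinct_pairs, beta=1.0):
    """Pick the candidate threshold whose clustering scores best.

    clusterings: dict of threshold -> cluster_of mapping, assumed to be
        produced beforehand by blocking and clustering all the data.
    """
    return max(
        clusterings,
        key=lambda t: score_threshold(
            clusterings[t], match_pairs, distinct_pairs, beta
        ),
    )


# Toy data, invented purely for illustration.
match_pairs = [(1, 2), (3, 4)]
distinct_pairs = [(1, 3)]
clusterings = {
    0.3: {1: "a", 2: "a", 3: "a", 4: "a"},  # everything merged
    0.5: {1: "a", 2: "a", 3: "b", 4: "b"},  # two clean clusters
    0.9: {1: "a", 2: "a"},                  # 3 and 4 left as singletons
}
best = choose_threshold(clusterings, match_pairs, distinct_pairs)
```

Because the score depends only on the trained model's clusterings and the fixed labels, repeated runs on the same inputs give the same threshold.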

Aug 20 '15 13:08 fgregg

One challenge here is that the training data does not have unique ids.

Aug 20 '15 13:08 fgregg

/sub

Sep 02 '15 19:09 webmaven