dedupe: Use cross validation for choosing threshold value

Open fgregg opened this issue 9 years ago • 2 comments

Our current approach uses the predicted probability of matching to estimate a good threshold. If those predictions are not good, then the threshold, and everything downstream of it, suffers.

We should use a cross-validation approach, using the training data. We should block and cluster all the data and then use the labels from the training pairs (and only those labels) to evaluate a threshold.
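A rough sketch of the threshold evaluation step described above, not dedupe's actual implementation. It assumes every record has an id, that pair scores are already available as `(id_a, id_b, score)` tuples, and it uses plain connected components as a stand-in for the real clustering. The names `cluster_ids`, `f_score`, and `best_threshold` are hypothetical.

```python
def cluster_ids(scored_pairs, threshold):
    """Link pairs scoring >= threshold; return {record_id: cluster_root}.

    Connected components via union-find, used here only as a simple
    stand-in for whatever clustering the library really performs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b
    return {record: find(record) for record in list(parent)}


def f_score(scored_pairs, labeled_pairs, threshold, beta=1.0):
    """Score a threshold using only the labeled training pairs."""
    clusters = cluster_ids(scored_pairs, threshold)
    tp = fp = fn = 0
    for a, b, is_match in labeled_pairs:
        # Records never seen in a scored pair count as singletons.
        same = a in clusters and b in clusters and clusters[a] == clusters[b]
        if same and is_match:
            tp += 1
        elif same:
            fp += 1
        elif is_match:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


def best_threshold(scored_pairs, labeled_pairs, candidates):
    """Pick the candidate threshold with the highest F-score."""
    return max(candidates, key=lambda t: f_score(scored_pairs, labeled_pairs, t))
```

For a proper cross-validation the model would also be refit on each fold before scoring the held-out labeled pairs; the sketch only covers how a single threshold is judged against the labels.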

If we depend on training data, then we can't use the threshold method for static classes. This actually makes sense, since given a trained model, the output of the threshold method should be deterministic.

Aug 20 '15 13:08 fgregg

One challenge here is that the training data does not have unique ids.
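One possible workaround, offered only as a sketch and not something settled in this thread: derive a key from each record's contents so that training pairs can be looked up in the clustered output. The layout of `training` (a dict with `"match"` and `"distinct"` lists of record pairs) is an assumption, and `record_key` and `key_labeled_pairs` are hypothetical helpers.

```python
import hashlib
import json


def record_key(record):
    """Content-derived key: hash of the record's fields in sorted order."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def key_labeled_pairs(training):
    """Turn labeled record pairs into (key_a, key_b, is_match) tuples.

    Assumes training looks like {"match": [...], "distinct": [...]},
    where each entry is a pair of record dicts.
    """
    labeled = []
    for label, is_match in (("match", True), ("distinct", False)):
        for first, second in training.get(label, []):
            labeled.append((record_key(first), record_key(second), is_match))
    return labeled
```

The obvious caveat is that two records with identical field values get the same key, so exact duplicates in the data cannot be told apart this way.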

Aug 20 '15 13:08 fgregg

/sub

Sep 02 '15 19:09 webmaven
