pyDVL Use scorer's default value for all utility computations

Open mdbenito opened this issue 1 year ago • 2 comments

We don't always use the default value of the Scorer. For instance, when iterating over a permutation, we might want to do:

prev_score = u({})

There is a similar hardcoded choice in Utility._utility(), where we always return 0.0 for an empty index set. To be consistent, we should return self.scorer.default instead. However, this would break user code, because the previous default value of scorer.default was np.nan, not 0.0.
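To make the proposal concrete, here is a minimal sketch of a utility that takes its empty-set value from the scorer instead of a hardcoded 0.0. The names `ToyScorer`, `ToyUtility` and `permutation_marginals` are hypothetical stand-ins written for this illustration, not pyDVL's actual classes or signatures.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Sequence


@dataclass
class ToyScorer:
    """Hypothetical scorer: a scoring function plus a default value."""

    score: Callable[[FrozenSet[int]], float]
    default: float = math.nan  # mirrors the np.nan default mentioned above


class ToyUtility:
    """Hypothetical utility that defers to the scorer for the empty set."""

    def __init__(self, scorer: ToyScorer) -> None:
        self.scorer = scorer

    def __call__(self, indices: Iterable[int]) -> float:
        idx = frozenset(indices)
        if not idx:
            # Single source of truth: no hardcoded 0.0 here.
            return self.scorer.default
        return self.scorer.score(idx)


def permutation_marginals(u: ToyUtility, permutation: Sequence[int]) -> Dict[int, float]:
    """Marginal contribution of each index along one permutation."""
    prev_score = u(())  # empty-set value comes from the scorer
    seen = []
    marginals = {}
    for i in permutation:
        seen.append(i)
        score = u(seen)
        marginals[i] = score - prev_score
        prev_score = score
    return marginals
```

Note that with a NaN default the first marginal of every permutation becomes NaN, which is one way the change could surface in user code.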

We could remove the default assignment and force the user to select one.

As to the choice of 0.0, we could switch to the score of a constant regressor (using the mean of the data) or a classifier predicting frequencies in training data.
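As a rough sketch of what such baseline values could look like, the two functions below compute the R² of a mean-predicting regressor and the accuracy of a majority-class classifier. Both function names are made up for this illustration, and the majority-class rule is a simplification of "predicting frequencies"; nothing here is pyDVL API.

```python
from collections import Counter
from statistics import fmean
from typing import Hashable, Sequence


def constant_regressor_r2(y_train: Sequence[float], y_test: Sequence[float]) -> float:
    """R^2 on y_test of a model that always predicts the mean of y_train.

    Assumes y_test is not constant, otherwise the denominator is zero.
    """
    prediction = fmean(y_train)
    test_mean = fmean(y_test)
    ss_res = sum((y - prediction) ** 2 for y in y_test)
    ss_tot = sum((y - test_mean) ** 2 for y in y_test)
    return 1.0 - ss_res / ss_tot


def majority_class_accuracy(y_train: Sequence[Hashable], y_test: Sequence[Hashable]) -> float:
    """Accuracy on y_test of a model that always predicts the most
    frequent label seen in y_train."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return sum(y == majority_label for y in y_test) / len(y_test)
```

For example, with a (0.7, 0.3) label split in both training and test data, the majority-class baseline has accuracy 0.7, so the empty-set value would be 0.7 rather than 0.0.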

Originally posted by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/issues/346#issuecomment-1528779970

May 11 '23 09:05 mdbenito

@mdbenito In the case of classification (scorer: predictive accuracy) with a label distribution of (0.7, 0.3), it is not really clear which default value to use. By assuming it is 0, we actually assume that the negation of the model (in binary classification) is 100% correct.

May 29 '23 21:05 kosmitive

@kosmitive I'm not sure I follow. The score doesn't bear any relation to the label distribution. A score of 0 for the empty set is just a convention. What does it mean for a model to be trained on the empty set? We could use the expected value of the score of the model being randomly initialised, with the expectation taken under some assumption for the distribution of its parameters, but that cannot be done in a general way.

Taking a constant regressor or classifier could also be a bad idea: training on one sample is definitely going to lead to a worse result than a properly chosen constant model (e.g. mean of the data for regression or label frequencies for classification), which means that the marginal utility u({i})-u({}) will always be negative. We should think of the implications this has.
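A toy illustration of the sign issue, under strong assumptions: the score is negative mean squared error, it is evaluated on the same targets the baseline mean is computed from, and a model "trained" on a single sample simply predicts that sample's label. The function names are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def neg_mse(prediction: float, targets: Sequence[float]) -> float:
    """Negative mean squared error of a constant prediction."""
    return -fmean([(t - prediction) ** 2 for t in targets])


def single_sample_marginals(targets: Sequence[float]) -> List[float]:
    """u({i}) - u({}) for every i, with u({}) set to the score of the
    mean predictor and u({i}) to the score of predicting targets[i]."""
    baseline = neg_mse(fmean(targets), targets)
    return [neg_mse(y_i, targets) - baseline for y_i in targets]


marginals = single_sample_marginals([1.0, 2.0, 3.0, 7.0])
# No sample coincides with the mean (3.25), so every marginal is negative.
```

In this setup the mean minimises the squared error among constant predictions, so no single sample can beat the baseline and every first-step marginal is at most zero.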

Aug 19 '23 15:08 mdbenito
