
The f-measure is ill-defined when there are no true positives or no positive predictions

Open timokau opened this issue 4 years ago • 3 comments

sklearn issues a warning during the tests:

sklearn.exceptions.UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in samples with no predicted labels.

This is because

  • some of the test samples generated in csrank/tests/test_choice_functions.py:trivial_choice_problem have no true positives
  • some of the learners predict no positives for some of the generated problems

In both of those cases the f-measure is not properly defined. sklearn assigns 0 and 1 respectively.

How should we deal with this? The metric should be defined for these cases; assigning 0 and 1 respectively seems somewhat reasonable, so maybe we should just silence the warning?
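
For reference, a minimal stand-alone snippet (not from the original report; it assumes the metric in question is scikit-learn's multilabel f1_score with average="samples") that reproduces the warning and shows one way to silence it:

    import warnings
    import numpy as np
    from sklearn.exceptions import UndefinedMetricWarning
    from sklearn.metrics import f1_score

    y_true = np.array([[1, 0], [0, 1]])  # second sample has a true positive
    y_pred = np.array([[0, 0], [0, 1]])  # first sample predicts no labels at all

    # Emits "F-score is ill-defined and being set to 0.0 in samples with no
    # predicted labels." and averages the per-sample scores anyway.
    print(f1_score(y_true, y_pred, average="samples"))  # 0.5

    # Silencing the warning, if that is the route we choose:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=UndefinedMetricWarning)
        print(f1_score(y_true, y_pred, average="samples"))  # 0.5

Newer scikit-learn releases also accept a zero_division argument on f1_score, which fixes the value used in these cases and suppresses the warning.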

timokau avatar Nov 14 '19 20:11 timokau

We should avoid the first problem by generating test samples that cannot consist of only negatives. Assigning a 1 in these cases would be sensible in general.
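
A rough sketch of how the generator could guarantee at least one positive per instance (the function name and signature are illustrative, not csrank's actual API):

    import numpy as np

    def sample_choice_labels(n_instances, n_objects, rng=None):
        """Random boolean choice labels with at least one positive per instance."""
        rng = np.random.default_rng(rng)
        labels = rng.random((n_instances, n_objects)) < 0.5
        # Force a positive into rows that came out all-negative, so recall
        # (and hence the F-measure) is always well-defined.
        empty = ~labels.any(axis=1)
        labels[empty, rng.integers(0, n_objects, size=int(empty.sum()))] = True
        return labels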

Regarding the second case: assigning 0 here is sensible, since the learner achieved no true positives.

Note: My version of sklearn (0.20.2) returns 0.0 for both cases.
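
A quick check against sklearn.metrics.f1_score confirms this (illustrative snippet, binary labels for brevity):

    from sklearn.metrics import f1_score

    # No positives in the ground truth, but some predicted positives:
    print(f1_score([0, 0], [1, 1]))  # 0.0, with an UndefinedMetricWarning

    # Positives in the ground truth, but no predicted positives:
    print(f1_score([1, 1], [0, 0]))  # 0.0, with an UndefinedMetricWarning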

kiudee avatar Nov 18 '19 08:11 kiudee

You're right, sklearn returns 0.0 for both cases. The more I think about this, the less sure I am that defining values for these cases is a good idea. The implementation is also not straightforward, since we would have to do some of the work that we currently outsource to sklearn.

Here are the tests I came up with:

    (1) There are no true positives but some predicted positives: recall is
    undefined (0/0, "infinite recall") and precision is zero.
    >>> f1_measure([[False, False]], [[True, True]])
    0.0

    (2) There are no predicted positives but some true positives: recall is
    zero and precision is undefined (0/0).
    >>> f1_measure([[True, True]], [[False, False]])
    0.0

    (3) There are neither true nor predicted positives, i.e. all predictions
    are correct:
    >>> f1_measure([[False, False]], [[False, False]])
    1.0

(2) and (3) seem pretty clear-cut, but (1) should really depend on how many labels were predicted positive. Should we sidestep the issue by just defining cases (2) and (3) and continuing to throw a warning in (1)?

timokau avatar Nov 18 '19 18:11 timokau

Of those three cases, (2) is an obvious 0.0. For (3) the value 1.0 is sensible, but I would still throw a warning, since having no positives in an instance might hint at a problem in the dataset. Similarly, I would return 0.0 for (1) and raise a warning.
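
A sketch of an f1_measure that follows this convention (an illustration of the proposal, not csrank's actual implementation; the call signature mirrors the doctests above):

    import warnings
    import numpy as np

    def f1_measure(y_true, y_pred):
        """Mean per-instance F1 over boolean arrays of shape (n_instances, n_labels)."""
        y_true = np.asarray(y_true, dtype=bool)
        y_pred = np.asarray(y_pred, dtype=bool)
        scores = []
        for true_row, pred_row in zip(y_true, y_pred):
            tp = np.sum(true_row & pred_row)
            n_true = true_row.sum()
            n_pred = pred_row.sum()
            if n_true == 0 and n_pred == 0:
                # Case (3): nothing to find and nothing predicted.
                warnings.warn("Instance has no positive labels; F1 set to 1.0.")
                scores.append(1.0)
            elif n_true == 0:
                # Case (1): recall is 0/0; define F1 as 0.0 but warn.
                warnings.warn("Instance has no positive labels; F1 set to 0.0.")
                scores.append(0.0)
            elif n_pred == 0:
                # Case (2): precision is 0/0 but recall is 0, so F1 is 0.0.
                scores.append(0.0)
            else:
                precision = tp / n_pred
                recall = tp / n_true
                f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
                scores.append(f1)
        return float(np.mean(scores))

If the doctests above are kept, the warnings raised in cases (1) and (3) would need to be silenced or checked explicitly in the tests.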

kiudee avatar Nov 22 '19 14:11 kiudee