
find-threshold: CLI command for multi-label classifier threshold tuning

Open rmitsch opened this issue 3 years ago • 13 comments

Goal

Add a find-threshold CLI command that investigates different threshold values for multi-label classification models and returns the one maximizing the F-score.

Description

  • New CLI command find-threshold; API call is find_threshold().
  • New tests in spacy.tests.test_cli.
  • Docs will be added once the code has been reviewed.

Supported options are:

  • pipe_name: Which pipe to evaluate (for pipelines with multiple MultiLabel_TextCategorizer components the name has to be specified; otherwise it's optional).
  • average: Whether to use micro or macro averaging to compute the F-score over all labels.
  • n_trials: Number of sample points in threshold space between 0 and 1.
  • beta: Beta coefficient for F-score calculation.
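
As a rough illustration of the intended behavior (not the actual implementation; how the threshold is applied per trial and the parameter defaults are assumptions here):

import numpy
from spacy.language import Language

def find_best_threshold(nlp: Language, dev_examples, average: str, n_trials: int):
    # Evaluate the pipeline at evenly spaced thresholds in [0, 1] and keep the
    # one that maximizes the multi-label F-score (cats_macro_f / cats_micro_f).
    results = {}
    for t in numpy.linspace(0.0, 1.0, n_trials):
        # Assumption: the multilabel textcat threshold lives in the component's
        # cfg dict; the real command would set it via the config instead.
        nlp.get_pipe("textcat_multilabel").cfg["threshold"] = float(t)
        scores = nlp.evaluate(dev_examples)
        results[float(t)] = scores[f"cats_{average}_f"]
    return max(results, key=results.get), results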

Types of change

New feature.

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [x] I ran the tests, and all new and existing tests passed.
  • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

rmitsch avatar Aug 08 '22 14:08 rmitsch

I realize this is a draft, but some general concerns:

  • there are multiple components with thresholds, can this be implemented more generally?
  • I'd rather see the scoring code extended (e.g., for beta in PRFScore and Scorer.score_cats) than for this to be completely reimplemented here
  • the args should look similar to spacy evaluate (note that it's a known problem in spacy evaluate that the DocBin is loaded generically rather than following [corpora.dev] from the config, which can cause issues sometimes)

The textcat component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.
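
For context, multi-label textcat predictions are stored as raw per-label scores, and the threshold is only applied when the scorer turns them into binary decisions, e.g. (model path and values are illustrative):

import spacy

nlp = spacy.load("./textcat_multilabel_model")  # illustrative path
doc = nlp("This movie was surprisingly good.")
# doc.cats holds raw per-label scores regardless of the threshold setting, so
# only the scoring step, not the pipeline output, changes between trials.
print(doc.cats)  # e.g. {"POSITIVE": 0.83, "NEGATIVE": 0.12}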

adrianeboyd avatar Aug 09 '22 07:08 adrianeboyd

  • there are multiple components with thresholds, can this be implemented more generally?

Should be possible. Would it be acceptable if we ditch the automatic component recognition then and always require naming the component to be evaluated?

  • I'd rather see the scoring code extended (e.g., for beta in PRFScore and Scorer.score_cats) than for this to be completely reimplemented here

Wasn't aware of this. Scorer.score_cats should be a good fit.

  • the args should look similar to spacy evaluate (note that it's a known problem in spacy evaluate that the DocBin is loaded generically rather than following [corpora.dev] from the config, which can cause issues sometimes)

I'll look into harmonizing the arguments.

The textcat component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.

Can you elaborate on how modifying thresholds would affect annotations for other components?

rmitsch avatar Aug 09 '22 10:08 rmitsch

In general the situation is that you have a component that has:

  • some config setting that's a threshold that's used in set_annotations or the scorer
  • some score that's returned by its scorer that you want to maximize

Examples:

  • textcat_multilabel: textcat_multilabel.threshold and the score cats_macro_auc
  • spancat: spancat.threshold and the score spans_sc_f
  • span_finder: span_finder.threshold and the score span_finder_sc_f

I would initially think that beta could be added as a scorer setting in updated versions of the default scorers so it can be customized in the configs. For the exploration here it might be useful if beta was easier to override than by modifying the configs, but I think I'd start from the point that there's an existing score in the scorer results that can be used directly and just look at the threshold here?

adrianeboyd avatar Aug 09 '22 11:08 adrianeboyd

Are there smart(-ish) ways to...

  1. ...dynamically identify whether a component is suited for the find-threshold command?
  2. ...map the correct scorer method to the component? Like score_spans() for span_finder (?) or score_cats() for spancat and textcat_multilabel?

We can hardcode this, of course, but I was wondering whether there's a better way to do it. 1. could be done by checking for the existence of a threshold attribute, but this is not quite consistent - e.g. SpanFinder has one, but TextCategorizer stores its threshold in self.cfg (I could check both, but this makes me question whether there are other variations).
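
To illustrate the inconsistency, a rough duck-typing check (not an existing spaCy helper):

def has_threshold(pipe) -> bool:
    # Some components expose a `threshold` attribute directly (e.g. SpanFinder),
    # others keep it in `pipe.cfg` (e.g. TextCategorizer) - neither check alone
    # is reliable across all components.
    return hasattr(pipe, "threshold") or "threshold" in getattr(pipe, "cfg", {})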

rmitsch avatar Aug 26 '22 11:08 rmitsch

No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.

In v4, I'm planning to move all these settings out of self.cfg so they're only stored in the config, but this kind of threshold could have an arbitrary name in the config.
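
A minimal sketch of resolving such a user-provided dotted config path (the helper name and mechanics are illustrative, not the final CLI behavior):

def set_by_path(config: dict, dotted_path: str, value) -> None:
    # Walk e.g. "components.spancat.threshold" through the nested config dict
    # and overwrite the final key with the trial value.
    *parents, last = dotted_path.split(".")
    node = config
    for key in parents:
        node = node[key]
    node[last] = value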

adrianeboyd avatar Aug 29 '22 08:08 adrianeboyd

No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.

It's not just the scores key, I think. The scoring method in Scorer also has to be specified or chosen if we want this to be generic. E.g. span_finder wouldn't work with Scorer.score_cats(). Or am I misunderstanding something here?

Clarification: I interpret "scores key" to be the attr argument of the method to be called for scoring, e.g. Scorer.score_cats().

rmitsch avatar Aug 31 '22 09:08 rmitsch

The component already has a registered scorer, so what I mean by "scores key" is the entry in the output of Language.evaluate that you want to optimize, like cats_macro_f. You don't care how/where this was calculated by the scorer, just that it ends up in scores.
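
For example (model and data paths are illustrative):

import spacy
from spacy.training import Corpus

nlp = spacy.load("./my_textcat_model")           # illustrative path
dev_examples = list(Corpus("./dev.spacy")(nlp))  # illustrative dev corpus
scores = nlp.evaluate(dev_examples)
# The "scores key" is simply a key in this dict - find-threshold only needs to
# read it back after each trial.
print(scores["cats_macro_f"])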

adrianeboyd avatar Aug 31 '22 09:08 adrianeboyd

The latest commit should be closer to a generic solution. Two remarks:

  • beta hasn't been introduced yet. I'd add it as an optional argument to PRFScore to pass it forward from Scorer, if it's available in the latter's config. Are there any potential pitfalls to consider when doing this?
  • The part of the test running a spancat component fails so far because nlp.evaluate() returns None for the relevant scores (upon which find_threshold() exits). It's probably related to how I set up the component, but I haven't spotted the issue yet - if there's anything obvious, I'd be thankful for a pointer.

rmitsch avatar Sep 01 '22 12:09 rmitsch

Added a draft for integrating beta. Feedback much appreciated.

rmitsch avatar Sep 01 '22 14:09 rmitsch

I don't think beta makes sense as a Scorer-level setting but rather as an individual component scorer setting that would be set in the config, e.g.:

from functools import partial
from spacy.util import registry

@registry.scorers("spacy.textcat_scorer.v1")
def make_textcat_scorer(beta: float = 1.0):
    return partial(textcat_score, beta=beta)

The existing textcat scorer is kind of a bad example because threshold should also already be a scorer setting but isn't.

For example, you could have two spancat components with different beta settings.

adrianeboyd avatar Sep 02 '22 07:09 adrianeboyd

Let me know if changes for textcat_multilabel/score_cats() and spancat/score_spans() match what you meant. If so, I'll update the other components and scoring functions.

rmitsch avatar Sep 02 '22 10:09 rmitsch

I meant that beta is not a parameter for the CLI command, but that beta is added as a scorer parameter in the Scorer methods and the component scorers. beta would already be set in the component config by the user based on their task. Adding beta is probably better suited to a separate PR, though, and it would increment all the registered scorer versions.

For find-threshold you would still just care that there is some score in scores (and without modifying _pipe_configs directly, as is done now).

adrianeboyd avatar Sep 02 '22 11:09 adrianeboyd

Can you remove all the beta-related code from this PR (as a CLI arg and in the scorer) and also remove the universe.json in the tests that was added accidentally?

adrianeboyd avatar Sep 12 '22 07:09 adrianeboyd

@adrianeboyd Are we good with this or do we want another review round?

rmitsch avatar Sep 27 '22 11:09 rmitsch

@adrianeboyd Does the current state of this PR address your concerns? If so, I'll update the docs.

rmitsch avatar Oct 20 '22 11:10 rmitsch

I tried this out on a number of demo projects. From a UX perspective I think it would be better to display the table rows as they become available (so it doesn't look like it's hanging) and then show the result at the end (especially when the table gets long), e.g.:

Threshold             cats_micro_f       
0.0                   0.23328648865689622
0.05                  0.649127003109304  
0.1                   0.7909604519774011 
0.15000000000000002   0.8279835390946503 
0.2                   0.8376998979244642 
0.25                  0.8420502092050208 
0.30000000000000004   0.8413231912258978 
0.35000000000000003   0.8423132183908045 
0.4                   0.8418570910409865 
0.45                  0.8397142333760762 
0.5                   0.8393055042482452 
0.55                  0.8367156954012288 
0.6000000000000001    0.8329579579579579 
0.65                  0.8292130590677488 
0.7000000000000001    0.8258801141769743 
0.75                  0.8228090317642556 
0.8                   0.8195836545875097 
0.8500000000000001    0.8139941690962099 
0.9                   0.8065277231616202 
0.9500000000000001    0.7898216790222401 
1.0                   0.0

Best threshold: 0.35 with value of 0.8423132183908045. 

I think some rounding would make the thresholds look less weird and probably makes sense for the scores, too.

The default of 10 leads to kind of unexpected thresholds (at least for me) with the 0.0-1.0 range:

Threshold            cats_macro_f       
0.0                  0.1752746762039245 
0.1111111111111111   0.6099358428167384 
0.2222222222222222   0.617525276284966  
0.3333333333333333   0.5966641697784788 
0.4444444444444444   0.5663210985143793 
0.5555555555555556   0.5362910007032317 
0.6666666666666666   0.4959949806373485 
0.7777777777777777   0.4494350003267488 
0.8888888888888888   0.37344248469207914
1.0                  0.0

I don't know whether a default of 11 or a default of +1 trials would be better? For non-weirdness I would vote for +1 trials, but I can see that it's odd that it's not running the number of trials you actually specified.
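
For reference, with an evenly spaced sweep the trial points only land on round values when the count is n+1 (assuming something like numpy.linspace is used):

import numpy

print(numpy.linspace(0.0, 1.0, 10))  # 0.0, 0.111..., 0.222..., ..., 1.0
print(numpy.linspace(0.0, 1.0, 11))  # 0.0, 0.1, 0.2, ..., 1.0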

There's also the UX question of what to do when the user provides a case where looking for a threshold doesn't make sense (exclusive classes) or where the score doesn't make sense or isn't affected by the threshold (ROC AUC), because the output doesn't really clarify what's going on. This is more of a problem for textcat than for spancat because of the fact that the threshold is only used for scoring purposes.

With exclusive classes:

ℹ Optimizing for cats_macro_f for component 'textcat' with 10
trials.
Best threshold: 0.0 with value of 0.8675944188764702. 
Threshold            cats_macro_f      
0.0                  0.8675944188764702
0.1111111111111111   0.8675944188764702
0.2222222222222222   0.8675944188764702
0.3333333333333333   0.8675944188764702
0.4444444444444444   0.8675944188764702
0.5555555555555556   0.8670025878841685
0.6666666666666666   0.855006470080778 
0.7777777777777777   0.8427331998562555
0.8888888888888888   0.8106946161627515
1.0                  0.0

And for the textcat multilabel case, where you do care about the threshold, the default score cats_score is cats_macro_auc, so you'd have to know to select a non-default score for this to make sense.

ℹ Optimizing for cats_score for component 'textcat_multilabel' with 10
trials.
Best threshold: 0.0 with value of 0.8696220207994788. 
Threshold            cats_score        
0.0                  0.8696220207994788
0.1111111111111111   0.8696220207994788
0.2222222222222222   0.8696220207994788
0.3333333333333333   0.8696220207994788
0.4444444444444444   0.8696220207994788
0.5555555555555556   0.8696220207994788
0.6666666666666666   0.8696220207994788
0.7777777777777777   0.8696220207994788
0.8888888888888888   0.8696220207994788
1.0                  0.8696220207994788

The spancat case is easier than the textcat one, although figuring out the score to provide may be difficult for users:

Best threshold: 0.5556 with value of 0.6720516962843295. 
Threshold            spans_grit_f        
0.0                  0.019298799747315222
0.1111111111111111   0.6576373212942063  
0.2222222222222222   0.6553846153846155  
0.3333333333333333   0.6619496855345912  
0.4444444444444444   0.6677265500794912  
0.5555555555555556   0.6720516962843295  
0.6666666666666666   0.671556642216789   
0.7777777777777777   0.6627906976744186  
0.8888888888888888   0.6558497011101622  
1.0                  0.0

adrianeboyd avatar Oct 24 '22 07:10 adrianeboyd

For textcat I think it's a bug in the scorer that the threshold is not set to 0.0 for multi_label=False. It doesn't make sense for there to be a threshold setting at all in the component, but this is probably a carry-over from before the defaults for exclusive classes were modified?

adrianeboyd avatar Oct 24 '22 08:10 adrianeboyd

From a UX perspective I think it would be better to display the table rows as they become available (so it doesn't look like it's hanging) and then show the result at the end...

Done.

I don't know whether a default of 11 or a default of +1 trials would be better?

I changed the default to 11 (agree it looks unexpected otherwise). I wouldn't run with a number of trials differing from what the user specified.

There's also the UX question of what to do when the user provides a case where looking for a threshold doesn't make sense.

Hm, that's tricky. I don't think there's a way to detect that, right? Hard-coding some warning for specific use cases is possible, but would be annoying to maintain. So maybe just emphasize in the docs that threshold optimization doesn't make sense in all scenarios and list examples?

rmitsch avatar Oct 24 '22 09:10 rmitsch

For textcat I think it's a bug in the scorer that the threshold is not set to 0.0 for multi_label=False.

Do we want to address this in this PR?

rmitsch avatar Oct 24 '22 09:10 rmitsch

No, not in this PR. It should look the same as the cats_macro_auc with the same values for all thresholds. I'm trying to think about how you could detect the problematic cases without knowing anything about the scorer. You probably can't, but I think we can add more info/warnings for the built-in components like textcat about when this doesn't make sense.

adrianeboyd avatar Oct 24 '22 10:10 adrianeboyd

You probably can't, but I think we can add more info/warnings for the built-in components like textcat about when this doesn't make sense.

So emitting a warning in those components directly whenever threshold has been set but wouldn't change anything about the outcome?

rmitsch avatar Oct 24 '22 11:10 rmitsch

I meant emitting a warning in find-threshold if you were doing something that didn't make sense, to avoid users trying to do this for cases where we know it's not sensible.

adrianeboyd avatar Oct 24 '22 12:10 adrianeboyd

Otherwise they're going to be scratching their heads for textcat_multilabel, like they're using cats_score as the provided reference score and they know the threshold can matter, but then the results are weird.

adrianeboyd avatar Oct 24 '22 12:10 adrianeboyd

I'm also not getting the results I'd expect for precision with a threshold of 1.0, which looks like an issue with PRFScore as opposed to an individual scoring method:

Threshold    cats_macro_p
0.0          0.132     
0.25         0.666     
0.5          0.774     
0.75         0.838     
1.0          0.0

Threshold    cats_macro_r
0.0          1.0       
0.25         0.597     
0.5          0.477     
0.75         0.37      
1.0          0.0

I think the precision of nothing is 1.0 rather than 0.0, which is what you currently get...
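
A sketch of the behavior described above (the smoothing term mirrors a PRFScore-style implementation and is an assumption, not the exact code):

def precision(tp: int, fp: int) -> float:
    # tp / (tp + fp) with a tiny smoothing term so that 0/0 comes out as 0.0
    # instead of raising - hence "predict nothing" currently scores 0.0.
    return tp / (tp + fp + 1e-100)

print(precision(0, 0))  # 0.0, though an empty prediction set arguably has precision 1.0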

adrianeboyd avatar Oct 24 '22 13:10 adrianeboyd

Let's see, to catch some common problems (a rough sketch follows the list):

  • check for textcat since the threshold doesn't make sense for exclusive classes
  • check for all returned values being the same since this indicates something wrong, and then don't recommend 0.0 as a threshold for this case
    • for textcat_multilabel suggest using cats_macro_f or cats_micro_f instead
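
Something along these lines, inside find_threshold (component names and messages are illustrative, not the final implementation):

from wasabi import msg

def warn_if_not_sensible(pipe_name: str, results: dict) -> None:
    # Exclusive-classes textcat: the threshold only exists for historical
    # reasons, so tuning it is not meaningful.
    if pipe_name == "textcat":
        msg.warn("Thresholds are not meaningful for components with exclusive classes.")
    # Identical scores across all trials suggest the chosen metric doesn't
    # depend on the threshold (e.g. ROC AUC) - don't recommend 0.0 here.
    if len(set(results.values())) == 1:
        msg.warn(
            "All trials returned the same score; the metric may not depend on "
            "the threshold. For textcat_multilabel, try cats_macro_f or cats_micro_f."
        )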

adrianeboyd avatar Oct 25 '22 11:10 adrianeboyd

Can you have a look at the conflicts?

adrianeboyd avatar Nov 11 '22 09:11 adrianeboyd

This is quite weird. Apparently my master had diverged from explosion:master (I guess due to the new approach to the release prep?). Synchronizing it eradicated all commits in this PR via an automatic force-push initiated by GitHub. I'm working on re-adding the changes to this branch.

rmitsch avatar Nov 11 '22 10:11 rmitsch

Should be fine now. Are we OK with this? If so, I'll update the docs.

rmitsch avatar Nov 11 '22 11:11 rmitsch

It would be a good idea to provide some standard settings for this command

I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?

rmitsch avatar Nov 17 '22 09:11 rmitsch

It would be a good idea to provide some standard settings for this command

I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?

It'd be nice to include one for each of the main pipeline components we see as relevant - currently mainly multilabel textcat & spancat, no?

svlandeg avatar Nov 17 '22 09:11 svlandeg