find-threshold: CLI command for multi-label classifier threshold tuning
Goal
Add a `find-threshold` CLI command investigating different threshold values for a multi-label classification model and returning the one that maximizes the F-score.
Description
- New CLI command `find-threshold`; the API call is `find_threshold()`.
- New tests in `spacy.tests.test_cli`.
- Docs will be added once the code has been reviewed.
Supported options are:
- `pipe_name`: Which pipe to evaluate (in pipelines with multiple `MultiLabel_TextCategorizer` components the name has to be specified, otherwise it's optional).
- `average`: Whether to use `micro` or `macro` to compute the F-score over all labels.
- `n_trials`: Number of sample points in threshold space between 0 and 1.
- `beta`: Beta coefficient for the F-score calculation.
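For illustration, a rough sketch of how the draft API might be invoked; the import path, argument names (taken from the options above), and return handling are assumptions about this draft, not a final interface:

```python
# Hypothetical usage of the draft find_threshold() API described above.
# Module path, argument names, and return value are assumptions.
from spacy.cli.find_threshold import find_threshold

result = find_threshold(
    model="./my_trained_pipeline",   # hypothetical pipeline path
    data_path="./dev.spacy",         # hypothetical dev data
    pipe_name="textcat_multilabel",  # optional if only one matching component
    average="micro",
    n_trials=11,
    beta=1.0,
)
print(result)  # expected to report the threshold maximizing the F-score
```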
Types of change
New feature.
Checklist
- [x] I confirm that I have the right to submit this contribution under the project's MIT license.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
I realize this is a draft, but some general concerns:
- there are multiple components with thresholds, can this be implemented more generally?
- I'd rather see the scoring code extended (e.g., for `beta` in `PRFScore` and `Scorer.score_cats`) than for this to be completely reimplemented here
- the args should look similar to `spacy evaluate` (note that it's a known problem in `spacy evaluate` that the `DocBin` is loaded generically rather than following `[corpora.dev]` from the config, which can cause issues sometimes)
The textcat component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.
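To make the `[corpora.dev]` point concrete, a minimal sketch of loading the dev corpus via the config instead of reading the `DocBin` generically (assuming the pipeline's config has a `[corpora.dev]` block with its paths set; the pipeline path is hypothetical):

```python
import spacy
from spacy.util import resolve_dot_names

nlp = spacy.load("./my_trained_pipeline")  # hypothetical pipeline
config = nlp.config.interpolate()

# Resolve the dev corpus reader registered under [corpora.dev] in the config
# rather than loading the DocBin directly.
dev_corpus = resolve_dot_names(config, ["corpora.dev"])[0]
examples = list(dev_corpus(nlp))
print(nlp.evaluate(examples))
```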
- there are multiple components with thresholds, can this be implemented more generally?
Should be possible. Would it be acceptable if we ditch the automatic component recognition then and always require naming the component to be evaluated?
- I'd rather see the scoring code extended (e.g., for `beta` in `PRFScore` and `Scorer.score_cats`) than for this to be completely reimplemented here
Wasn't aware of this. Scorer.score_cats should be a good fit.
- the args should look similar to `spacy evaluate` (note that it's a known problem in `spacy evaluate` that the `DocBin` is loaded generically rather than following `[corpora.dev]` from the config, which can cause issues sometimes)
I'll look into harmonizing the arguments.
The `textcat` component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.
Can you elaborate on how modifying thresholds would affect annotations for other components?
In general the situation is that you have a component that has:
- some config setting that's a threshold that's used in `set_annotations` or the scorer
- some score that's returned by its scorer that you want to maximize
Examples:
- `textcat_multilabel`: `textcat_multilabel.threshold` and the score `cats_macro_auc`
- `spancat`: `spancat.threshold` and the score `spans_sc_f`
- `span_finder`: `span_finder.threshold` and the score `span_finder_sc_f`
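For concreteness, a minimal sketch of where two of those thresholds live in a loaded pipeline's config (the pipeline path is hypothetical and assumed to contain both components):

```python
import spacy

nlp = spacy.load("./my_pipeline")  # hypothetical pipeline with both components

# The config settings referenced above sit in the component blocks:
print(nlp.config["components"]["textcat_multilabel"]["threshold"])
print(nlp.config["components"]["spancat"]["threshold"])
```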
I would initially think that beta could be added as a scorer setting in updated versions of the default scorers so it can be customized in the configs. For the exploration here it might be useful if beta was easier to override than by modifying the configs, but I think I'd start from the point that there's an existing score in the scorer results that can be used directly and just look at the threshold here?
Are there smart(-ish) ways to...
- ...dynamically identify whether a component is suited for the `find-threshold` command?
- ...map the correct scorer method to the component? Like `score_spans()` for `span_finder` (?) or `score_cats()` for `spancat` and `textcat_multilabel`?
We can hardcode this ofc, but I was wondering whether there's a better way to do this. 1. could be done by checking for the existence of a threshold attribute, but this is not quite consistent - e.g. SpanFinder has one, but TextCategorizer stores its threshold in self.cfg (I could check both, but this makes me question whether there are other variations).
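To illustrate the inconsistency (a minimal sketch; `nlp` is assumed to be a loaded pipeline with a `textcat_multilabel` component):

```python
# TextCategorizer keeps the threshold in the component's cfg dict rather than
# exposing it as a dedicated attribute, so attribute checks alone won't cover it.
pipe = nlp.get_pipe("textcat_multilabel")
print(pipe.cfg["threshold"])  # e.g. 0.5, the value used when scoring
```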
No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.
In v4, I'm planning to move all these settings out of self.cfg so they're only stored in the config, but this kind of threshold could have an arbitrary name in the config.
No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.
It's not just the scores key, I think. The scoring method in Scorer also has to be specified or chosen if we want this to be generic. E.g. span_finder wouldn't work with Scorer.score_cats(). Or am I misunderstanding something here?
Clarification: I interpret "scores key" to be the attr attribute in the method to be called for scoring, e.g. Scorer.score_cats().
The component already has a registered scorer, so what I mean by "scores key" is the entry in the output of Language.evaluate that you want to optimize, like cats_macro_f. You don't care how/where this was calculated by the scorer, just that it ends up in scores.
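A minimal sketch of what that means in practice (pipeline and data paths are hypothetical):

```python
import spacy
from spacy.training import Corpus

nlp = spacy.load("./my_textcat_model")  # hypothetical pipeline
dev_corpus = Corpus("./dev.spacy")      # hypothetical dev data
scores = nlp.evaluate(list(dev_corpus(nlp)))

# The "scores key" is just an entry of this dict, e.g. the score to optimize:
print(scores["cats_macro_f"])
```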
The latest commit should be closer to a generic solution. Two remarks:
- `beta` hasn't been introduced yet. I'd add it as an optional argument to `PRFScore` to pass it forward from `Scorer`, if it's available in the latter's config. Are there any potential pitfalls to consider when doing this?
- The part of the test running a `spancat` component fails so far because `nlp.evaluate()` returns `None` for the relevant scores (upon which `find_threshold()` exits). It's probably related to how I set up the component, but I haven't spotted the issue yet - if there's anything obvious, I'd be thankful for a pointer.
Added a draft for integrating beta. Feedback much appreciated.
I don't think beta makes sense as a Scorer-level setting but rather as an individual component scorer setting that would be set in the config, e.g.:
from functools import partial
from spacy.pipeline.textcat import textcat_score
from spacy.util import registry

@registry.scorers("spacy.textcat_scorer.v1")
def make_textcat_scorer(beta: float = 1.0):
    return partial(textcat_score, beta=beta)
The existing textcat scorer is kind of a bad example because threshold should also already be a scorer setting but isn't.
For example, you could have two spancat components with different beta settings.
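A rough sketch of what that could look like in the component configs, assuming a future scorer version that accepts `beta` (the `v2` name and the `beta` setting are hypothetical):

```python
# Hypothetical component config blocks: two spancat components, each scored
# with a different beta. "spacy.spancat_scorer.v2" with beta does not exist yet.
spancat_recall_focused = {
    "scorer": {"@scorers": "spacy.spancat_scorer.v2", "beta": 2.0},
}
spancat_precision_focused = {
    "scorer": {"@scorers": "spacy.spancat_scorer.v2", "beta": 0.5},
}
```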
Let me know if changes for textcat_multilabel/score_cats() and spancat/score_spans() match what you meant. If so, I'll update the other components and scoring functions.
I meant that beta is not a parameter for the CLI command, but that beta is added as a scorer parameter in the Scorer methods and the component scorers. beta would already be set in the component config by the user based on their task. Adding beta is probably better suited as a separate PR, though, and it would increment all the registered scorer versions.
For find-threshold you would still just care that there is some score in scores (and w/o modifying _pipe_configs directly, as it's done now).
Can you remove all the beta-related code from this PR (as a CLI arg and in the scorer) and also remove the universe.json in the tests that was added accidentally?
@adrianeboyd Are we good with this or do we want another review round?
@adrianeboyd Does the current state of this PR address your concerns? If so, I'll update the docs.
I tried this out on a number of demo projects. From a UX perspective I think it would be better to display the table rows as available (so it doesn't look like it's hanging) and then the result at the end (especially when the table gets long), e.g.:
Threshold cats_micro_f
0.0 0.23328648865689622
0.05 0.649127003109304
0.1 0.7909604519774011
0.15000000000000002 0.8279835390946503
0.2 0.8376998979244642
0.25 0.8420502092050208
0.30000000000000004 0.8413231912258978
0.35000000000000003 0.8423132183908045
0.4 0.8418570910409865
0.45 0.8397142333760762
0.5 0.8393055042482452
0.55 0.8367156954012288
0.6000000000000001 0.8329579579579579
0.65 0.8292130590677488
0.7000000000000001 0.8258801141769743
0.75 0.8228090317642556
0.8 0.8195836545875097
0.8500000000000001 0.8139941690962099
0.9 0.8065277231616202
0.9500000000000001 0.7898216790222401
1.0 0.0
Best threshold: 0.35 with value of 0.8423132183908045.
I think some rounding would make the thresholds look less weird and probably makes sense for the scores, too.
The default of 10 leads to kind of unexpected thresholds (at least for me) with the 0.0-1.0 range:
Threshold cats_macro_f
0.0 0.1752746762039245
0.1111111111111111 0.6099358428167384
0.2222222222222222 0.617525276284966
0.3333333333333333 0.5966641697784788
0.4444444444444444 0.5663210985143793
0.5555555555555556 0.5362910007032317
0.6666666666666666 0.4959949806373485
0.7777777777777777 0.4494350003267488
0.8888888888888888 0.37344248469207914
1.0 0.0
I don't know whether a default of 11 or a default of +1 trials would be better? For non-weirdness I would vote for +1 trials, but I can see that it's odd that it's not running the number of trials you actually specified.
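For reference, the difference between the two options (a minimal sketch, assuming the thresholds are evenly spaced points over [0, 1]):

```python
import numpy

# 10 points over [0, 1] means steps of 1/9, hence the "weird" thresholds above.
print(numpy.linspace(0.0, 1.0, 10))
# 11 points means steps of 0.1: 0.0, 0.1, ..., 1.0.
print(numpy.linspace(0.0, 1.0, 11))
```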
There's also the UX question of what to do when the user provides a case where looking for a threshold doesn't make sense (exclusive classes) or where the score doesn't make sense or isn't affected by the threshold (ROC AUC), because the output doesn't really clarify what's going on. This is more of a problem for textcat than for spancat because of the fact that the threshold is only used for scoring purposes.
With exclusive classes:
ℹ Optimizing for cats_macro_f for component 'textcat' with 10
trials.
Best threshold: 0.0 with value of 0.8675944188764702.
Threshold cats_macro_f
0.0 0.8675944188764702
0.1111111111111111 0.8675944188764702
0.2222222222222222 0.8675944188764702
0.3333333333333333 0.8675944188764702
0.4444444444444444 0.8675944188764702
0.5555555555555556 0.8670025878841685
0.6666666666666666 0.855006470080778
0.7777777777777777 0.8427331998562555
0.8888888888888888 0.8106946161627515
1.0 0.0
And the default score cats_score for the textcat multilabel case where you do care about the threshold is cats_macro_auc, so you'd have to know to select a non-default score for this to make sense.
ℹ Optimizing for cats_score for component 'textcat_multilabel' with 10
trials.
Best threshold: 0.0 with value of 0.8696220207994788.
Threshold cats_score
0.0 0.8696220207994788
0.1111111111111111 0.8696220207994788
0.2222222222222222 0.8696220207994788
0.3333333333333333 0.8696220207994788
0.4444444444444444 0.8696220207994788
0.5555555555555556 0.8696220207994788
0.6666666666666666 0.8696220207994788
0.7777777777777777 0.8696220207994788
0.8888888888888888 0.8696220207994788
1.0 0.8696220207994788
The spancat case is easier than the textcat one, although figuring out the score to provide may be difficult for users:
Best threshold: 0.5556 with value of 0.6720516962843295.
Threshold spans_grit_f
0.0 0.019298799747315222
0.1111111111111111 0.6576373212942063
0.2222222222222222 0.6553846153846155
0.3333333333333333 0.6619496855345912
0.4444444444444444 0.6677265500794912
0.5555555555555556 0.6720516962843295
0.6666666666666666 0.671556642216789
0.7777777777777777 0.6627906976744186
0.8888888888888888 0.6558497011101622
1.0 0.0
For textcat I think it's a bug in the scorer that the threshold is not set to 0.0 for multi_label=False. It doesn't make sense for there to be a threshold setting at all in the component, but this is probably a carry-over from before the defaults for exclusive classes were modified?
From a UX perspective I think it would be better to display the table rows as available (so it doesn't look like it's hanging) and then the result at the end...
Done.
I don't know whether a default of 11 or a default of +1 trials would be better?
I changed the default to 11 (agree it looks unexpected otherwise). I wouldn't run with a number of trials differing from what the user specified.
There's also the UX question of what to do when the user provides a case where looking for a threshold doesn't make sense.
Hm, that's tricky. I don't think there's a way to detect that, right? Hard-coding some warning for specific use cases is possible, but would be annoying to maintain. So maybe just emphasize in the docs that threshold optimization doesn't make sense in all scenarios and list examples?
For textcat I think it's a bug in the scorer that the threshold is not set to `0.0` for `multi_label=False`.
Do we want to address this in this PR?
No, not in this PR. It should look the same as the cats_macro_auc with the same values for all thresholds. I'm trying to think about how you could detect the problematic cases without knowing anything about the scorer. You probably can't, but I think we can add more info/warnings for the built-in components like textcat about when this doesn't make sense.
You probably can't, but I think we can add more info/warnings for the built-in components like textcat about when this doesn't make sense.
So emitting a warning in those components directly whenever threshold has been set but wouldn't change anything about the outcome?
I meant emitting a warning in find-threshold if you were doing something that didn't make sense, to avoid users trying to do this for cases where we know it's not sensible.
Otherwise they're going to be scratching their heads for textcat_multilabel, like they're using cats_score as the provided reference score and they know the threshold can matter, but then the results are weird.
I'm also not getting the results I'd expect for precision with a threshold of 1.0, which looks like an issue with PRFScore as opposed to an individual scoring method:
Threshold cats_macro_p
0.0 0.132
0.25 0.666
0.5 0.774
0.75 0.838
1.0 0.0
Threshold cats_macro_r
0.0 1.0
0.25 0.597
0.5 0.477
0.75 0.37
1.0 0.0
I think the precision of nothing is 1.0 rather than 0.0, which is what you currently get...
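A minimal illustration of the behaviour in question, assuming `PRFScore`'s counts default to zero:

```python
from spacy.scorer import PRFScore

# With no predictions at all (tp == fp == fn == 0), precision currently comes
# out as 0.0, whereas arguably the precision of predicting nothing is 1.0.
score = PRFScore()
print(score.precision, score.recall, score.fscore)
```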
Let's see, to catch some common problems:
- check for `textcat` since the threshold doesn't make sense for exclusive classes
- check for all returned values being the same since this indicates something wrong, and then don't recommend 0.0 as a threshold for this case
- for `textcat_multilabel` suggest using `cats_macro_f` or `cats_micro_f` instead
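A sketch of the second check above, assuming the per-threshold results are collected in a dict (names are illustrative, not the actual implementation):

```python
# Hypothetical trial results: threshold -> score
results = {0.0: 0.87, 0.5: 0.87, 1.0: 0.87}

if len(set(results.values())) == 1:
    # The score evidently doesn't depend on the threshold, so recommending
    # 0.0 (or any other value) as the "best" threshold would be misleading.
    print("Warning: all thresholds yield the same score; this score may not "
          "be affected by the threshold for this component.")
```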
Can you have a look at the conflicts?
This is quite weird. Apparently my master had diverged from explosion:master (I guess due to the new approach to the release prep?). Synchronizing it eradicated all commits in this PR via an automatic force-push initiated by GitHub. I'm working on re-adding the changes to this branch.
Should be fine now. Are we ok with this? Then I'd update the docs.
It would be a good idea to provide some standard settings for this command
I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?
It would be a good idea to provide some standard settings for this command
I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?
It'd be nice to include one for each of the main pipeline components we see as relevant - currently mainly multilabel textcat & spancat, no?