ExplainaBoard
Implement CalibrationAnalysis
Calibration refers to whether a system's confidence is well-correlated with whether the system actually got the answer right. It would be nice if we could do analyses related to calibration, such as calculating expected calibration error (ECE): https://arxiv.org/abs/1706.04599
I think this should probably be implemented as an additional variety of analysis, which would be simple and self-contained: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/analyses.py#L45
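For concreteness, here is a rough sketch of what such an analysis variety could look like. The class and method names below are hypothetical and self-contained, not the actual interfaces in `analyses.py`:

```python
from dataclasses import dataclass


@dataclass
class CalibrationAnalysisSketch:
    """Hypothetical sketch of a calibration analysis variety (not the real API).

    It would take per-example confidences and correctness flags, bucket the
    examples by confidence, and report calibration statistics per bucket.
    """

    description: str = "calibration analysis"
    num_buckets: int = 10

    def bucket_of(self, confidence: float) -> int:
        # Equal-width buckets over [0, 1]; confidence 1.0 goes into the last bucket.
        return min(int(confidence * self.num_buckets), self.num_buckets - 1)

    def perform(self, confidences: list[float], correct: list[bool]) -> list[list[int]]:
        # Group example indices by confidence bucket; per-bucket metrics
        # (accuracy, average confidence, ECE/MCE) would be computed from these.
        buckets: list[list[int]] = [[] for _ in range(self.num_buckets)]
        for i, conf in enumerate(confidences):
            buckets[self.bucket_of(conf)].append(i)
        return buckets
```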
@neubig, this is also one feature I have been expecting. The only complicated thing is that we need `probability` as an additional feature.
We could probably have a rule like this: if a system output file from a classification task contains a `probability` feature, then the processor will conduct calibration analysis adaptively (kind of similar to training-set-dependent features).
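Not actual processor code, but roughly the kind of adaptive rule I mean; `confidence` is a hypothetical name for the extra column in the system output:

```python
def calibration_is_applicable(system_output: list[dict]) -> bool:
    """Hypothetical helper: run calibration analysis only if every example
    carries a usable probability, i.e. a confidence value in [0, 1] rather
    than a logit, alongside the predicted label."""
    for example in system_output:
        conf = example.get("confidence")
        if conf is None or not 0.0 <= float(conf) <= 1.0:
            return False
    return True
```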
At the moment, if an analysis is not applicable it returns `None`, so we could do a similar thing here.
Yes, I noticed that. But this can also result in potential bugs when deploying the web platform, which I have spent quite a long time debugging. There is a potential schema validation bug that can happen here: https://github.com/neulab/explainaboard_web/blob/c711cf8277c4f19d0c12e18b008b7ec2b8779d00/backend/src/impl/default_controllers_impl.py#L447
I don't think returning `None` is necessarily a bad thing if we know it's expected behavior. But we could definitely discuss ways to rectify this if it's a problem.
I just created #418 around the topic of `None`.
Basically I think `None` is not informative, as it provides no fine-grained information and users have no control in the case of invalid operations. I prefer either:
- Raising exceptions to convey appropriate information outside the process. Since Python is designed to work with exceptions everywhere (even for ordinary control flow, such as `for` and `StopIteration`), exceptions are the first choice rather than giving return values extra semantics.
- Using result semantics as in other languages (see the sketch below).
I think it is better that `None` be used only to signal the absence of information. If there is information that would be useful to report externally, it is better to adopt one of the approaches described above.
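To make the two options concrete, here is a small Python sketch of each style (all names here are made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional


class AnalysisNotApplicableError(Exception):
    """Style 1: raise an exception that explains why the analysis was skipped."""


@dataclass
class AnalysisResult:
    """Style 2: a result object that carries either a value or an error message."""

    value: Optional[float] = None
    error: Optional[str] = None

    @property
    def is_ok(self) -> bool:
        return self.error is None


def average_confidence_or_raise(confidences: list[float]) -> float:
    # Exception style: callers that care can catch AnalysisNotApplicableError;
    # an unexpected condition never silently becomes None.
    if not confidences:
        raise AnalysisNotApplicableError("no confidence values in the system output")
    return sum(confidences) / len(confidences)


def average_confidence_as_result(confidences: list[float]) -> AnalysisResult:
    # Result style: the caller must inspect .is_ok / .error explicitly.
    if not confidences:
        return AnalysisResult(error="no confidence values in the system output")
    return AnalysisResult(value=sum(confidences) / len(confidences))
```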
Here are some draft ideas for implementing calibration analysis:
- When to perform calibration analysis: (1) the task has an accuracy metric; (2) the user provides both predicted labels and confidence values in the output file. We should check that the confidence values are in the range [0, 1], i.e. probabilities rather than logits.
- Bucketing: divide the confidence range [0, 1] into K intervals, where K is a hyper-parameter, and assign each sample to the corresponding bin.
- Calculate the accuracy and average confidence of each bin, and compute ECE and MCE according to formulas (3) and (5) in this paper (a sketch follows after this list).
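A minimal, self-contained sketch of the bucketing and the ECE/MCE computation described above, assuming the confidences have already been validated to lie in [0, 1] (K corresponds to `num_buckets` here):

```python
def compute_calibration(
    confidences: list[float], correct: list[bool], num_buckets: int = 10
) -> dict:
    """Bucket predictions by confidence and compute ECE and MCE.

    ECE = sum_k (|B_k| / n) * |acc(B_k) - conf(B_k)|
    MCE = max_k |acc(B_k) - conf(B_k)|
    following the binned estimators (formulas (3) and (5)) of Guo et al. (2017).
    """
    n = len(confidences)
    counts = [0] * num_buckets
    corrects = [0] * num_buckets
    conf_sums = [0.0] * num_buckets

    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 falls into the last bin.
        k = min(int(conf * num_buckets), num_buckets - 1)
        counts[k] += 1
        corrects[k] += int(is_correct)
        conf_sums[k] += conf

    ece, mce = 0.0, 0.0
    per_bucket = []
    for k in range(num_buckets):
        if counts[k] == 0:
            per_bucket.append(None)  # empty bins contribute nothing
            continue
        acc = corrects[k] / counts[k]
        avg_conf = conf_sums[k] / counts[k]
        gap = abs(acc - avg_conf)
        ece += counts[k] / n * gap
        mce = max(mce, gap)
        per_bucket.append(
            {"accuracy": acc, "avg_confidence": avg_conf, "count": counts[k]}
        )

    return {"ece": ece, "mce": mce, "buckets": per_bucket}
```

Empty bins are skipped so they contribute to neither ECE nor MCE, matching the binned estimators in the paper.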
One comment: we may want to view a `CalibrationAnalysisResult` as either a subclass of a `BucketAnalysisResult`, or a `BucketAnalysisResult` with some auxiliary information (namely ECE and MCE). That would make it easy to, for example, display the calibration diagram using (nearly) the same code that we normally use for displaying buckets.
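Something along these lines, with a simplified stand-in for `BucketAnalysisResult` since the real class has different fields:

```python
from dataclasses import dataclass, field


@dataclass
class BucketAnalysisResultSketch:
    """Simplified stand-in for the existing bucket analysis result."""

    name: str
    bucket_performances: list = field(default_factory=list)


@dataclass
class CalibrationAnalysisResultSketch(BucketAnalysisResultSketch):
    """A bucket result plus auxiliary calibration summaries (ECE and MCE).

    Because the bucket fields are unchanged, the code that renders bucket
    bar charts could draw the calibration diagram with little modification.
    """

    expected_calibration_error: float = 0.0
    maximum_calibration_error: float = 0.0
```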
@neubig Subclassing and composition should be used when there is a semantically meaningful relationship between the two classes. If all we need is to reuse the same display code, implementing them separately is better for organizing the whole thing.
This issue was resolved.