ExplainaBoard
Implement CalibrationAnalysis
Calibration refers to whether a system's confidence is well-correlated with whether the system actually got the answer right. It would be nice if we could do analyses related to calibration, such as calculating expected calibration error (ECE): https://arxiv.org/abs/1706.04599
I think this should probably be implemented as an additional variety of analysis, which would be simple and self-contained: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/analyses.py#L45
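For concreteness, here is a rough sketch of what such an analysis variety could look like. The class and method names below are hypothetical and self-contained, not the actual interfaces in `analyses.py`:

```python
from dataclasses import dataclass


@dataclass
class CalibrationAnalysisSketch:
    """Hypothetical sketch of a calibration analysis variety (not the real API).

    It would take per-example confidences and correctness flags, bucket the
    examples by confidence, and report calibration statistics per bucket.
    """

    description: str = "calibration analysis"
    num_buckets: int = 10

    def bucket_of(self, confidence: float) -> int:
        # Equal-width buckets over [0, 1]; confidence 1.0 goes into the last bucket.
        return min(int(confidence * self.num_buckets), self.num_buckets - 1)

    def perform(self, confidences: list[float], correct: list[bool]) -> list[list[int]]:
        # Group example indices by confidence bucket; per-bucket metrics
        # (accuracy, average confidence, ECE/MCE) would be computed from these.
        buckets: list[list[int]] = [[] for _ in range(self.num_buckets)]
        for i, conf in enumerate(confidences):
            buckets[self.bucket_of(conf)].append(i)
        return buckets
```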
@neubig, this is also one feature I have been expecting. The only complicated thing is that we need `probability` as an additional feature.
We could probably have a rule like this: if a system output file from a classification task contains a `probability` feature, then the processor will conduct calibration analysis adaptively (kind of similar to training-set-dependent features).
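Not actual processor code, but roughly the kind of adaptive rule I mean; `confidence` is a hypothetical name for the extra column in the system output:

```python
def calibration_is_applicable(system_output: list[dict]) -> bool:
    """Hypothetical helper: run calibration analysis only if every example
    carries a usable probability, i.e. a confidence value in [0, 1] rather
    than a logit, alongside the predicted label."""
    for example in system_output:
        conf = example.get("confidence")
        if conf is None or not 0.0 <= float(conf) <= 1.0:
            return False
    return True
```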
At the moment, if an analysis is not applicable it returns `None`, so we could do a similar thing here.
Yes, I noticed that. But this can also result in potential bugs when deploying the web platform, which I have spent quite a long time debugging. There is a potential schema validation bug that can happen here: https://github.com/neulab/explainaboard_web/blob/c711cf8277c4f19d0c12e18b008b7ec2b8779d00/backend/src/impl/default_controllers_impl.py#L447
I don't think returning `None` is necessarily a bad thing if we know it's expected behavior. But we could definitely discuss ways to rectify this if it's a problem.
I just created #418 around the topic of `None`.
Basically I think `None` is not informative, as it provides no fine-grained information and users have no control in the case of invalid operations. I prefer either:
- Raising exceptions to convey appropriate information outside the process. Since Python is designed to work with exceptions everywhere (even for ordinary control flow, such as `for` and `StopIteration`), exceptions are the first choice rather than giving return values extra semantics.
- Using result semantics as in other languages (see the sketch below).
I think it is better that `None` be used only to signal the absence of information. If there is information that would be useful to report externally, it is better to adopt one of the approaches described above.
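To make the two options concrete, here is a small Python sketch of each style (all names here are made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional


class AnalysisNotApplicableError(Exception):
    """Style 1: raise an exception that explains why the analysis was skipped."""


@dataclass
class AnalysisResult:
    """Style 2: a result object that carries either a value or an error message."""

    value: Optional[float] = None
    error: Optional[str] = None

    @property
    def is_ok(self) -> bool:
        return self.error is None


def average_confidence_or_raise(confidences: list[float]) -> float:
    # Exception style: callers that care can catch AnalysisNotApplicableError;
    # an unexpected condition never silently becomes None.
    if not confidences:
        raise AnalysisNotApplicableError("no confidence values in the system output")
    return sum(confidences) / len(confidences)


def average_confidence_as_result(confidences: list[float]) -> AnalysisResult:
    # Result style: the caller must inspect .is_ok / .error explicitly.
    if not confidences:
        return AnalysisResult(error="no confidence values in the system output")
    return AnalysisResult(value=sum(confidences) / len(confidences))
```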
Here are some draft ideas for implementing calibration analysis:
- When to perform calibration analysis: (1) the task has an accuracy metric; (2) the user provides both predicted labels and confidence values in the output file. We should check that the confidence values are in the range [0, 1], i.e. probabilities rather than logits.
- Bucketing: divide the confidence range [0, 1] into K intervals, where K is a hyper-parameter, and assign each sample to the corresponding bin.
- Calculate the accuracy and average confidence of each bin, and compute ECE and MCE according to formulas (3) and (5) in this paper (a sketch follows after this list).
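A minimal, self-contained sketch of the bucketing and the ECE/MCE computation described above, assuming the confidences have already been validated to lie in [0, 1] (K corresponds to `num_buckets` here):

```python
def compute_calibration(
    confidences: list[float], correct: list[bool], num_buckets: int = 10
) -> dict:
    """Bucket predictions by confidence and compute ECE and MCE.

    ECE = sum_k (|B_k| / n) * |acc(B_k) - conf(B_k)|
    MCE = max_k |acc(B_k) - conf(B_k)|
    following the binned estimators (formulas (3) and (5)) of Guo et al. (2017).
    """
    n = len(confidences)
    counts = [0] * num_buckets
    corrects = [0] * num_buckets
    conf_sums = [0.0] * num_buckets

    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 falls into the last bin.
        k = min(int(conf * num_buckets), num_buckets - 1)
        counts[k] += 1
        corrects[k] += int(is_correct)
        conf_sums[k] += conf

    ece, mce = 0.0, 0.0
    per_bucket = []
    for k in range(num_buckets):
        if counts[k] == 0:
            per_bucket.append(None)  # empty bins contribute nothing
            continue
        acc = corrects[k] / counts[k]
        avg_conf = conf_sums[k] / counts[k]
        gap = abs(acc - avg_conf)
        ece += counts[k] / n * gap
        mce = max(mce, gap)
        per_bucket.append(
            {"accuracy": acc, "avg_confidence": avg_conf, "count": counts[k]}
        )

    return {"ece": ece, "mce": mce, "buckets": per_bucket}
```

Empty bins are skipped so they contribute to neither ECE nor MCE, matching the binned estimators in the paper.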
One comment: we may want to view a `CalibrationAnalysisResult` as either a subclass of a `BucketAnalysisResult`, or a `BucketAnalysisResult` with some auxiliary information (namely ECE and MCE). That would make it easy to, for example, display the calibration diagram using (nearly) the same code that we normally use for displaying buckets.
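Something along these lines, with a simplified stand-in for `BucketAnalysisResult` since the real class has different fields:

```python
from dataclasses import dataclass, field


@dataclass
class BucketAnalysisResultSketch:
    """Simplified stand-in for the existing bucket analysis result."""

    name: str
    bucket_performances: list = field(default_factory=list)


@dataclass
class CalibrationAnalysisResultSketch(BucketAnalysisResultSketch):
    """A bucket result plus auxiliary calibration summaries (ECE and MCE).

    Because the bucket fields are unchanged, the code that renders bucket
    bar charts could draw the calibration diagram with little modification.
    """

    expected_calibration_error: float = 0.0
    maximum_calibration_error: float = 0.0
```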
@neubig Subclassing and composition should be used when there is a semantically meaningful relationship between the two classes. If all we need is to reuse the same display code, implementing them separately is better for organizing the whole thing.
This issue was resolved.