Fleiss Kappa
🚀 Feature
Fleiss Kappa
Motivation
Fleiss Kappa is a measure of inter-rater agreement between $k$ raters. It is useful in many areas, for example when combining multiple measurements or in ensemble methods.
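For reference, the standard definition: with $N$ subjects, $k$ raters, and $c$ categories, let $n_{ij}$ be the number of raters who assigned subject $i$ to category $j$. Then

$$P_i = \frac{1}{k(k-1)} \sum_{j=1}^{c} n_{ij}(n_{ij}-1), \qquad \bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad \bar{P}_e = \sum_{j=1}^{c} \left( \frac{1}{Nk} \sum_{i=1}^{N} n_{ij} \right)^{2}, \qquad \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.$$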
Pitch
Add Fleiss Kappa as a metric. I implemented it myself a while ago, but think it might be a nice addition to torchmetrics: https://github.com/cemde/FleissKappa
I am happy to give it a try, make the metric more torchmetrics-like, and open a PR.
Alternatives
Additional context
cool, @cemde are you willing to contribute this metric? :)
@Borda I'll give it a go!
What should the design of the call signature be? For Cohen's kappa, the two raters are passed through the preds and target variables. With Fleiss Kappa, we have N > 1 raters, so this is not possible. Further, it is by nature an unsupervised metric, which raises the question of the call signature for unsupervised metrics in general - I couldn't find any in torchmetrics. We only need preds, but it might be good to also accept target as input, for compatibility with other metrics in MetricCollections.
If you're going in this direction, it might be interesting to keep Krippendorff's Alpha in mind as well. We chose it over Fleiss Kappa because it can handle a varying number of labelers per data point.
Not that I need it or anything, just as a note. We currently use Simpledorff for that.
@wisecornelius can I give it a stab, in case you have not started working on it already? cc: @Borda @SkafteNicki
@krishnakalyan3 @Borda I have the background code ready. I am just waiting for a response on the call signature to finish it up.
That would be great, just not sure what you mean by "call signature", like API?
Most metrics are called as Metric.update(preds: torch.Tensor, target: torch.Tensor). This works for Cohen's Kappa because we have exactly two raters: one rater is preds and the other is target. With Fleiss Kappa, we have K raters, so that pattern does not apply. I therefore suggest a call like Metric.update(ratings: torch.Tensor), with ratings having ... x K dimensions. As far as I can see, this would be the first metric to deviate from the Metric.update(preds: torch.Tensor, target: torch.Tensor, ...) pattern.
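To illustrate, here is a minimal functional sketch of that signature, assuming ratings is an [N, K] tensor of integer category labels (K raters per sample, all samples rated by the same number of raters); the function name and the num_categories argument are illustrative, not the final torchmetrics API:

```python
import torch

def fleiss_kappa(ratings: torch.Tensor, num_categories: int) -> torch.Tensor:
    # ratings: [N, K] integer labels, one column per rater (illustrative signature)
    n, k = ratings.shape
    # [N, K] labels -> [N, C] counts of how many raters chose each category
    counts = torch.nn.functional.one_hot(ratings, num_categories).sum(dim=1).float()
    # Observed agreement: mean per-sample probability that two raters agree
    p_bar = ((counts * (counts - 1)).sum(dim=1) / (k * (k - 1))).mean()
    # Chance agreement from the marginal category proportions
    p_j = counts.sum(dim=0) / (n * k)
    p_e = (p_j**2).sum()
    return (p_bar - p_e) / (1 - p_e)

# e.g. three raters labeling four samples into categories {0, 1, 2}
ratings = torch.tensor([[0, 0, 1], [1, 1, 1], [2, 2, 0], [1, 1, 1]])
print(fleiss_kappa(ratings, num_categories=3))  # ~0.41
```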
Hi @cemde, sorry for being silent regarding this issue.
I think it is fine for the call signature to be metric.update(ratings: torch.Tensor),
since that is also what makes most sense to me :)
We just need to specify this in the documentation.
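To make that concrete, a torchmetrics-style module with this signature could look roughly like the sketch below (the class name, the num_categories argument, and the internal state handling are all illustrative, not the final implementation):

```python
import torch
from torchmetrics import Metric
from torchmetrics.utilities import dim_zero_cat

class FleissKappa(Metric):
    """Sketch only: assumes update(ratings) with ratings of shape [N, K]
    holding integer category labels and a fixed number of raters K."""

    def __init__(self, num_categories: int, **kwargs):
        super().__init__(**kwargs)
        self.num_categories = num_categories
        # Keep per-sample category counts so compute() can pool over all batches
        self.add_state("counts", default=[], dist_reduce_fx="cat")

    def update(self, ratings: torch.Tensor) -> None:
        # [N, K] labels -> [N, C] counts of raters per category
        counts = torch.nn.functional.one_hot(ratings, self.num_categories).sum(dim=1)
        self.counts.append(counts.float())

    def compute(self) -> torch.Tensor:
        counts = dim_zero_cat(self.counts)  # [N_total, C]
        k = counts[0].sum()                 # raters per sample
        p_bar = ((counts * (counts - 1)).sum(dim=1) / (k * (k - 1))).mean()
        p_j = counts.sum(dim=0) / counts.sum()
        p_e = (p_j**2).sum()
        return (p_bar - p_e) / (1 - p_e)
```

Because update takes a single ratings tensor, documenting the expected shape (and the deviation from the usual preds/target pattern) will be important, in particular for use inside MetricCollections.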