Kappa computation when predicted scores are on a different scale
When predicted scores are on a completely different scale from the observed scores, kappa computation fails because the set of possible score labels becomes too large. We saw this recently with SGDRegressor, which produced scores in the range 1233372304332.22 to 1723509896207.16 when human scores were 1-5.
A few possible solutions: (1) Do not compute kappa if there is no overlap in range between predicted and observed scores. (2) Do not compute kappa if the range is greater than a certain threshold.
Thoughts?
This was presumably where there was no trimming?
If there is still some overlap in range (though perhaps only partial), would you still compute kappa?
Yes, this happens for raw scores. I suppose it is possible that the new predictions would be wildly out of scale yet still overlap with the human labels, and we would have the same issue. Maybe (2) is a better solution then?
> if the range is greater than certain threshold.
What does range refer to exactly? The range of the predicted scores?
For kappa computation, the set of possible kappa labels is defined as all integers between min(min(human), min(system)) and max(max(human), max(system)). This is what I mean by range. In the example above this becomes a very large set since the two sets of scores are on different scales.
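To make the size of that label set concrete, here is a minimal sketch using the numbers from the SGDRegressor example above. The function name `kappa_label_range` is hypothetical, and rounding the scores to integers before taking min/max is an assumption, not necessarily what the actual kappa code does.

```python
def kappa_label_range(human, system):
    """Return (low, high, n_labels) for the integer label set spanning both score lists.

    Hypothetical helper; rounding scores to the nearest integer is an assumption.
    """
    low = round(min(min(human), min(system)))
    high = round(max(max(human), max(system)))
    # All integers between low and high, inclusive.
    return low, high, high - low + 1


human = [1, 2, 3, 4, 5]
system = [1233372304332.22, 1723509896207.16]

low, high, n_labels = kappa_label_range(human, system)
print(low, high, n_labels)
```

With more than a trillion possible labels, anything that allocates one entry per label (let alone a label-by-label confusion matrix) will run out of memory.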
Ah, I see. And what sort of threshold would you use?
The goal is to make sure that users get a kappa value whenever we can compute it.
We can go for 500,000,000 based on https://stackoverflow.com/questions/855191/how-big-can-a-python-list-get or set it to 100,000 to allow for lower-resource machines.
I don't think we want to go for a worst-case number like that, since it can still lead to memory errors. I feel like this threshold should depend on the range of the human score labels, no? I mean, if the human score labels are range(1, 6), then perhaps we should only allow some multiple of that range size?
So if the range is 6, we allow for example 60?
Is there a realistic scenario where the ranges diverge by more than that and kappa will still be meaningful?
No, I don't think so. I suppose theoretically you could end up with human 1, 2, 3, 4 and system 1, 200, 400, 600, but kappas won't be particularly meaningful in this case. So we do (max(human) - min(human)) * 10?
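A rough sketch of what that check could look like, assuming the threshold is ten times the span of the human scores as proposed above. The function name `can_compute_kappa` and the `multiplier` parameter are hypothetical, not part of any existing API.

```python
def can_compute_kappa(human, system, multiplier=10):
    """Return True if the combined score range is small enough to compute kappa.

    Hypothetical helper: the combined span of human and system scores must not
    exceed `multiplier` times the span of the human scores alone.
    """
    human_span = max(human) - min(human)
    combined_span = max(max(human), max(system)) - min(min(human), min(system))
    return combined_span <= human_span * multiplier


human = [1, 2, 3, 4, 5]

# Slightly out-of-range predictions are still fine.
print(can_compute_kappa(human, [0.5, 3.1, 6.2]))

# The SGDRegressor case from the top of this thread is rejected.
print(can_compute_kappa(human, [1233372304332.22, 1723509896207.16]))

# The theoretical case mentioned above is rejected as well.
print(can_compute_kappa([1, 2, 3, 4], [1, 200, 400, 600]))
```

One edge case to decide on: if all human scores are identical, the human span is 0 and this check only passes when the system scores fall inside that single value.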