
Kappa computation when predicted scores are on a different scale

Open aloukina opened this issue 5 years ago • 10 comments

In a situation where predicted scores are on a completely different scale from the observed scores, kappa computation fails because the range of possible scores is too large. We saw this recently with SGDRegressor which produced scores in the range 1233372304332.22 to 1723509896207.16 when human scores were 1-5.

A few possible solutions: (1) do not compute kappa if there is no overlap between the ranges of predicted and observed scores; (2) do not compute kappa if the range is greater than a certain threshold.

Thoughts?

aloukina avatar Jan 22 '20 19:01 aloukina

This was presumably where there was no trimming?

If there is still some overlap in range (though perhaps only partial), would you still compute kappa?

aoifecahill avatar Jan 23 '20 00:01 aoifecahill

Yes, this happens for raw scores. I suppose it is possible that the new predictions would be wildly out of scale and yet still overlap with the human labels, in which case we would have the same issue. Maybe (2) is a better solution then?

aloukina avatar Jan 23 '20 16:01 aloukina

> if the range is greater than a certain threshold.

what does range refer to exactly? The range of the predicted scores?

desilinguist avatar Jan 28 '20 17:01 desilinguist

For kappa computation the set of possible kappa labels is defined as all integers between min(min(human), min(system)) and max(max(human), max(system)). This is what I mean by range. In the example above this becomes a very large set since the two sets of scores are on different scales.
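To illustrate (a minimal sketch of the label-set construction described above, not the actual rsmtool code; the variable names are hypothetical):

```python
import math

# Human scores are on a 1-5 scale; these system scores are the
# out-of-scale SGDRegressor predictions from the original report.
human = [1, 2, 3, 4, 5]
system = [1233372304332.22, 1723509896207.16]

# The kappa label set is every integer between the combined min and max.
lowest = math.floor(min(min(human), min(system)))
highest = math.ceil(max(max(human), max(system)))
n_labels = highest - lowest + 1

# Over 1.7 trillion labels -- far too many to materialize as a list.
print(n_labels)
```

With scores on the same scale this set would have a handful of elements; here it has over a trillion, which is why the computation fails.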

aloukina avatar Jan 28 '20 18:01 aloukina

Ah, I see. And what sort of threshold would you use?

desilinguist avatar Jan 28 '20 18:01 desilinguist

The goal is to make sure that users get a kappa value whenever we can compute one.

We could go for 500,000,000 based on https://stackoverflow.com/questions/855191/how-big-can-a-python-list-get, or set it to 100,000 to allow for lower-resource machines.

aloukina avatar Jan 28 '20 18:01 aloukina

I don't think we want to go for a worst-case number like that, since that can still lead to memory errors. I feel like this threshold should depend on the range of the human score labels, no? I mean, if the human score labels are range(1, 6), then perhaps we should only allow some multiple of that range size?

desilinguist avatar Jan 28 '20 18:01 desilinguist

So if the range is 6, we allow for example 60?

aloukina avatar Jan 28 '20 18:01 aloukina

Is there a realistic scenario where the ranges diverge by more than that and kappa will still be meaningful?

desilinguist avatar Jan 28 '20 18:01 desilinguist

No, I don't think so. I suppose theoretically you could end up with human 1, 2, 3, 4 and system 1, 200, 400, 600, but kappa won't be particularly meaningful in that case. So we do (max(human) - min(human)) * 10?
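Something like this, perhaps (a hypothetical guard sketching the proposed check; `kappa_is_feasible` and the `multiplier` default are illustrative, not rsmtool API):

```python
def kappa_is_feasible(human, system, multiplier=10):
    """Return True if the combined label range is at most `multiplier`
    times the range of the human scores; otherwise skip kappa."""
    human_range = max(human) - min(human)
    combined_range = max(max(human), max(system)) - min(min(human), min(system))
    return combined_range <= multiplier * human_range


# System scores on the same scale as human scores: kappa is computed.
print(kappa_is_feasible([1, 2, 3, 4], [1, 2, 3, 5]))        # True
# System scores wildly out of scale: kappa is skipped.
print(kappa_is_feasible([1, 2, 3, 4], [1, 200, 400, 600]))  # False
```

This would skip kappa for the original SGDRegressor case while still allowing predictions that are modestly out of range.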

aloukina avatar Jan 28 '20 18:01 aloukina