Krippendorff's Alpha for position - implausible results
Describe the bug I tried to understand how Krippendorff's Alpha (unitizing) for position is implemented and annotated a test document with two annotators. One annotator has 5 annotations, the other just one. If I have 1 span with an exact match, the calculated score is 0.4; if I have one span with an overlap match (the span differs by one token), I get 0.42. How is this possible? In fact, I would like to better understand how KA is implemented, especially how the agreement matrix is calculated, because I have observed implausible results for other documents in my corpus as well. I would also like to know whether it would be possible to implement another metric (e.g. Gamma) that seems more suitable for dealing with overlapping spans.
Please complete the following information:
- Version and build ID: 35.2 (2025-02-04 07:13:24, build 18f5fdcd)
- OS: Win
- Browser: Chrome
Thanks!
INCEpTION uses DKPro Agreement.
There is a paper and a couple of presentations that introduce it:
- http://aclweb.org/anthology/C14-2023
- https://dkpro.github.io/dkpro-statistics/dkpro-agreement-tutorial.pdf
- https://dkpro.github.io/dkpro-statistics/inter-rater-agreement-tutorial.pdf
The implementation is here:
https://github.com/dkpro/dkpro-statistics/tree/main/dkpro-statistics-agreement/src/main/java/org/dkpro/statistics/agreement/unitizing
If you want to understand it, maybe start looking at that. If you get the correct numbers there, then there might be a bug in the way that INCEpTION calls DKPro Agreement. However, if you already get unexpected numbers in DKPro Agreement, then it might have a bug itself.
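For a quick sanity check outside of INCEpTION, a minimal standalone program against DKPro Agreement's unitizing API could look roughly like the sketch below. The continuum length, offsets, and category are made-up values that only mirror the kind of two-rater, partially overlapping setup described above, and the exact method signatures reflect my reading of the library rather than anything authoritative.

import org.dkpro.statistics.agreement.unitizing.KrippendorffAlphaUnitizingAgreement;
import org.dkpro.statistics.agreement.unitizing.UnitizingAnnotationStudy;

public class AlphaUSanityCheck
{
    public static void main(String[] args)
    {
        // Two raters over a continuum of 15 character positions (example values)
        var study = new UnitizingAnnotationStudy(2, 15);

        // Rater 0: spans [0-7] and [8-9], both with category "a"
        study.addUnit(0, 7, 0, "a"); // offset, length, rater index, category
        study.addUnit(8, 1, 0, "a");

        // Rater 1: an overlapping span [0-4] and the same short span [8-9]
        study.addUnit(0, 4, 1, "a");
        study.addUnit(8, 1, 1, "a");

        // Krippendorff's unitizing alpha as implemented in DKPro Agreement
        var alpha = new KrippendorffAlphaUnitizingAgreement(study);
        System.out.println(alpha.calculateAgreement());
    }
}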
I have also tried doing a port of TextGamma to DKPro Agreement here:
https://github.com/dkpro/dkpro-statistics/pull/39
However, so far this port is lacking qualified review and testing to say whether it produces the expected results. Personally, I believe that Gamma is quite strange. In particular, it uses randomly generated deviations to calculate the expected disagreement. Since these deviations are random, the expected disagreement is also random - meaning the agreement score is random. Of course, there are some statistical effects which constrain the randomness of the final result. However, it seems strange to me to accept that an agreement score will fluctuate (even a little) every time it is calculated.
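To make that concrete, here is a toy sketch - not the TextGamma port, just an invented illustration of why any chance-corrected score of the form 1 - Do/De fluctuates when De is estimated from random samples. The "random deviation" model and all numbers below are made up purely for demonstration.

import java.util.Random;

public class RandomExpectedDisagreementDemo
{
    public static void main(String[] args)
    {
        // Fixed observed disagreement between two annotators (made-up value)
        double observedDisagreement = 0.30;

        // Estimate the expected disagreement from randomly perturbed annotations.
        // The perturbation model below is a pure stand-in, but the effect is the
        // same: a Monte-Carlo estimate differs slightly from run to run.
        var rng = new Random();
        int samples = 30;
        double expectedDisagreement = 0;
        for (int i = 0; i < samples; i++) {
            expectedDisagreement += 0.5 + 0.1 * rng.nextGaussian();
        }
        expectedDisagreement /= samples;

        // Chance-corrected agreement in the usual 1 - Do/De form
        double score = 1.0 - observedDisagreement / expectedDisagreement;
        System.out.printf("De=%.4f  score=%.4f%n", expectedDisagreement, score);
        // Run it twice: the score changes a little each time, because the
        // expected disagreement is itself a random quantity.
    }
}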
If you look into DKPro Agreement's Krippendorff's Alpha and/or the Gamma branch, it is best to open issues/comments in that repo.
If you find everything to be in order in DKPro Agreement and suspect that INCEpTION is calling it the wrong way, it is best to comment here again.
What may also help you is the diff export that you can get from the agreement page. For pairwise agreement, use the export that you get from clicking on a cell in the pairwise agreement table. For document-wise agreement, you can use the diff export in the sidebar. The table that is produced here is more-or-less a dump of the data that INCEpTION passes (or not) to DKPro Agreement.
In particular, you can find the offsets of the positions that are passed to the agreement measure. Also look out for the USED flag, which indicates whether a data point has been passed on to the measure. The measure does not see any lines that are not marked with this flag.
I did a little experiment in INCEpTION in a unit test. Notation: [x-y] = offsets, (a) = label.
Setup 1:
- User 1: [0-4](a) [8-9](a)
- User 2: [0-4](a) [8-9](a)
- Agreement: 0.9454

Setup 2:
- User 1: [0-7](a) [8-9](a)
- User 2: [0-4](a) [8-9](a)
- Agreement: 0.6833
At least in this little experiment, the agreement degrades when there is an overlap match instead of an exact match.
Code (adjust the offsets of user 1 manually to test):
@Test
void test() throws Exception
{
    // Define a span layer with a multi-value string feature
    var layer = new AnnotationLayer(MULTI_VALUE_SPAN_TYPE, MULTI_VALUE_SPAN_TYPE,
            SpanLayerSupport.TYPE, project, false, SINGLE_TOKEN, NO_OVERLAP);
    layer.setId(1L);
    layers.add(layer);

    var feature = new AnnotationFeature(project, layer, "values", "values",
            TYPE_NAME_STRING_ARRAY);
    feature.setId(1L);
    feature.setLinkMode(NONE);
    feature.setMode(ARRAY);
    features.add(feature);

    // User 1: two annotations, [0-7] and [8-9], both labeled "a"
    var user1 = createCas(createMultiValueStringTestTypeSystem());
    user1.setDocumentText("This is a test.");
    buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
            .at(0, 7) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();
    buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
            .at(8, 9) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    // User 2: a single annotation at [0-4], labeled "a"
    var user2 = createCas(createMultiValueStringTestTypeSystem());
    user2.setDocumentText("This is a test.");
    buildAnnotation(user2, MULTI_VALUE_SPAN_TYPE) //
            .at(0, 4) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    // Calculate the agreement between the two users
    var measure = sut.createMeasure(feature, traits);
    var result = measure.getAgreement(Map.of( //
            "user1", user1, //
            "user2", user2));
    System.out.println(result.getAgreement());
}
INCEpTION will be filtering out all but the first/longest of the overlapping annotations now, so hopefully this will address the issue somewhat.
The only downside is that the diff export will still show these annotations as USED even though they won't actually be used.
https://github.com/inception-project/inception/issues/5348