
Krippendorff's Alpha for position - implausible results

Open bgittel opened this issue 9 months ago • 4 comments

Describe the bug
I tried to understand how Krippendorff's Alpha unitizing for position is implemented and had two annotators annotate a test doc. One annotator has 5 annotations, the other just one. If I have 1 span with an exact match, the calculated score is 0.4; if I have one span with an overlap match (the span differs by one token), I get 0.42. How is this possible? In fact, I would like to understand better how KA is implemented, especially how the agreement matrix is calculated, because I have observed implausible results for other docs in my corpus as well. Also, I would like to know whether it would be possible to implement another metric (e.g. Gamma) that seems more suitable for dealing with overlapping spans.

Please complete the following information:

  • Version and build ID: 35.2 (2025-02-04 07:13:24, build 18f5fdcd)
  • OS: Win
  • Browser: Chrome

Thanks!

bgittel avatar Feb 21 '25 15:02 bgittel

INCEpTION uses DKPro Agreement.

There is a paper and a couple of introductory presentations about it:

  • http://aclweb.org/anthology/C14-2023
  • https://dkpro.github.io/dkpro-statistics/dkpro-agreement-tutorial.pdf
  • https://dkpro.github.io/dkpro-statistics/inter-rater-agreement-tutorial.pdf

The implementation is here:

https://github.com/dkpro/dkpro-statistics/tree/main/dkpro-statistics-agreement/src/main/java/org/dkpro/statistics/agreement/unitizing

If you want to understand it, maybe start looking at that. If you get the correct numbers there, then there might be a bug in the way that INCEpTION calls DKPro Agreement. However, if you already get unexpected numbers in DKPro Agreement, then it might have a bug itself.
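For a quick cross-check outside of INCEpTION, the unitizing study can be driven directly in DKPro Agreement, roughly like the example in the tutorial slides linked above. This is a minimal sketch, assuming the addUnit(begin, length, raterIdx, category) signature from the tutorial; the offsets here are made up for illustration (rater 0: [0-7](a) [8-9](a), rater 1: [0-4](a) [8-9](a)):

import org.dkpro.statistics.agreement.unitizing.KrippendorffAlphaUnitizingAgreement;
import org.dkpro.statistics.agreement.unitizing.UnitizingAnnotationStudy;

public class AlphaUCheck
{
    public static void main(String[] args)
    {
        // Two raters, continuum of length 15 ("This is a test.")
        var study = new UnitizingAnnotationStudy(2, 15);

        // addUnit(begin, length, raterIdx, category)
        study.addUnit(0, 7, 0, "a"); // rater 0: [0-7](a)
        study.addUnit(8, 1, 0, "a"); // rater 0: [8-9](a)
        study.addUnit(0, 4, 1, "a"); // rater 1: [0-4](a)
        study.addUnit(8, 1, 1, "a"); // rater 1: [8-9](a)

        var alpha = new KrippendorffAlphaUnitizingAgreement(study);
        System.out.println(alpha.calculateAgreement());
    }
}

If the value printed here differs from what INCEpTION reports for the same spans, that would point to the glue code rather than to DKPro Agreement itself.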

I have also tried doing a port of TextGamma to DKPro Agreement here:

https://github.com/dkpro/dkpro-statistics/pull/39

However, so far this port lacks a qualified review and testing to say whether it produces the expected results. Personally, I find Gamma quite strange. In particular, it uses randomly generated deviations to calculate the expected disagreement. Since these deviations are random, the expected disagreement is also random - meaning the agreement score is random. Of course, there are some statistical effects which constrain the randomness of the final result. However, it seems strange to me to accept that an agreement score will fluctuate (even a little) every time it is calculated.
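To illustrate the point with a toy sketch of the general idea (not the TextGamma port above): if the expected disagreement is estimated by sampling random placements of units, then the normalized score depends on the random seed and shifts slightly from run to run.

import java.util.Random;

public class RandomBaselineDemo
{
    // Toy stand-in for a disagreement function between two unit positions.
    static double disagreement(int a, int b)
    {
        return Math.abs(a - b) / 100.0;
    }

    public static void main(String[] args)
    {
        double observed = disagreement(10, 12);

        for (long seed : new long[] { 1, 2, 3 }) {
            var rnd = new Random(seed);
            int samples = 1000;
            double expected = 0;
            // Estimate expected disagreement from randomly placed units.
            for (int i = 0; i < samples; i++) {
                expected += disagreement(rnd.nextInt(100), rnd.nextInt(100));
            }
            expected /= samples;
            // Gamma-style normalization: 1 - observed/expected.
            // The score differs slightly for each seed.
            System.out.printf("seed=%d score=%.4f%n", seed, 1 - observed / expected);
        }
    }
}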

If you look at DKPro Agreement's Krippendorff's Alpha and/or the Gamma branch, it is best to open issues/comments in that repo.

If you find everything to be in order in DKPro Agreement and suspect that INCEpTION is calling it the wrong way, best comment here again.

reckart avatar Feb 21 '25 17:02 reckart

What may also help you is the diff export that you can get from the agreement page. For pairwise agreement, use the export that you get from clicking on a cell in the pairwise agreement table. For document-wise agreement, you can use the diff export in the sidebar. The table that is produced here is more-or-less a dump of the data that INCEpTION passes (or not) to DKPro Agreement.

In particular, you can find there the offsets of the positions that are passed to the agreement measure. Also look out for the USED flag, which indicates whether a data point has been passed on to the measure. The measure does not see any lines that are not marked with this flag.

reckart avatar Feb 21 '25 18:02 reckart

I did a little experiment in INCEpTION in a unit test. Notation: [x-y] = offsets, (a) = label.

Setup 1:

  • User 1: [0-4](a) [8-9](a)
  • User 2: [0-4](a) [8-9](a)
  • Agreement: 0.9454

Setup 2:

  • User 1: [0-7](a) [8-9](a)
  • User 2: [0-4](a) [8-9](a)
  • Agreement: 0.6833

At least in this little experiment, the agreement degrades when there is an overlap match instead of an exact match.

Code (adjust offsets of user 1 manually to test)

@Test
    void test() throws Exception
    {
        // project, layers, features, sut, traits and the helper methods
        // (createCas, buildAnnotation, ...) are fields/utilities of the
        // enclosing INCEpTION test class.
        var layer = new AnnotationLayer(MULTI_VALUE_SPAN_TYPE, MULTI_VALUE_SPAN_TYPE,
                SpanLayerSupport.TYPE, project, false, SINGLE_TOKEN, NO_OVERLAP);
        layer.setId(1L);
        layers.add(layer);

        var feature = new AnnotationFeature(project, layer, "values", "values",
                TYPE_NAME_STRING_ARRAY);
        feature.setId(1L);
        feature.setLinkMode(NONE);
        feature.setMode(ARRAY);
        features.add(feature);

        // Annotations of user 1 - adjust the offsets here to switch setups
        var user1 = createCas(createMultiValueStringTestTypeSystem());
        user1.setDocumentText("This is a test.");
        buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
                .at(0, 7) //
                .withFeature("values", asList("a")) //
                .buildAndAddToIndexes();

        buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
                .at(8, 9) //
                .withFeature("values", asList("a")) //
                .buildAndAddToIndexes();

        // Annotations of user 2
        var user2 = createCas(createMultiValueStringTestTypeSystem());
        user2.setDocumentText("This is a test.");
        buildAnnotation(user2, MULTI_VALUE_SPAN_TYPE) //
                .at(0, 4) //
                .withFeature("values", asList("a")) //
                .buildAndAddToIndexes();

        var measure = sut.createMeasure(feature, traits);

        var result = measure.getAgreement(Map.of( //
                "user1", user1, //
                "user2", user2));

        System.out.println(result.getAgreement());
    }

reckart avatar Feb 22 '25 20:02 reckart

INCEpTION will now be filtering out all but the first/longest of the overlapping annotations, so hopefully this will address the issue somewhat.

Only downside is that the diff export will still show these annotations as USED even though they won't actually be used.

https://github.com/inception-project/inception/issues/5348
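For illustration, a rough sketch of that kind of filtering - this is not the actual INCEpTION code from the linked issue, just the general idea: order spans by begin offset with longer spans first, then drop any span that overlaps an already-kept one.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OverlapFilter
{
    record Span(int begin, int end) {}

    // Keep the first/longest of any group of overlapping spans.
    static List<Span> keepFirstLongest(List<Span> spans)
    {
        var sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt(Span::begin)
                .thenComparing(Comparator.comparingInt(
                        (Span s) -> s.end() - s.begin()).reversed()));

        var kept = new ArrayList<Span>();
        for (var s : sorted) {
            // Drop s if it overlaps the last span that was kept.
            if (kept.isEmpty() || s.begin() >= kept.get(kept.size() - 1).end()) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args)
    {
        // [0-4] overlaps the first/longest [0-7] and is dropped; [8-9] survives.
        System.out.println(keepFirstLongest(
                List.of(new Span(0, 7), new Span(0, 4), new Span(8, 9))));
    }
}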

reckart avatar Mar 16 '25 20:03 reckart