gerbil icon indicating copy to clipboard operation
gerbil copied to clipboard

Improve performance of position correction

Open MichaelRoeder opened this issue 7 years ago • 0 comments

Task description

In GERBIL, we are using NIF as vocabulary to describe annotated text documents. Since GERBIL is implemented in Java it uses the number of characters to determine the length of a String (and thus the positions of entities inside a given String). However, the NIF standard uses Unicode code points for determining the length of a String. Since Java uses 16 Bits for its characters, it is possible that a single Unicode code point needs more than a single character to be stored. When parsing a NIF document, GERBIL has to make sure that the positions of entities are correctly transformed into a Java String.

Example

Let's assume we have a document with the following text

Peter applauds 👏 Mary

and Peter (start=0, end=5) and Mary (start=17, end=21) have been marked in NIF (the numbers in brackets are their positions counted with Unicode code points). However, the character 👏 can not be expressed in a single Java character.

        System.out.println("Peter applauds 👏 Mary".indexOf("Peter")); // prints 0
        System.out.println("Peter applauds 👏 Mary".indexOf("Mary"));  // prints 18 (!)

It can be seen that the position of the entity Mary has to be corrected.

Current solution

The current solution searches for the first character that has the given code point count, e.g., in the case of Mary in the example above, it would search for the first character that has 17 code points from the beginning of the text to this character. The Java String class does support the counting of Unicode code points but only for sub strings. It is not possible to iterate over the Java characters of a String and sum up their code points since this will lead to a wrong sum. Because of this reason, the current implementation is working with the following loop:

        for (int i = 0; i < text.length(); ++i) {
            codePointsCount = text.codePointCount(0, i + 1);
            ...
        }

which leads to a complexity of O(n²) where n is the length of the text.

Your implement the method correctAnnotationPositions to correct the positions in linear time (regarding the length of the given text, sorting the entities won't be possible in linear time).

Hints

  • With text.codePoints() a stream of code points can be created.
  • With StringBuilder#appendCodePoint(codePoint) these code points can be transformed into Java characters.
  • You might have to adapt the sorting at the beginning of the method to put the entities (the Span objects) into the order that you need.
  • Please do not try to simply search for the entities in the text to find their position (as done in the example above) because you can not ensure that the substring you have found is the searched entity (although the probability might be high).

Testing

https://github.com/dice-group/gerbil/blob/gerbil.nif.transfer/src/test/java/org/aksw/gerbil/transfer/nif/NIFTransferTest.java is a JUnit test that can be used for testing the correctness of the method.

Forking / Merging

The solution should be implemented in a branch, forked from the gerbil.nif.transfer branch of the project. The pull request should merge the solution back into this branch.

Responsible contact

Michael Röder

MichaelRoeder avatar Dec 07 '17 10:12 MichaelRoeder