gerbil
gerbil copied to clipboard
Improve performance of position correction
Task description
In GERBIL, we are using NIF as vocabulary to describe annotated text documents. Since GERBIL is implemented in Java it uses the number of characters to determine the length of a String (and thus the positions of entities inside a given String). However, the NIF standard uses Unicode code points for determining the length of a String. Since Java uses 16 Bits for its characters, it is possible that a single Unicode code point needs more than a single character to be stored. When parsing a NIF document, GERBIL has to make sure that the positions of entities are correctly transformed into a Java String.
Example
Let's assume we have a document with the following text
Peter applauds 👏 Mary
and Peter (start=0, end=5)
and Mary (start=17, end=21)
have been marked in NIF (the numbers in brackets are their positions counted with Unicode code points). However, the character 👏
can not be expressed in a single Java character.
System.out.println("Peter applauds 👏 Mary".indexOf("Peter")); // prints 0
System.out.println("Peter applauds 👏 Mary".indexOf("Mary")); // prints 18 (!)
It can be seen that the position of the entity Mary
has to be corrected.
Current solution
The current solution searches for the first character that has the given code point count, e.g., in the case of Mary
in the example above, it would search for the first character that has 17 code points from the beginning of the text to this character. The Java String class does support the counting of Unicode code points but only for sub strings. It is not possible to iterate over the Java characters of a String and sum up their code points since this will lead to a wrong sum. Because of this reason, the current implementation is working with the following loop:
for (int i = 0; i < text.length(); ++i) {
codePointsCount = text.codePointCount(0, i + 1);
...
}
which leads to a complexity of O(n²) where n is the length of the text.
Your implement the method correctAnnotationPositions to correct the positions in linear time (regarding the length of the given text, sorting the entities won't be possible in linear time).
Hints
- With
text.codePoints()
a stream of code points can be created. - With
StringBuilder#appendCodePoint(codePoint)
these code points can be transformed into Java characters. - You might have to adapt the sorting at the beginning of the method to put the entities (the
Span
objects) into the order that you need. - Please do not try to simply search for the entities in the text to find their position (as done in the example above) because you can not ensure that the substring you have found is the searched entity (although the probability might be high).
Testing
https://github.com/dice-group/gerbil/blob/gerbil.nif.transfer/src/test/java/org/aksw/gerbil/transfer/nif/NIFTransferTest.java is a JUnit test that can be used for testing the correctness of the method.
Forking / Merging
The solution should be implemented in a branch, forked from the gerbil.nif.transfer
branch of the project. The pull request should merge the solution back into this branch.