Improvement of the recovery of Pragmatic Segmenter sentence segmentation offsets with respect to the original text
The Pragmatic Segmenter seems to modify the output string, which makes the extraction of the sentence offsets more complicated.
This PR makes two changes:
- Implementation of a diff-based approach to correctly extract sentence segmentation offsets when the sentences do not match the original text 100%
- Fix for issue #753
Recovery of segmentation offset using Pragmatic Segmenter
This implementation uses an external library to compute the diff between each sentence and the original text.
For example:

original = "This is the original text. Some spaces are going to be removed."
sentence = "This is the original text."

The idea is to get the starting and ending offsets of `sentence` within `original`, taking `sentence` as the reference.
After translating the diff into a character-based string, it computes the starting and ending offsets of the sentence within the text.
The diff is a list of strings where each element is structured as [operation, space, character], with operation being `+`, `-`, or ` ` (space). For example, comparing the strings `a` and `ab` would result in the following:

diff = ['  a', '- b']
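As a minimal illustration of this diff shape, Python's standard-library `difflib.ndiff` happens to produce entries in the same [operation, space, character] form (the actual GROBID implementation is Java and uses a different external library, so this is only a stand-in):

```python
import difflib

# Each entry is an operation ('-', '+', or ' '), a space, and the character:
# '  a' (equal), '- b' (only in the first string), '+ b' (only in the second).
diff = list(difflib.ndiff("ab", "a"))
print(diff)  # → ['  a', '- b']
```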
After the diff is computed, the offsets are recovered using a two-pass heuristic:
- starting from the left, we collect all the characters beginning with the first character that is common to both strings
- starting from the right, we remove all the characters that are not equal in the diff.

The heuristic is also limited to a subset of the string when a previous sentence has already been identified.
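The two-pass heuristic above can be sketched as follows. This is a hypothetical Python illustration built on `difflib.ndiff` (the function name `recover_offsets` is invented for this sketch; the real implementation is in Java and may differ in detail):

```python
import difflib


def recover_offsets(original: str, sentence: str) -> tuple[int, int]:
    """Sketch of the two-pass heuristic: find the start/end offsets
    of `sentence` within `original` from a character-level diff."""
    # '?' hint lines are not part of the [operation, space, character] entries
    diff = [d for d in difflib.ndiff(original, sentence) if not d.startswith('?')]

    # Left pass: advance over the original until the first character
    # that is common to both strings.
    start = 0
    i = 0
    while i < len(diff) and not diff[i].startswith(' '):
        if diff[i][0] == '-':  # character present only in `original`
            start += 1
        i += 1

    # Right pass: drop trailing characters that are not equal in the diff.
    end = len(original)
    j = len(diff) - 1
    while j >= 0 and not diff[j].startswith(' '):
        if diff[j][0] == '-':
            end -= 1
        j -= 1

    return start, end
```

For instance, `recover_offsets("Hello world. Bye.", "Hello world.")` returns `(0, 12)`, so `original[0:12]` recovers the sentence from the original text.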
coverage: 39.498% (-0.4%) from 39.903% when pulling 5aca6b85c306bafa2b94076ed36a745e64553b48 on sentence-segmentation-detection into 5b145364914d984ecbe6ec4afaaf57abd65b2a4a on master.
The last commit a577523 should fix issue #753.