Improvement of the recovery of Pragmatic Segmenter sentence segmentation text with respect to the original text offsets

Open lfoppiano opened this issue 3 years ago • 2 comments

The Pragmatic Segmenter seems to modify the output string, which makes extracting the sentence offsets more complicated.

This PR makes two changes:

  • Implementation of a diff-based approach to correctly extract the sentence segmentation offsets when the sentences do not match the original text exactly
  • Fix for issue #753

Recovery of segmentation offset using Pragmatic Segmenter

This implementation uses an external library to compute the diff of the sentence and the original text.

For example:

original = "This is the original text. Some spaces are going to be removed."
sentence = "This is the original text."

The idea is to find the start and end offsets of sentence within original, using sentence as the reference. After the diff is translated into a character-based string, the start and end of the sentence within the text are computed from it.

The diff is a list of strings where each element is structured as [operation, space, character], with operation being +, - or a blank (for a character common to both strings). For example, comparing the strings a and ab would result in the following: diff = ['  a', '- b']
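
The PR description does not name the diff library. As a minimal sketch, assuming Python's standard `difflib` (which produces elements in the same [operation, space, character] shape), a character-level diff can be obtained as follows. The helper name `char_diff` is hypothetical. Note that with `difflib` the sign depends on argument order: `-` marks a character present only in the first string, `+` one present only in the second.

```python
import difflib


def char_diff(first: str, second: str) -> list[str]:
    """Character-level diff; each element is '<op> <char>' with op in ' ', '-', '+'.

    Assumption: difflib stands in for the unnamed external library.
    """
    # ndiff iterates the strings character by character; '?' hint lines
    # (if any were emitted) are not part of the diff itself and are dropped.
    return [d for d in difflib.ndiff(first, second) if not d.startswith("?")]


# 'b' exists only in the first string, so it is marked with '-'.
print(char_diff("ab", "a"))
# Swapping the arguments marks it with '+' instead.
print(char_diff("a", "ab"))
```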

After the diff is computed, the offsets are recovered using a two-pass heuristic:

  • starting from the left, we collect all the characters starting from the first character that is common to both strings
  • starting from the right, we remove all the characters that are not marked as equal in the diff.

The heuristic is also limited to a substring of the text when a sentence has already been identified before.
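
The two-pass heuristic and the restriction to the text after a previously identified sentence can be sketched as below. This is an illustration under stated assumptions, not the code of this PR: it uses Python's `difflib` as the diff library, and the names `recover_span`, `recover_spans` and `search_from` are hypothetical.

```python
import difflib


def recover_span(original: str, sentence: str, search_from: int = 0) -> tuple[int, int]:
    """Return (start, end) offsets of `sentence` within `original`.

    Only original[search_from:] is searched, mirroring the restriction to
    the text that follows a previously identified sentence.
    """
    window = original[search_from:]
    # '  c' = common, '- c' = only in the window, '+ c' = only in the sentence.
    diff = [d for d in difflib.ndiff(window, sentence) if not d.startswith("?")]
    if not any(item.startswith("  ") for item in diff):
        raise ValueError("sentence shares no character with the searched text")

    # Pass 1 (left to right): skip everything before the first common character.
    start = 0
    for item in diff:
        if item.startswith("  "):
            break
        if item.startswith("- "):
            start += 1  # character exists only in the window: advance the offset

    # Pass 2 (right to left): drop trailing characters that are not common.
    end = len(window)
    for item in reversed(diff):
        if item.startswith("  "):
            break
        if item.startswith("- "):
            end -= 1

    return search_from + start, search_from + end


def recover_spans(original: str, sentences: list[str]) -> list[tuple[int, int]]:
    """Recover the offsets of consecutive sentences, searching only after the previous one."""
    spans = []
    cursor = 0
    for sentence in sentences:
        start, end = recover_span(original, sentence, search_from=cursor)
        spans.append((start, end))
        cursor = end
    return spans


# The segmenter is assumed to have collapsed the doubled spaces.
original = "This is the original text.  Some  spaces are going to be removed."
sentences = ["This is the original text.", "Some spaces are going to be removed."]
for start, end in recover_spans(original, sentences):
    print(start, end, repr(original[start:end]))
```

Because the offsets are counted on the original text, the recovered span keeps the characters the segmenter removed (here, the doubled space inside the second sentence).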

lfoppiano avatar Jan 27 '21 07:01 lfoppiano

Coverage Status

coverage: 39.498% (-0.4%) from 39.903% when pulling 5aca6b85c306bafa2b94076ed36a745e64553b48 on sentence-segmentation-detection into 5b145364914d984ecbe6ec4afaaf57abd65b2a4a on master.

coveralls avatar May 11 '21 00:05 coveralls

The last commit, a577523, should fix issue #753.

lfoppiano avatar Jul 29 '22 00:07 lfoppiano