pragmatic_segmenter

Whitespace getting mangled even with clean turned off

Open akhudek opened this issue 6 years ago • 10 comments

I'm trying to test this library against some larger English corpora, but I'm running into trouble aligning the results back to the original text. Even with "clean" turned off, the resulting sentences have modified whitespace in a seemingly unpredictable way.

Unfortunately I can't supply the data to demonstrate the problem due to license restrictions. Looking at the code, it doesn't appear that there is any easy way to ensure that the returned sentence text is fully unmodified. Is that correct?

akhudek, Mar 01 '18 20:03

Hi @akhudek. When I wrote this I wasn't too worried about keeping whitespace unmodified, so even with clean turned off it might get modified. It might be a small change to fix that, I'm not sure though without some test examples. Would you be able to provide some examples? In this case, all we care about is the white space so you can replace any words with foo.

diasks2, Mar 01 '18 22:03

The data is from http://ota.ox.ac.uk/desc/2551, which is actually free but oddly enough has a license preventing redistribution. Here is a censored sample that exhibits the issue, plus several segmentation issues. Perhaps the length of the string is contributing: trying various smaller substrings produces different outputs.

  AAAAAAA'A aaa Aaaaaaaaa, Aa Aaaa Aaaaaa, aaaaaaaaa aaaaaa aa aaaa aaa aaaaaaa aaaa aaaa a aaaaaaa aa aaaaaa aaa aaaaaaa aaaaaaa aaaaa aaaaaa Aaaaaaaaaa, aaaaaaaaaaaaa aaaaaaa aaa, aaaa aaaa aa Aaaaaa.   Aaaaaaaaaa aa Aaaaaaa Aaaaaa Aaaa, aaaaaa aaaaa aaaaaa aaaaaaaa aa aaa Aaaaaaaaa aaaaaaa. Aaaaa 0,000 aaaaaaa aa aaa Aaaaaa aaaaaaa aa aaaaaaaa aaaaaaa aaa aaa Aaaa aaaaa aaaa aaaaa aaaaaaa Aa Aaaaaa aa aaaaa aa Aaaaaa.   `Aaaaaa aaa aaaaaaaa aa," aaa aaaaaa aaaa. Aaaaaaa aaaaaaa aaaa Aaaaaaa'a Aaaaaaaaaa aaa aaaa aaa aaaa aa Aaaaaaaaa Aaaaaaa Aaaaaaaaaa.  Aaa Aaaa aaaaaaaa aaa Aaaa aaaaa aaaa aaa aaaaaaa aa aaaaaa a aaaaa, aaaaa aaaaaaaaa aaa aaa aaaaaaaaaa aa Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa.   Aa Aaaaaa, a aaa-Aaaaaa Aaaaaaaa Aaaaaaaaa, aaa aaaaaaaa aa aaa aaaaaaa aa Aaaaa aa aaaAaaaaa-aaaaaaaaaa aaaaa aa aaaaaaa aaaa aaaaa aaaaaaa a aaaaaaa, aaa aa aa aaaaaaaaa aaaa aaaa.   Aaaaaaaa aaaa Aa Aaaaaa aaaaa aaa aaaa aa aaaaaaa aaaaaaa aaaaaaa aaa aaaa aaaa aaaaaaaa aaaa aaaaa aaa 00 aaaaa.   Aa Aaaaaa aaa aaaa aaaaaaa aaaaa aaa aaaaaaaaaaa aa aaa aaaaaaa, aaa aa aaaaaaaa aa aaaaaaa Aa Aaaaa aa-Aaaa, Aaaaa Aaaaaaaa aa a aaaaa aaaaaaaaaaaaaa aa Aaa Aaaa.   Aaa aaaaa aaaaaaaaa aa Aaaa Aaaaaa aaaaaaaaa a aaaaa aa aaaaaaaa aa aaa Aaaaaaaaa aaaa. Aaaaa aaaa aaa aaaaaaaa aaa aaaaaa Aaa Aaaa'a aaaaaaaa aa aaaaa aaa aaaaaaaaaaaa aaaaaaaa aaa Aaaaaaaaaa aaaa Aa Aaaaaa.   Aaaaaaa'a aaaaaaaaaaaa aaaaaa aa Aa'aaaa aa aaaa aa Aaa Aaaa, aa Aa Aaaaaa aaa aaaaaaaa aa aaaa aaa aaaaaaaaaa aa Aaaaaa-aaaa Aaaa Aaaaaa.   Aa Aaaaaa aaa aaaaaa aa aaa Aaaaaa aaaaa, aa-Aaaaaaa, aa aaaaaa aa aaaaa aaaa aa aaaaaaa aaaa aaaa Aaaaa. `Aaaaaaaa Aaaaaaa aaa Aaaaa aaa aaa aaaaaaaaa aaaa aaa aaa aaaaaa aa aaa aaaaaa[aaaaaa] aaaaa aaa aaaaaaaaaaa aa aaaaaaaaaaa."   AAAA   AAAAAAAA aaaaaa aaaaaaaaa aaa aaaaaaaaaaaaa aaaaaaaaaaaa aa a aa-aaaa aaaaa Aaaa-Aaa Aaaaa aaaaaaaaaa aaaaaaa aaaaaa aaaaaaaaa aa aa aaaaaa aa aaaa aaaa 0,000 aaa aaaaaaa aaa aaaaaa aaaaaa. Aaaa aaaaa aaaaaaaaa aaaaaaaaa, aaaaaaaa

An example of the problem is at:

Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa.   Aa Aaaaaa,

where there are three spaces after the period. After segmentation I get:

Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa,

Perhaps it's notable that these whitespace modifications are also places where it should have marked a sentence boundary. In this case the above snippet is returned within a single "sentence" as returned by the library.

akhudek, Mar 01 '18 22:03

What is your expected behavior? Does the whitespace belong to the preceding segment or the segment following the whitespace? What you describe as whitespace modification, I would argue is possibly not the responsibility of a library like this.

it "Whitespace example #1" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is!              I found it.", language: "en")
  expect(ps.segment).to eq(["There it is!", "I found it."])
end

it "Whitespace example #2" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is!              I found it.", language: "en")
  expect(ps.segment).to eq(["There it is!", "              I found it."])
end

it "Whitespace example #3" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is!              I found it.", language: "en")
  expect(ps.segment).to eq(["There it is!              ", "I found it."])
end

it "Whitespace example #4" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is!              I found it.", language: "en")
  expect(ps.segment).to eq(["There it is!      ", "        I found it."])
end

I think we can rule out example 4 above. Example 1 is the current behavior. You are proposing Example 2 or Example 3 if I am understanding you correctly.

This is Wikipedia's definition: "Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation is the problem in natural language processing of deciding where sentences begin and end". In my opinion, sentences don't begin or end with whitespace.

diasks2, Mar 01 '18 23:03

If I put the entire string above I get the following sentences:

AAAAAAA'A aaa Aaaaaaaaa, Aa Aaaa Aaaaaa, aaaaaaaaa aaaaaa aa aaaa aaa aaaaaaa aaaa aaaa a aaaaaaa aa aaaaaa aaa aaaaaaa aaaaaaa aaaaa aaaaaa Aaaaaaaaaa, aaaaaaaaaaaaa aaaaaaa aaa, aaaa aaaa aa Aaaaaa.


Aaaaaaaaaa aa Aaaaaaa Aaaaaa Aaaa, aaaaaa aaaaa aaaaaa aaaaaaaa aa aaa Aaaaaaaaa aaaaaaa.


Aaaaa 0,000 aaaaaaa aa aaa Aaaaaa aaaaaaa aa aaaaaaaa aaaaaaa aaa aaa Aaaa aaaaa aaaa aaaaa aaaaaaa Aa Aaaaaa aa aaaaa aa Aaaaaa.


`Aaaaaa aaa aaaaaaaa aa," aaa aaaaaa aaaa. Aaaaaaa aaaaaaa aaaa Aaaaaaa'a Aaaaaaaaaa aaa aaaa aaa aaaa aa Aaaaaaaaa Aaaaaaa Aaaaaaaaaa.  Aaa Aaaa aaaaaaaa aaa Aaaa aaaaa aaaa aaa aaaaaaa aa aaaaaa a aaaaa, aaaaa aaaaaaaaa aaa aaa aaaaaaaaaa aa Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa, a aaa-Aaaaaa Aaaaaaaa Aaaaaaaaa, aaa aaaaaaaa aa aaa aaaaaaa aa Aaaaa aa aaaAaaaaa-aaaaaaaaaa aaaaa aa aaaaaaa aaaa aaaaa aaaaaaa a aaaaaaa, aaa aa aa aaaaaaaaa aaaa aaaa. Aaaaaaaa aaaa Aa Aaaaaa aaaaa aaa aaaa aa aaaaaaa aaaaaaa aaaaaaa aaa aaaa aaaa aaaaaaaa aaaa aaaaa aaa 00 aaaaa. Aa Aaaaaa aaa aaaa aaaaaaa aaaaa aaa aaaaaaaaaaa aa aaa aaaaaaa, aaa aa aaaaaaaa aa aaaaaaa Aa Aaaaa aa-Aaaa, Aaaaa Aaaaaaaa aa a aaaaa aaaaaaaaaaaaaa aa Aaa Aaaa. Aaa aaaaa aaaaaaaaa aa Aaaa Aaaaaa aaaaaaaaa a aaaaa aa aaaaaaaa aa aaa Aaaaaaaaa aaaa. Aaaaa aaaa aaa aaaaaaaa aaa aaaaaa Aaa Aaaa'a aaaaaaaa aa aaaaa aaa aaaaaaaaaaaa aaaaaaaa aaa Aaaaaaaaaa aaaa Aa Aaaaaa. Aaaaaaa'a aaaaaaaaaaaa aaaaaa aa Aa'aaaa aa aaaa aa Aaa Aaaa, aa Aa Aaaaaa aaa aaaaaaaa aa aaaa aaa aaaaaaaaaa aa Aaaaaa-aaaa Aaaa Aaaaaa. Aa Aaaaaa aaa aaaaaa aa aaa Aaaaaa aaaaa, aa-Aaaaaaa, aa aaaaaa aa aaaaa aaaa aa aaaaaaa aaaa aaaa Aaaaa. `Aaaaaaaa Aaaaaaa aaa Aaaaa aaa aaa aaaaaaaaa aaaa aaa aaa aaaaaa aa aaa aaaaaa[aaaaaa] aaaaa aaa aaaaaaaaaaa aa aaaaaaaaaaa."

And a few more. The fourth sentence should actually have been split into several sentences, but I'm leaving that aside for now. The unexpected whitespace modification occurs within the fourth sentence, where the number of spaces is changed in some places. One of them is after the period here:

Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa,

But there are other places too. I would have expected that if it returns a string as a sentence, the text within would be unmodified.
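
One way to narrow this down is to check which returned segments no longer occur verbatim in the input. The following is only a minimal sketch: `modified_segments` is a hypothetical helper, not part of pragmatic_segmenter, and it takes the segment array as a plain argument so it runs without the gem.

```ruby
# Hypothetical helper (not part of pragmatic_segmenter): returns the
# segments that cannot be found verbatim in the original text.
def modified_segments(text, segments)
  segments.reject { |segment| text.include?(segment) }
end

# The snippet from this report: three spaces after the period in the input,
# one space in the returned segment.
original = "Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa.#{' ' * 3}Aa Aaaaaa,"
returned = ["Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa,"]

# Prints the segments whose text was altered relative to the input.
p modified_segments(original, returned)
```

Any segment reported here is a candidate for a short, isolated test case.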

akhudek, Mar 01 '18 23:03

Could you create a test case for the above? Including the:

  • Original string
  • Array of the segmented strings you are expecting

diasks2, Mar 01 '18 23:03

If possible, it would be nice if we could isolate each issue to its own test case, with the example sentence as short as possible to still reveal the issue.

diasks2, Mar 01 '18 23:03

I'm going to integrate the MASC into my tests, it has a more permissive license than the BNC and I can include test cases from that. I suspect part of the issue may actually be the length of the string, but will try to narrow down.

akhudek, Mar 02 '18 05:03

Does the whitespace belong to the preceding segment or the segment following the whitespace? What you describe as whitespace modification, I would argue is possibly not the responsibility of a library like this... definition:" [...] where sentences begin and end". In my opinion, sentences don't begin or end with whitespace.

I would agree with that. In fact in my use case what I actually need to know is where each sentence begins and ends. So, to answer:

What is your expected behavior?

I'd say, ideally:

it "Whitespace example #1" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is!              I found it.", language: "en")
  expect(ps.segment).to eq([
    { begin: 0, end: 12, text: "There it is!"},
    { begin: 26, end: 37, text: "I found it."}
  ])
end

(Of course given begin: and end: then text: is redundant, but still it would be handy when inspecting the results.)

In my use case I reconstruct the begin and end values by marching through the results a sentence at a time while stepping through the input, searching for each subsequent sentence after some amount of white space from the previous sentence. Doable, but would be nice if I didn't need to.
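
A rough sketch of that kind of realignment is shown below. `locate_segments` is a hypothetical helper, not part of pragmatic_segmenter. It assumes the segmenter changed only whitespace, so each segment is matched token by token with any run of whitespace allowed between tokens, searching forward from the end of the previous match. Offsets are half-open (`end` is one past the last character).

```ruby
# Hypothetical helper (not part of pragmatic_segmenter): maps each returned
# segment back to character offsets in the original text.
# Assumes only whitespace differs between a segment and its source span.
def locate_segments(text, segments)
  cursor = 0
  segments.map do |segment|
    # Escape each token and allow any run of whitespace between tokens,
    # since the segmenter may have collapsed it.
    pattern = Regexp.new(segment.split.map { |t| Regexp.escape(t) }.join('\s+'))
    match = pattern.match(text, cursor)
    raise ArgumentError, "segment not found: #{segment.inspect}" unless match

    # Continue the search after this segment so repeated sentences align in order.
    cursor = match.end(0)
    { begin: match.begin(0), end: match.end(0), text: match[0] }
  end
end

text = "There it is!" + " " * 14 + "I found it."
locate_segments(text, ["There it is!", "I found it."]).each { |span| p span }
```

Here `text:` holds the slice of the original input, so `text[span[:begin]...span[:end]]` returns it unmodified. If the segmenter changes anything other than whitespace, the helper raises instead of guessing.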

ronen, Mar 03 '18 07:03

Here is a simple example:

t = "Its registration number is 322 284 696 R.C.S. Paris,     and its telephone number is [     ]."

ps = PragmaticSegmenter::Segmenter.new(text: t, language: 'en', clean: false)
=> #<PragmaticSegmenter::Segmenter:0x00007fae9be30660
 @doc_type=nil,
 @language="en",
 @language_module=PragmaticSegmenter::Languages::English,
 @text="Its registration number is 322 284 696 R.C.S. Paris,     and its telephone number is [     ].">

ps = PragmaticSegmenter::Segmenter.new(text: t, language: 'en', clean: false).segment
=> ["Its registration number is 322 284 696 R.C.S. Paris, and its telephone number is [ ]."]

Notice that the runs of spaces before "and" and inside the brackets "[     ]" are each collapsed to a single space.

echan00, Jul 30 '19 17:07

I ended up commenting out @language::ExtraWhiteSpaceRule in my fork to resolve this

https://github.com/echan00/pragmatic_segmenter/commit/e5e4244bacd0bd12e65b560b648d331980fc1ce4
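
For illustration, the outputs in this thread are consistent with a rule that replaces any run of three or more whitespace characters with a single space: runs of three and five spaces collapse, while the double space in the long sample survives. The sketch below reproduces that behavior with plain `String#gsub`. The pattern is an assumption inferred from those outputs, and `collapse_extra_whitespace` is a hypothetical name; the actual rule definition in the library may differ.

```ruby
# Assumed pattern, inferred from the outputs in this thread; the actual
# definition in pragmatic_segmenter may differ.
ASSUMED_EXTRA_WHITESPACE = /\s{3,}/

# Hypothetical stand-in for the effect of the rule on the input text.
def collapse_extra_whitespace(text)
  text.gsub(ASSUMED_EXTRA_WHITESPACE, ' ')
end

gap = ' ' * 5
t = "Its registration number is 322 284 696 R.C.S. Paris,#{gap}and its telephone number is [#{gap}]."

# Runs of five spaces collapse to one.
puts collapse_extra_whitespace(t)
# A double space is left alone under this assumed pattern.
puts collapse_extra_whitespace("Aaaaaaaaaa.  Aaa Aaaa")
```

If that assumption holds, skipping the rule leaves such runs intact, which is consistent with the fix described above.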

echan00, Jul 30 '19 18:07