pragmatic_segmenter
Whitespace getting mangled even with clean turned off
I'm trying to test this library against some larger English corpora, but I'm running into trouble aligning the results back to the original text. Even with "clean" turned off, the resulting sentences have their whitespace modified in a seemingly unpredictable way.
Unfortunately I can't supply the data to demonstrate the problem due to license restrictions. Looking at the code, it doesn't appear that there is any easy way to ensure that the returned sentence text is fully unmodified. Is that correct?
Hi @akhudek. When I wrote this I wasn't too worried about keeping whitespace unmodified, so even with clean turned off it might get modified. It might be a small change to fix that, but I'm not sure without some test examples. Would you be able to provide some? In this case all we care about is the whitespace, so you can replace any words with foo.
The data is from http://ota.ox.ac.uk/desc/2551, which is actually free but, oddly enough, has a license preventing redistribution. Here is a censored sample that exhibits the issue, plus several segmentation issues. Perhaps the length of the string is contributing: trying various smaller substrings produces different outputs.
AAAAAAA'A aaa Aaaaaaaaa, Aa Aaaa Aaaaaa, aaaaaaaaa aaaaaa aa aaaa aaa aaaaaaa aaaa aaaa a aaaaaaa aa aaaaaa aaa aaaaaaa aaaaaaa aaaaa aaaaaa Aaaaaaaaaa, aaaaaaaaaaaaa aaaaaaa aaa, aaaa aaaa aa Aaaaaa. Aaaaaaaaaa aa Aaaaaaa Aaaaaa Aaaa, aaaaaa aaaaa aaaaaa aaaaaaaa aa aaa Aaaaaaaaa aaaaaaa. Aaaaa 0,000 aaaaaaa aa aaa Aaaaaa aaaaaaa aa aaaaaaaa aaaaaaa aaa aaa Aaaa aaaaa aaaa aaaaa aaaaaaa Aa Aaaaaa aa aaaaa aa Aaaaaa. `Aaaaaa aaa aaaaaaaa aa," aaa aaaaaa aaaa. Aaaaaaa aaaaaaa aaaa Aaaaaaa'a Aaaaaaaaaa aaa aaaa aaa aaaa aa Aaaaaaaaa Aaaaaaa Aaaaaaaaaa. Aaa Aaaa aaaaaaaa aaa Aaaa aaaaa aaaa aaa aaaaaaa aa aaaaaa a aaaaa, aaaaa aaaaaaaaa aaa aaa aaaaaaaaaa aa Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa, a aaa-Aaaaaa Aaaaaaaa Aaaaaaaaa, aaa aaaaaaaa aa aaa aaaaaaa aa Aaaaa aa aaaAaaaaa-aaaaaaaaaa aaaaa aa aaaaaaa aaaa aaaaa aaaaaaa a aaaaaaa, aaa aa aa aaaaaaaaa aaaa aaaa. Aaaaaaaa aaaa Aa Aaaaaa aaaaa aaa aaaa aa aaaaaaa aaaaaaa aaaaaaa aaa aaaa aaaa aaaaaaaa aaaa aaaaa aaa 00 aaaaa. Aa Aaaaaa aaa aaaa aaaaaaa aaaaa aaa aaaaaaaaaaa aa aaa aaaaaaa, aaa aa aaaaaaaa aa aaaaaaa Aa Aaaaa aa-Aaaa, Aaaaa Aaaaaaaa aa a aaaaa aaaaaaaaaaaaaa aa Aaa Aaaa. Aaa aaaaa aaaaaaaaa aa Aaaa Aaaaaa aaaaaaaaa a aaaaa aa aaaaaaaa aa aaa Aaaaaaaaa aaaa. Aaaaa aaaa aaa aaaaaaaa aaa aaaaaa Aaa Aaaa'a aaaaaaaa aa aaaaa aaa aaaaaaaaaaaa aaaaaaaa aaa Aaaaaaaaaa aaaa Aa Aaaaaa. Aaaaaaa'a aaaaaaaaaaaa aaaaaa aa Aa'aaaa aa aaaa aa Aaa Aaaa, aa Aa Aaaaaa aaa aaaaaaaa aa aaaa aaa aaaaaaaaaa aa Aaaaaa-aaaa Aaaa Aaaaaa. Aa Aaaaaa aaa aaaaaa aa aaa Aaaaaa aaaaa, aa-Aaaaaaa, aa aaaaaa aa aaaaa aaaa aa aaaaaaa aaaa aaaa Aaaaa. `Aaaaaaaa Aaaaaaa aaa Aaaaa aaa aaa aaaaaaaaa aaaa aaa aaa aaaaaa aa aaa aaaaaa[aaaaaa] aaaaa aaa aaaaaaaaaaa aa aaaaaaaaaaa." AAAA AAAAAAAA aaaaaa aaaaaaaaa aaa aaaaaaaaaaaaa aaaaaaaaaaaa aa a aa-aaaa aaaaa Aaaa-Aaa Aaaaa aaaaaaaaaa aaaaaaa aaaaaa aaaaaaaaa aa aa aaaaaa aa aaaa aaaa 0,000 aaa aaaaaaa aaa aaaaaa aaaaaa. Aaaa aaaaa aaaaaaaaa aaaaaaaaa, aaaaaaaa
An example of the problem is at:
Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa.   Aa Aaaaaa,
where there are three spaces after the period. After segmentation I get:
Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa,
Perhaps it's notable that these whitespace modifications are also places where it should have marked a sentence boundary. In this case the above snippet is returned within a single "sentence" as returned by the library.
What is your expected behavior? Does the whitespace belong to the preceding segment or the segment following the whitespace? What you describe as whitespace modification, I would argue is possibly not the responsibility of a library like this.
it "Whitespace example #1" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is! I found it.", language: "en")
  expect(ps.segment).to eq(["There it is!", "I found it."])
end

it "Whitespace example #2" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is! I found it.", language: "en")
  expect(ps.segment).to eq(["There it is!", " I found it."])
end

it "Whitespace example #3" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is! I found it.", language: "en")
  expect(ps.segment).to eq(["There it is! ", "I found it."])
end

it "Whitespace example #4" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is! I found it.", language: "en")
  expect(ps.segment).to eq(["There it is! ", " I found it."])
end
I think we can rule out example 4 above. Example 1 is the current behavior. You are proposing Example 2 or Example 3 if I am understanding you correctly.
This is Wikipedia's definition: "Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation is the problem in natural language processing of deciding where sentences begin and end". In my opinion, sentences don't begin or end with whitespace.
If I pass in the entire string above, I get the following sentences:
AAAAAAA'A aaa Aaaaaaaaa, Aa Aaaa Aaaaaa, aaaaaaaaa aaaaaa aa aaaa aaa aaaaaaa aaaa aaaa a aaaaaaa aa aaaaaa aaa aaaaaaa aaaaaaa aaaaa aaaaaa Aaaaaaaaaa, aaaaaaaaaaaaa aaaaaaa aaa, aaaa aaaa aa Aaaaaa.
Aaaaaaaaaa aa Aaaaaaa Aaaaaa Aaaa, aaaaaa aaaaa aaaaaa aaaaaaaa aa aaa Aaaaaaaaa aaaaaaa.
Aaaaa 0,000 aaaaaaa aa aaa Aaaaaa aaaaaaa aa aaaaaaaa aaaaaaa aaa aaa Aaaa aaaaa aaaa aaaaa aaaaaaa Aa Aaaaaa aa aaaaa aa Aaaaaa.
`Aaaaaa aaa aaaaaaaa aa," aaa aaaaaa aaaa. Aaaaaaa aaaaaaa aaaa Aaaaaaa'a Aaaaaaaaaa aaa aaaa aaa aaaa aa Aaaaaaaaa Aaaaaaa Aaaaaaaaaa. Aaa Aaaa aaaaaaaa aaa Aaaa aaaaa aaaa aaa aaaaaaa aa aaaaaa a aaaaa, aaaaa aaaaaaaaa aaa aaa aaaaaaaaaa aa Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa, a aaa-Aaaaaa Aaaaaaaa Aaaaaaaaa, aaa aaaaaaaa aa aaa aaaaaaa aa Aaaaa aa aaaAaaaaa-aaaaaaaaaa aaaaa aa aaaaaaa aaaa aaaaa aaaaaaa a aaaaaaa, aaa aa aa aaaaaaaaa aaaa aaaa. Aaaaaaaa aaaa Aa Aaaaaa aaaaa aaa aaaa aa aaaaaaa aaaaaaa aaaaaaa aaa aaaa aaaa aaaaaaaa aaaa aaaaa aaa 00 aaaaa. Aa Aaaaaa aaa aaaa aaaaaaa aaaaa aaa aaaaaaaaaaa aa aaa aaaaaaa, aaa aa aaaaaaaa aa aaaaaaa Aa Aaaaa aa-Aaaa, Aaaaa Aaaaaaaa aa a aaaaa aaaaaaaaaaaaaa aa Aaa Aaaa. Aaa aaaaa aaaaaaaaa aa Aaaa Aaaaaa aaaaaaaaa a aaaaa aa aaaaaaaa aa aaa Aaaaaaaaa aaaa. Aaaaa aaaa aaa aaaaaaaa aaa aaaaaa Aaa Aaaa'a aaaaaaaa aa aaaaa aaa aaaaaaaaaaaa aaaaaaaa aaa Aaaaaaaaaa aaaa Aa Aaaaaa. Aaaaaaa'a aaaaaaaaaaaa aaaaaa aa Aa'aaaa aa aaaa aa Aaa Aaaa, aa Aa Aaaaaa aaa aaaaaaaa aa aaaa aaa aaaaaaaaaa aa Aaaaaa-aaaa Aaaa Aaaaaa. Aa Aaaaaa aaa aaaaaa aa aaa Aaaaaa aaaaa, aa-Aaaaaaa, aa aaaaaa aa aaaaa aaaa aa aaaaaaa aaaa aaaa Aaaaa. `Aaaaaaaa Aaaaaaa aaa Aaaaa aaa aaa aaaaaaaaa aaaa aaa aaa aaaaaa aa aaa aaaaaa[aaaaaa] aaaaa aaa aaaaaaaaaaa aa aaaaaaaaaaa."
And a few more. The fourth sentence should actually have been split into several sentences, but I'm leaving that aside for now. The unexpected whitespace modification occurs within the fourth sentence, where the number of spaces is changed in some places. One of them is after the period here:
Aaaaa'a 00,000 aaaaaa aaaa Aaaaaaa. Aa Aaaaaa,
But there are other places too. I would have expected that if the library returns a string as a sentence, the text within it would be unmodified.
Could you create a test case for the above? Including the:
- Original string
- Array of the segmented strings you are expecting
If possible, it would be nice if we could isolate each issue to its own test case, with the example sentence as short as possible to still reveal the issue.
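One way to narrow a long input down to short test cases is to flag which returned segments no longer occur verbatim in the original string. The following is only a sketch: `modified_segments` is a hypothetical helper, not part of pragmatic_segmenter, and it assumes the segments come back in document order.

```ruby
# Hypothetical helper (not part of pragmatic_segmenter): return the segments
# that cannot be found verbatim in the original text. The search moves left
# to right, so each segment is looked up after the previous match.
def modified_segments(original, segments)
  cursor = 0
  segments.reject do |segment|
    found = original.index(segment, cursor)
    cursor = found + segment.length if found
    found # truthy => segment is unmodified, so drop it from the result
  end
end

original = "Foo bar.   Baz  qux. Quux."
segments = ["Foo bar.", "Baz qux.", "Quux."]
p modified_segments(original, segments)
```

In practice `segments` would be the array returned by `PragmaticSegmenter::Segmenter.new(text: original, language: "en", clean: false).segment`; each flagged segment is a candidate for its own minimal test case.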
I'm going to integrate the MASC into my tests; it has a more permissive license than the BNC, so I can include test cases from it. I suspect part of the issue may actually be the length of the string, but I will try to narrow it down.
Does the whitespace belong to the preceding segment or the segment following the whitespace? What you describe as whitespace modification, I would argue is possibly not the responsibility of a library like this... definition: "[...] where sentences begin and end". In my opinion, sentences don't begin or end with whitespace.
I would agree with that. In fact in my use case what I actually need to know is where each sentence begins and ends. So, to answer:
What is your expected behavior?
I'd say, ideally:
it "Whitespace example #1" do
  ps = PragmaticSegmenter::Segmenter.new(text: "There it is! I found it.", language: "en")
  expect(ps.segment).to eq([
    { begin: 0, end: 12, text: "There it is!" },
    { begin: 26, end: 36, text: "I found it." }
  ])
end
(Of course, given begin: and end:, the text: field is redundant, but it would still be handy when inspecting the results.)
In my use case I reconstruct the begin and end values by marching through the results one sentence at a time while stepping through the input, searching for each subsequent sentence after some amount of whitespace following the previous one. Doable, but it would be nice if I didn't need to.
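The marching approach described above could look roughly like the sketch below. `align_segments` is a hypothetical helper, not part of pragmatic_segmenter. It assumes the segmenter only changes whitespace (never other characters), treats `end` as an exclusive offset, and returns the original, unmodified text for each range.

```ruby
# Hypothetical helper (not part of pragmatic_segmenter): map each returned
# segment to a [begin, end) character range in the original text, ignoring
# any whitespace differences between the segment and the original.
def align_segments(original, segments)
  cursor = 0
  skip_ws = lambda do
    cursor += 1 while cursor < original.length && original[cursor].match?(/\s/)
  end

  segments.map do |segment|
    skip_ws.call # whitespace between sentences belongs to neither segment
    start = cursor
    segment.each_char do |ch|
      next if ch.match?(/\s/) # whitespace inside the segment may have changed
      skip_ws.call
      unless original[cursor] == ch
        raise ArgumentError, "segment does not match original at offset #{cursor}"
      end
      cursor += 1
    end
    { begin: start, end: cursor, text: original[start...cursor] }
  end
end

original = "There it is!   I found  it."
p align_segments(original, ["There it is!", "I found it."])
```

Because only non-whitespace characters are compared, this still works when runs of spaces inside a sentence have been collapsed or removed, which a plain substring search would miss.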
Here is a simple example:
t = "Its registration number is 322 284 696 R.C.S. Paris, and its telephone number is [ ]."
ps = PragmaticSegmenter::Segmenter.new(text: t, language: 'en', clean: false)
=> #<PragmaticSegmenter::Segmenter:0x00007fae9be30660
@doc_type=nil,
@language="en",
@language_module=PragmaticSegmenter::Languages::English,
@text="Its registration number is 322 284 696 R.C.S. Paris, and its telephone number is [ ].">
ps = PragmaticSegmenter::Segmenter.new(text: t, language: 'en', clean: false).segment
=> ["Its registration number is 322 284 696 R.C.S. Paris, and its telephone number is [ ]."]
Notice that the extra spaces before "and" and inside the brackets "[ ]" are removed.
I ended up commenting out @language::ExtraWhiteSpaceRule in my fork to resolve this:
https://github.com/echan00/pragmatic_segmenter/commit/e5e4244bacd0bd12e65b560b648d331980fc1ce4