double-newlines should always start new sentence?

Open ec1oud opened this issue 2 years ago • 2 comments

I noticed this in the context of cited quotations like:

I think there's a bug here.  — me

And then another paragraph.

I think that should be 3 "sentences". The double-newline might be a reliable clue: continuing a sentence from one paragraph to the next is at least uncommon, if not disallowed, right? (One possible exception: you might want to keep them together when one paragraph ends with an ellipsis and the next starts with one.) Another way would be to recognize this cited-quotation form, but I guess that could be risky.

diff --git a/sentences_test.go b/sentences_test.go
index e506188..d178f09 100644
--- a/sentences_test.go
+++ b/sentences_test.go
@@ -174,6 +174,19 @@ func TestSpacedPeriod(t *testing.T) {
        compareSentence(t, actualText, expected)
 }
 
+func TestQuotationSourceAndDoubleNewlines(t *testing.T) {
+       t.Log("Tokenizer should treat double-newline as end of sentence regardless of ending punctuation")
+
+       actualText := "'A witty saying proves nothing.' — Voltaire\n\nAnd yet it commands attention."
+       expected := []string{
+               "'A witty saying proves nothing.'",
+               " — Voltaire",
+               "And yet it commands attention.",
+       }
+
+       compareSentence(t, actualText, expected)
+}
+

I was poking around; I see you have token.ParaStart being set sometimes when a double-newline is detected, but treating ParaStart the same as SentBreak in Tokenize() didn't fix it.
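For illustration, here is a rough sketch of the double-newline idea as a pre-processing step outside the tokenizer: split the text on blank lines first, then tokenize each paragraph on its own, so no sentence can span a paragraph break. The names `SplitParagraphs` and `SentencesByParagraph` are made up for this example, and `tokenize` is a hypothetical stand-in for whatever sentence tokenizer is in use; none of this is part of the library.

```go
package main

import (
	"regexp"
	"strings"
)

// paragraphBreak matches a blank line: a newline, optional whitespace,
// and another newline. Runs of several blank lines count as one break.
var paragraphBreak = regexp.MustCompile(`\n\s*\n`)

// SplitParagraphs (hypothetical helper) splits text on blank lines,
// trims surrounding whitespace, and drops empty pieces.
func SplitParagraphs(text string) []string {
	var paras []string
	for _, p := range paragraphBreak.Split(text, -1) {
		if p = strings.TrimSpace(p); p != "" {
			paras = append(paras, p)
		}
	}
	return paras
}

// SentencesByParagraph (hypothetical helper) runs tokenize on each
// paragraph separately and concatenates the results, so a double-newline
// always ends a sentence regardless of the punctuation before it.
// tokenize is an assumption: any func that splits one paragraph
// into sentences.
func SentencesByParagraph(text string, tokenize func(string) []string) []string {
	var out []string
	for _, p := range SplitParagraphs(text) {
		out = append(out, tokenize(p)...)
	}
	return out
}
```

This only guarantees a break at the blank line. Whether the quote and its byline are separated within the first paragraph still depends on the tokenizer itself, and trimming each paragraph is a simplification that may not match the whitespace the existing tests expect.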

ec1oud, May 29 '22 21:05

Greetings! Would a quotation by-line be considered a separate sentence or part of the quoted sentence? It's unclear what the rules are for this one. I'll happily accept any proposals as well as PRs to address this issue.

neurosnap, Jun 02 '22 16:06

I don't know. But I suppose a quote would be likely to end with a period, and a quote could also consist of more than one sentence; so maybe the byline should be separate?

ec1oud, Jun 04 '22 20:06