sentences
double-newlines should always start new sentence?
I noticed this in the context of cited quotations like:
I think there's a bug here. — me
And then another paragraph.
I think that should be 3 "sentences". The double-newline might be a reliable clue: continuing a sentence from one paragraph to the next is at least uncommon, if not disallowed, right? (One possible exception: you might want to keep them together if one paragraph ends with an ellipsis and the next starts with an ellipsis.) Another way would be to recognize this cited-quotation form, but I guess that could be risky.
diff --git a/sentences_test.go b/sentences_test.go
index e506188..d178f09 100644
--- a/sentences_test.go
+++ b/sentences_test.go
@@ -174,6 +174,19 @@ func TestSpacedPeriod(t *testing.T) {
 	compareSentence(t, actualText, expected)
 }
+func TestQuotationSourceAndDoubleNewlines(t *testing.T) {
+	t.Log("Tokenizer should treat double-newline as end of sentence regardless of ending punctuation")
+
+	actualText := "'A witty saying proves nothing.' — Voltaire\n\nAnd yet it commands attention."
+	expected := []string{
+		"'A witty saying proves nothing.'",
+		" — Voltaire",
+		"And yet it commands attention.",
+	}
+
+	compareSentence(t, actualText, expected)
+}
+
I was poking around; I see you have token.ParaStart being set sometimes when a double-newline is detected, but treating ParaStart the same as SentBreak in Tokenize() didn't fix it.
Greetings! Would a quotation by-line be considered a separate sentence or part of the quoted sentence? It's unclear what the rules are for this one. I'll happily accept any proposals as well as PRs to address this issue.
I don't know. But I suppose a quote would be likely to end with a period, and a quote could also consist of more than one sentence; so maybe the byline should be separate?