CoreNLP
Quotation Attribution - quotation extraction improvement: properly extract quotes that span paragraphs
A common occurrence in English text around the world, especially in news articles, is the use of the following convention (sourced from Wikipedia):
The convention in English is to give opening quotation marks to the first and each subsequent paragraph, using closing quotation marks only for the final paragraph of the quotation.
Here is just one example extracted from here:
“What we are watching is with what speed can we control the variants,” Dubé said. “We have seen the conventional cases drop but the variants increase.
“Until now, we have been able to stay within the 800 mark (of daily new cases, including variants), but it’s something we’re following.
“I ask you to follow the rules, because when we see the risks associated with March break and the risks associated with variants, I would not want to have to back up in the coming weeks. I think it’s important that we come out of March break the right way.”
Note that the second quotation begins part way through the first paragraph and does not end until the end of the third paragraph. Therefore, the second quotation in the above text is:
We have seen the conventional cases drop but the variants increase.
Until now, we have been able to stay within the 800 mark (of daily new cases, including variants), but it’s something we’re following.
I ask you to follow the rules, because when we see the risks associated with March break and the risks associated with variants, I would not want to have to back up in the coming weeks. I think it’s important that we come out of March break the right way.
I have also seen examples where the paragraph-spanning quote closes part way through its last paragraph rather than only at the end.
Ideally, CoreNLP would extract the entire unbroken quotation as shown above. Currently, CoreNLP does not extract the part of the quotation that starts in the first paragraph, nor does it recognize the second paragraph as part of a quote or even as a quote on its own. These spans are not even recorded in UnclosedQuotationsAnnotation when quote.extractUnclosedQuotes is set to true. The final paragraph, which has both an opening and a closing quotation mark, is successfully extracted as a quote, but only on its own. The earlier parts of the quotation are not annotated as quotes, or as parts of quotes, at all.
Therefore, to sum up: when annotating quotations (at least those in the English language), CoreNLP should recognize paragraph-spanning quotations as such. These are quotations in which a paragraph ends without a closing quotation mark and the next paragraph begins with an opening quotation mark, and so on until a closing quotation mark is found.
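
To make the requested behaviour concrete, here is a minimal standalone sketch of the merging rule in Python. It is not CoreNLP's QuoteAnnotator and does not use any CoreNLP API; the names `extract_quotes`, `OPEN`, and `CLOSE` are hypothetical. It assumes curly double quotation marks, paragraphs separated by newline characters, and no special handling of nested or single quotes.

```python
# Illustrative only: a simplified paragraph-spanning quote extractor.
# Assumptions: curly double quotes, paragraphs separated by "\n".
OPEN, CLOSE = "\u201c", "\u201d"


def extract_quotes(text):
    """Return the quotations in text, merging paragraph-spanning ones."""
    quotes = []
    buf = None            # characters of the open quote, or None outside a quote
    at_para_start = True  # True until a paragraph's first non-space character
    for ch in text:
        if ch == "\n":
            if buf is not None:
                buf.append(ch)  # keep the paragraph break inside the quote
            at_para_start = True
            continue
        if buf is not None and at_para_start and not ch.isspace() and ch != OPEN:
            # The new paragraph does not reopen the quote, so the open
            # quote was genuinely unclosed: abandon it.
            buf = None
        if ch == OPEN:
            if buf is None:
                buf = []        # a new quotation starts here
            elif not at_para_start:
                buf.append(ch)  # stray opening mark mid-paragraph: keep it
            # Otherwise this is a continuation mark at a paragraph start,
            # which is dropped so the quotation stays unbroken.
        elif ch == CLOSE and buf is not None:
            quotes.append("".join(buf).strip())
            buf = None
        elif buf is not None:
            buf.append(ch)
        if not ch.isspace():
            at_para_start = False
    return quotes
```

Run on the three example paragraphs above, with each paragraph on its own line, this sketch should return two quotations: the short one before "Dubé said" and the single three-paragraph quotation, with the continuation marks removed. A quote that closes part way through its last paragraph is handled by the same rule, since the closing mark ends the quotation wherever it appears.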