CoreNLP
Quotation attribution: gatherQuotes returns a set of quotes that have unexpected index numbers
OS: Linux Mint 20.1 Ulyssa (base: Ubuntu 20.04 focal)
Java: openjdk version "11.0.9.1"
CoreNLP: 4.2.0 (also 4.1.0, and the dev branch with commit 040b846a428a34373e4854bfee138c70f5d50a1d as the HEAD)
Command line:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,coref,quote -file bug1input.txt -outputFormat text
Given the attached file bug1input.txt as input, the command produces the attached output file bug1output.txt.
Note the Extracted Quotes section near the bottom of the output file.
Expected: The index numbers of the quotes should begin at 0 and form a continuous range of integers without duplicates.
Actual: Index number 7 does not appear and index number 8 is used twice. Both entries with index 8 are in fact the same quote, listed two times.
Extracted quotes:
...
Unknown: “I don't trust pharmaceuticals. I really don't. And it doesn't sound like it's going to be safe,” [index=8, charOffsetBegin=2370]
Unknown: “I don't trust pharmaceuticals. I really don't. And it doesn't sound like it's going to be safe,” [index=8, charOffsetBegin=2370]
...
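The expected property can be written down as a small check. This is only a sketch: QuoteIndexCheck and findProblems are hypothetical names, not part of CoreNLP, and the index values are assumed to have been collected already from the returned quotes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of CoreNLP.
class QuoteIndexCheck {

    // Reports duplicate indices and gaps in the range 0..max(index).
    // An empty result means the indices are unique and contiguous from 0.
    static List<String> findProblems(List<Integer> indices) {
        List<String> problems = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        int max = -1;
        for (int index : indices) {
            if (!seen.add(index)) {
                problems.add("duplicate index " + index);
            }
            max = Math.max(max, index);
        }
        for (int i = 0; i <= max; i++) {
            if (!seen.contains(i)) {
                problems.add("missing index " + i);
            }
        }
        return problems;
    }
}
```

For the indices seen in this report (0 through 6, then 8 twice), the check reports one duplicate and one gap.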
I discovered this problem while storing attributed quotes after calling gatherQuotes(). My assumption is that the quote index is unique to each quote within a given text and that each quote index should appear only once in the returned set. Is this assumption correct, or have I misunderstood? I've seen this happen in other instances as well, so it's not unique to this body of text.
The command line execution also generates the following warning:
WARNING: unmatched quote of type " found at index 1905 in text segment: You’re going to need to get quite large proportions of the population vaccinated before you see a real effect."
About 33.8 million Americans, or 10% o...
[edit: additional details follow]
Some observations about the input text
The input text does not actually contain any embedded quotes. However, it uses a mix of simple undirected quotes (") and directed quotes (“”). In particular, there is one quote that begins with a directed opening quote but ends with an undirected quote:
“You’re going to need to get quite large proportions of the population vaccinated before you see a real effect."
... and one that begins with an undirected quote but ends with a directed closing quote:
"I feel like I have plenty of time before I get a chance to get (the vaccine) anyway, to find out if there are bad side effects and whether it’s even worth getting it,”
CoreNLP does not recognize this error (expects too much of mere humans :-) ), and so sees large areas of the text as a series of embedded quotes that are not actual quotes.
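Input with this kind of mismatch can be spotted before it reaches the pipeline. The following is a minimal sketch, assuming the text contains no nested quotations (as is the case for this input). MixedQuoteFinder and its method are hypothetical names, not CoreNLP API, and apostrophes and single quotes are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-check for illustration; not part of CoreNLP.
class MixedQuoteFinder {
    static final char OPEN_DIRECTED = '\u201C';  // left double quotation mark
    static final char CLOSE_DIRECTED = '\u201D'; // right double quotation mark
    static final char UNDIRECTED = '"';

    // Returns the start offsets of quotes whose opening and closing marks
    // are of different kinds (directed vs. undirected).
    // Assumes quotes are not nested.
    static List<Integer> findMismatchedQuoteStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        char open = 0;   // 0 means "not inside a quote"
        int openAt = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (open == 0) {
                if (c == OPEN_DIRECTED || c == UNDIRECTED) {
                    open = c;
                    openAt = i;
                }
            } else if (c == CLOSE_DIRECTED || c == UNDIRECTED) {
                boolean directedOpen = (open == OPEN_DIRECTED);
                boolean directedClose = (c == CLOSE_DIRECTED);
                if (directedOpen != directedClose) {
                    starts.add(openAt);
                }
                open = 0;
            }
        }
        return starts;
    }
}
```

Each reported offset points at the opening mark of a quote that could then be repaired or flagged by hand.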
I have dug into this a little deeper and there appear to be two different issues at play. One has to do with the private method QuoteAnnotator.setQuoteIndices, which I will attempt to detail in a separate comment, although it's possible that a solution there is all that is needed. The second problem can be worked around, though not necessarily at its root cause, within the QuoteAnnotator.gatherQuotes method, which I discuss here.
Looking at the above-mentioned bug1output.txt file after annotation, Quotes 6 and 8 are embedded within Quote 1, and Quote 8 is also embedded within Quote 6. Quote 8 therefore appears at both depth levels 2 and 3. QuoteAnnotator.gatherQuotes returns all quotes as a flat collection, so Quote 8 appears twice in that collection.
Fortunately, Quote 8 is exactly the same object in both places and so a potential solution is to filter out duplicates from the returned collection using simple object identity comparison.
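A minimal sketch of such an identity-based filter follows. IdentityDedup and distinctByIdentity are hypothetical names, not CoreNLP code, and the sketch assumes the result of gatherQuotes can be treated as a list of quote objects. It uses an identity-backed set from the standard library so that two distinct quotes that merely compare equal are both kept.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of CoreNLP.
class IdentityDedup {

    // Keeps the first occurrence of each object, comparing by reference (==)
    // rather than equals(), and preserves the original order.
    static <T> List<T> distinctByIdentity(List<T> items) {
        Set<T> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<T> result = new ArrayList<>();
        for (T item : items) {
            if (seen.add(item)) {
                result.add(item);
            }
        }
        return result;
    }
}
```

Applied to the flat collection, this would drop the second occurrence of Quote 8 without touching its index, so the gap at index 7 would remain.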
A separate question, however, is whether Quote 8 should appear both as a sibling of Quote 6 and as a child of Quote 6 (embedded in Quote 6). If this is an unexpected result, then the problem arises in the annotation phase, and fixing it here would just mask the true problem. If it's normal, then I propose implementing the duplicate filter.
After a bit more experimentation, I discovered that setting quote.asciiQuotes to true solves this problem for my purposes, at least for input text such as the one presented. Embedded quotations are not the norm in the text I am processing and so I am anticipating few side effects from that setting. There may be a robustness problem in the quotation annotator worth addressing nevertheless, but I will leave that for the CoreNLP maintainers to decide.
I will note quickly that as I was stepping through QuoteAnnotator.setQuoteIndices to get a sense of the original problem, I observed that what was ultimately identified as Quote 8 starts out with index = 7, but then the index of that same object is later overwritten to become 8 on the next pass through the loop. I suspect this is a consequence of that quote being both a sibling and child of Quote 6.
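The overwrite can be reproduced with a toy model. This is not CoreNLP's actual implementation: ToyQuote and ToyIndexing are invented for illustration, and the sketch simply assumes that indices are handed out in one pass over a flattened list in which the shared quote occurs twice.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy stand-in for a quote object; not a CoreNLP class.
class ToyQuote {
    final String label;
    int index = -1;

    ToyQuote(String label) {
        this.label = label;
    }
}

// Toy index assignment; an assumption about the mechanism, not the library's code.
class ToyIndexing {

    // Every occurrence consumes a fresh index, so an object that occurs
    // twice has its first index overwritten, leaving a gap in the numbering.
    static void assignNaive(List<ToyQuote> flattened) {
        int next = 0;
        for (ToyQuote q : flattened) {
            q.index = next++;
        }
    }

    // Each distinct object (by reference) consumes exactly one index.
    static void assignOncePerObject(List<ToyQuote> flattened) {
        Set<ToyQuote> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        int next = 0;
        for (ToyQuote q : flattened) {
            if (seen.add(q)) {
                q.index = next++;
            }
        }
    }
}
```

In this model, a quote that is reachable along two paths gets the later of its two indices under the naive scheme, which matches the 7-then-8 behaviour observed in the debugger. Whether the real cause is the same is for the maintainers to confirm.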