CoreNLP
Quote attributor trouble with nicknames and names themselves in quotes
** Thank you for building the CoreNLP system **
Hi Folks,
We're using CoreNLP 4.2.2 heavily for some news quote analysis and audit work at Santa Clara University. We're also using the kernel, with a substantial layer of customization (article processing), to build a DEI audit toolkit for newsrooms.
We'd like to help with, or get help with, improving the quote annotator's accuracy. Our use case is to see how often "homeless" people were quoted in a sample set of news articles. We customized the title rules file to include "homeless" as a title, but unexpectedly ran into a quote extraction and attribution issue in the process.
See, for example, the case below.
What does not work
~~ “Last night. Some gun violence. Senseless killing,” said a homeless man who goes by the name “Shorty." ~~ [Journalists sometimes write like this -- because that's the reality they encounter on the streets of San Francisco.]
This quote does not parse for the CoreNLP system at all! The error seems to be linked to the name "Shorty" itself coming up in a way that is unusual/uncommon for nicknamed people being quoted -- NLP does not understand it. It misunderstands "Shorty" as a quote. (I have the JSON object for this.)
Secondly, when I remove the quotes around "Shorty" and test, CoreNLP still seems to think Shorty is not a legitimate name. It does not resolve the speaker. To debug I tried different variations.
What works
The following rewritten quotes work, in the sense that CoreNLP picks up "homeless" as a title.
“Last night. Some gun violence. Senseless killing,” said Subbu Shorty, a homeless man.
“Last night. Some gun violence. Senseless killing,” said Shorty Subbu, a homeless man.
“Last night. Some gun violence. Senseless killing,” said Subbu, a homeless man.
Original news article we parsed the full text for: https://sanfrancisco.cbslocal.com/2016/12/19/homeless-man-slain-in-san-francisco-shooting-remembered/
About our work: [We've built a DEI Annotation API service, a WordPress plugin to use it, a source diversity monitor web app, etc., all with the CoreNLP kernel underlying.] https://www.scu.edu/ethics/focus-areas/journalism-and-media-ethics/resources/journalism-source-diversity-dashboard-and-monitor/
I'm happy to provide more context. There are a ton more language examples from my work. Anything that helps improve this system is going to help journalists audit their work on demand with far less stress than is currently the case.
This issue may be related to issue #1090
Some of the issues here go a bit deeper than this. For example:
“Last night. Some gun violence. Senseless killing,” said a homeless man who goes by the name “Shorty."
vs
“Last night. Some gun violence. Senseless killing,” said a homeless man who goes by the name "Shorty."
The closing quote for Shorty gets attached to the previous sentence in the case of curly quotes, but not in straight quotes. The reason is that the logic in WordToSentenceProcessor::plausibleToAdd doesn't account for quotes with multiple sentences in them. But then it also doesn't count curly quotes at all, which means the miscounts cancel out.
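To make the difference between the two inputs concrete, here is a small standalone sketch that tallies the quote characters in each variant. It is not the CoreNLP logic: the class `QuoteTally` and its methods are hypothetical names made up for illustration, and whether plausibleToAdd computes exactly this kind of count is an assumption. It only shows that the two sentences differ in how many straight quotes they contain.

```java
// Hypothetical helper for illustration only; not part of CoreNLP.
public class QuoteTally {
    // Count straight double quotes (U+0022).
    static int countStraight(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') n++;
        }
        return n;
    }

    // Count curly double quotes (U+201C opening, U+201D closing).
    static int countCurly(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u201C' || c == '\u201D') n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String body = "\u201CLast night. Some gun violence. Senseless killing,\u201D"
                + " said a homeless man who goes by the name ";
        // First variant: curly opening quote before Shorty, straight closing quote.
        String curlyOpen = body + "\u201CShorty.\"";
        // Second variant: straight quotes on both sides of Shorty.
        String straightBoth = body + "\"Shorty.\"";

        System.out.println("variant 1: straight=" + countStraight(curlyOpen)
                + " curly=" + countCurly(curlyOpen));       // straight=1 curly=3
        System.out.println("variant 2: straight=" + countStraight(straightBoth)
                + " curly=" + countCurly(straightBoth));    // straight=2 curly=2
    }
}
```

A counter that ignores curly quotes sees an odd number of quotes in the first variant and an even number in the second, which may be a useful thing to check when building further test sentences.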
Thanks for this comment @AngledLuffa. I did not realize the curly-quotes bit until you pointed it out. It's useful and it gives us an idea to test. I'll be back with more after reviewing with my group.