textacy
Improve quotation detection by parsing quotation mark types
Problem:
The current code for detecting quotes is fairly unsophisticated. It sequentially pairs every token that token.is_quote flags as a quotation mark and assumes the resulting indexes are the quote boundaries. If there is an odd number of quotation marks, it raises an error.
Solution:
I've been doing quote detection on some unreliably formatted text lately, which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround. I updated the quote detection functionality to return only quotes whose starting and ending code points match a set of predetermined pairs.
For example:
Bill told me I "shouldn‘t wear those pants" but I will.
In the current version, running quote detection here would raise an error because there are three quotation-mark-like tokens in the sentence (the "‘" standing in for an apostrophe counts as one). Even if it didn't, it would return "shouldn" as a quote because textacy assumes sequential quotation marks are quote boundaries.
My version takes the first quotation mark (q) and iterates through all the later quotation marks until it finds one (q_) where (ord(q.text), ord(q_.text)) is in the list of acceptable pairs.
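
The matching step can be sketched in plain Python, roughly as follows. This is only an illustration of the idea, not the code in this PR: it works on characters rather than spaCy tokens, the names QUOTE_PAIRS and find_quote_spans are hypothetical, and the pair list is an assumed example rather than the actual set. It also assumes that an opening mark with no acceptable partner is simply skipped, and that the search resumes after a matched closing mark.

```python
# Hypothetical set of acceptable (opening, closing) code point pairs.
# The real list may differ; these are assumptions for illustration.
QUOTE_PAIRS = {
    (0x22, 0x22),      # " ... "
    (0x201C, 0x201D),  # “ ... ”
    (0x2018, 0x2019),  # ‘ ... ’
    (0xAB, 0xBB),      # « ... »
}
# Every character that appears on either side of an acceptable pair.
QUOTE_CHARS = {chr(cp) for pair in QUOTE_PAIRS for cp in pair}


def find_quote_spans(text):
    """Return (start, end) character indexes of quotes with matching marks."""
    # Positions of all quotation-mark-like characters, in order.
    marks = [i for i, ch in enumerate(text) if ch in QUOTE_CHARS]
    spans = []
    pos = 0
    while pos < len(marks):
        start = marks[pos]
        match = None
        # Scan the later marks for the first one that forms an acceptable pair.
        for later in range(pos + 1, len(marks)):
            end = marks[later]
            if (ord(text[start]), ord(text[end])) in QUOTE_PAIRS:
                match = later
                break
        if match is None:
            # Assumption: a stray mark with no partner is skipped, not an error.
            pos += 1
        else:
            spans.append((start, marks[match]))
            # Assumption: resume searching after the closing mark.
            pos = match + 1
    return spans


text = 'Bill told me I "shouldn‘t wear those pants" but I will.'
quotes = [text[s:e + 1] for s, e in find_quote_spans(text)]
bullet = '» item one “quoted” text'
bullet_quotes = [bullet[s:e + 1] for s, e in find_quote_spans(bullet)]
```

With the example sentence, the stray "‘" is passed over because (ord('"'), ord('‘')) is not an acceptable pair, so the whole quoted phrase is returned as one quote. A "»" used as a bullet point is likewise ignored because nothing after it closes it.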