TextBlob
Preserving contractions
Is there any way to preserve contractions when using TextBlob?
For example, I'd like to do something like:
text = "I don't like it."
TextBlob(text).words
and have ['I', "don't", 'like', 'it']
instead of ['I', 'do', "n't", 'like', 'it'].
I'm not aware of any tokenisers that will do this and I don't feel anything I can hack together will be good enough.
You could resort to post-processing the tokens and joining together any contiguous ["do", "n't"] pair if nothing else can be done.
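That post-processing idea can be sketched in plain Python. Note that `join_contractions` and the clitic set below are hypothetical, not part of TextBlob; the set is an assumption covering the common splits the Treebank tokenizer makes:

```python
# Sketch of the post-processing approach: re-attach clitics that the
# tokenizer splits off ("n't", "'re", ...) to the token before them.
# The set of clitics below is an assumption, not an exhaustive list.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}

def join_contractions(tokens):
    merged = []
    for tok in tokens:
        if merged and tok in CLITICS:
            merged[-1] += tok  # glue "n't" back onto "do"
        else:
            merged.append(tok)
    return merged

print(join_contractions(['I', 'do', "n't", 'like', 'it']))
# -> ['I', "don't", 'like', 'it']
```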
TextBlob uses the word tokenizer from the nltk package. You can find that class at nltk.tokenize.treebank.TreebankWordTokenizer. The relevant regular expression is in the ENDING_QUOTES list:
ENDING_QUOTES = [
    (re.compile(r'"'), " '' "),
    (re.compile(r'(\S)(\'\')'), r'\1 \2 '),
    (re.compile(r"([^' ])('[sS]|'[mM]|'[dD]|') "), r"\1 \2 "),
    (re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), r"\1 \2 "),  # <--- this line
]
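To see that rule in isolation, here is a minimal sketch using only the standard library's re module, with the pattern copied from the list above:

```python
import re

# The ENDING_QUOTES entry that splits contractions such as "don't"
# into "do" + "n't" (pattern copied from nltk's Treebank tokenizer).
contraction_rule = re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) ")

print(contraction_rule.sub(r"\1 \2 ", "don't like "))
# -> "do n't like "
```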
I commented out that line for laughs and giggles and got this:
╰─$ python mypy.py
['i', "don't", 'like', 'it']
I ended up making a tokenizer that looks like this, and then using .tokens instead of .words:
import re

from nltk.tokenize.treebank import TreebankWordTokenizer
from textblob.utils import strip_punc


class ContractionPreservingTokenizer(TreebankWordTokenizer):
    ENDING_QUOTES = [
        (re.compile(r'"'), " '' "),
        (re.compile(r'(\S)(\'\')'), r'\1 \2 '),
    ]

    def tokenize(self, text):
        tokens = super().tokenize(text)
        return [word if word.startswith("'") else strip_punc(word, all=False)
                for word in tokens if strip_punc(word, all=False)]
It seems this is probably the best solution with the current codebase.