Preserving contractions

Open knyghty opened this issue 7 years ago • 3 comments

Is there any way to preserve contractions when using TextBlob?

For example I'd like to do something like:

text = "I don't like it."
TextBlob(text).words

and have ['I', "don't", 'like', 'it'] instead of ['I', 'do', "n't", 'like', 'it'].

I'm not aware of any tokenisers that will do this and I don't feel anything I can hack together will be good enough.

knyghty, Apr 20 '17 16:04

You could resort to post-processing the tokens and join together any contiguous ["do", "n't"] pair if nothing else can be done.
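
A minimal sketch of that post-processing idea, assuming the tokens arrive as a plain list of strings. The helper name `merge_contractions` and the suffix set are illustrative assumptions, not part of TextBlob or nltk:

```python
# Hypothetical helper: rejoin Treebank-style splits such as ["do", "n't"].
# The suffix set is an assumption; extend or trim it for your data.
CONTRACTION_SUFFIXES = {"n't", "'s", "'m", "'d", "'ll", "'re", "'ve"}


def merge_contractions(tokens):
    """Glue each contraction suffix back onto the token before it."""
    merged = []
    for token in tokens:
        if merged and token.lower() in CONTRACTION_SUFFIXES:
            merged[-1] += token
        else:
            merged.append(token)
    return merged


print(merge_contractions(["I", "do", "n't", "like", "it"]))
# ['I', "don't", 'like', 'it']
```

Note that this also rejoins possessive 's, which may or may not be what you want.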

hugomailhot, Apr 20 '17 18:04

TextBlob uses the word tokenizer from the nltk package; the class is nltk.tokenize.treebank.TreebankWordTokenizer. The regular expression that splits contractions is in its ENDING_QUOTES list.

    ENDING_QUOTES = [
        (re.compile(r'"'), " '' "),
        (re.compile(r'(\S)(\'\')'), r'\1 \2 '),

        (re.compile(r"([^' ])('[sS]|'[mM]|'[dD]|') "), r"\1 \2 "),
        # (re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), r"\1 \2 "),  # <--- this line
    ]

I commented out that line for laughs and giggles and got this:

    $ python mypy.py
    ['i', "don't", 'like', 'it']
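
To see what that one rule does in isolation, here is a small sketch that applies the same regular expression with only the standard library. The sample string and variable names are assumptions for illustration; the tokenizer itself runs many other substitutions around this one:

```python
import re

# The ENDING_QUOTES rule that was commented out above.
pattern = re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) ")

# The pattern requires a space after the contraction, so the sample
# text ends with one.
sample = "I don't like it "
split_text = pattern.sub(r"\1 \2 ", sample)

print(split_text.split())
# ['I', 'do', "n't", 'like', 'it']
```

With the rule removed, nothing inserts the space between "do" and "n't", so the contraction stays in one token.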

phamhm, Apr 22 '17 20:04

I ended up making a Tokenizer that looks like this, and then using .tokens instead of .words:

import re

from nltk.tokenize.treebank import TreebankWordTokenizer
from textblob.utils import strip_punc


class ContractionPreservingTokenizer(TreebankWordTokenizer):
    ENDING_QUOTES = [
        (re.compile(r'"'), " '' "),
        (re.compile(r'(\S)(\'\')'), r'\1 \2 '),
    ]

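    def tokenize(self, text):
        tokens = super().tokenize(text)
        # Drop tokens that are punctuation only. Keep tokens that start with
        # an apostrophe as they are; strip leading and trailing punctuation
        # from the rest.
        return [word if word.startswith("'") else strip_punc(word, all=False)
                for word in tokens if strip_punc(word, all=False)]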

It seems this is probably the best solution with the current codebase.

knyghty, Apr 30 '17 16:04