add ability to parse relationships

Open jburb opened this issue 10 years ago • 4 comments

TextBlob is wonderful but it doesn't seem to provide relationship extraction.

One way to handle this might be to call or construct TextBlob's parser with pattern.en.parse()'s relations argument set to True.

For example, pattern.en.parse() has the following options (http://www.clips.ua.ac.be/pages/pattern-en#parser):

parse(string,
     tokenize = True,      # Split punctuation marks from words?
         tags = True,      # Parse part-of-speech tags? (NN, JJ, ...)
       chunks = True,      # Parse chunks? (NP, VP, PNP, ...)
    relations = False,     # Parse chunk relations? (-SBJ, -OBJ, ...)
      lemmata = False,     # Parse lemmata? (ate => eat)
     encoding = 'utf-8',   # Input string encoding.
       tagset = None)      # Penn Treebank II (default) or UNIVERSAL.

If I'm not mistaken, being able to pass relations=True through to that call would effectively add relationship extraction to TextBlob.

jburb avatar Feb 04 '15 16:02 jburb

FWIW, although TextBlob's Parser.parse ostensibly accepts the full set of parameters:

from textblob.en import Parser
print Parser().parse.__doc__
 Takes a string (sentences) and returns a tagged Unicode string (TaggedString).
            Sentences in the output are separated by newlines.
            With tokenize=True, punctuation is split from words and sentences are separated by 
.
            With tags=True, part-of-speech tags are parsed (NN, VB, IN, ...).
            With chunks=True, phrase chunk tags are parsed (NP, VP, PP, PNP, ...).
            With relations=True, semantic role labels are parsed (SBJ, OBJ).
            With lemmata=True, word lemmata are parsed.
            Optional parameters are passed to
            the tokenizer, tagger, chunker, labeler and lemmatizer.

In practice, relations=True is not implemented:

Parser().parse("The cat sat on the mat.", relations=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wjt/.virtualenvs/fewerror/local/lib/python2.7/site-packages/textblob/_text.py", line 1248, in parse
    s[i] = self.find_labels(s[i], **kwargs)
  File "/home/wjt/.virtualenvs/fewerror/local/lib/python2.7/site-packages/textblob/_text.py", line 1208, in find_labels
    return find_relations(tokens)
NameError: global name 'find_relations' is not defined
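
Until relations= is implemented, a caller could guard against this failure. The following is a minimal sketch, not part of TextBlob: parse_with_fallback and stub_parse are hypothetical names introduced here, and the stub only imitates the NameError from the traceback above so that the example runs without TextBlob installed.

```python
def parse_with_fallback(parse, text, **kwargs):
    """Try parse(text, relations=True); retry without relations on NameError.

    Returns a (result, has_relations) pair.
    """
    try:
        return parse(text, relations=True, **kwargs), True
    except NameError:
        # Same failure mode as the traceback: find_relations is undefined.
        return parse(text, **kwargs), False


def stub_parse(text, relations=False, **kwargs):
    """Hypothetical stand-in for Parser().parse that fails the same way."""
    if relations:
        raise NameError("global name 'find_relations' is not defined")
    return text.upper()


tagged, has_relations = parse_with_fallback(stub_parse, "The cat sat on the mat.")
print(tagged, has_relations)
```

With TextBlob installed, Parser().parse could be passed in place of stub_parse; on versions that show the traceback above, the second element of the pair would then be False.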

wjt avatar Apr 01 '15 06:04 wjt

I see. Would y'all be interested in having that implementation contributed? If so, are there relevant parser tests to add to as well?

jburb avatar Apr 01 '15 15:04 jburb

I worked on a similar issue when creating the German language extension for TextBlob. Unfortunately, I do not have the time to implement this for the English version right now. @jburb if you're interested in contributing, you could have a look at http://github.com/markuskiller/textblob-de.

I separated the pattern and textblob code base completely, making it compatible with the original pattern library on Python2 and using a vendorised (only minimally adapted) version on Python3 (see commits between 27 July and 4 Aug 2014): https://github.com/markuskiller/textblob-de/commits/dev?page=4 https://github.com/markuskiller/textblob-de/commits/dev?page=3

This solution is not very efficient, as it creates a separate set of Tree(), Sentence(), Word(), ... objects, but it was the only way I managed to retain the full range of pattern's parser options.

markuskiller avatar Apr 01 '15 17:04 markuskiller

I have created a somewhat kludgy workaround using the pattern package:

jeffs@jeff-desktop:~/skyset/NLP$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.version
2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609]
>>> import pattern
>>> from pattern.text import find_relations
>>> print(pattern.text.parse("I ate my pizza with a fork", relations=True))
I/PRP/B-NP/O/NP-SBJ-1 ate/VBD/B-VP/O/VP-1 my/PRP$/B-NP/O/NP-OBJ-1 pizza/NN/I-NP/O/NP-OBJ-1 with/IN/B-PP/B-PNP/O a/DT/B-NP/I-PNP/O fork/NN/I-NP/I-PNP/O

pattern does not work with Python 3 :-(
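
For anyone who wants to use that output, here is a rough sketch of pulling the subject, verb and object out of the tagged string. It uses only the standard library and assumes each token has exactly five slash-separated fields (word/POS/chunk/PNP/relation) with at most one relation per token, as in the output shown in this comment; extract_relations is a name made up for this example, not a pattern or TextBlob function.

```python
from collections import defaultdict

# Tagged output copied from the session in this comment.
TAGGED = ("I/PRP/B-NP/O/NP-SBJ-1 ate/VBD/B-VP/O/VP-1 "
          "my/PRP$/B-NP/O/NP-OBJ-1 pizza/NN/I-NP/O/NP-OBJ-1 "
          "with/IN/B-PP/B-PNP/O a/DT/B-NP/I-PNP/O fork/NN/I-NP/I-PNP/O")


def extract_relations(tagged):
    """Group words by relation id and role (SBJ, OBJ, or VP for the verb)."""
    relations = defaultdict(lambda: defaultdict(list))
    for token in tagged.split():
        # Assumption: exactly five fields per token, as in the sample above.
        word, pos, chunk, pnp, relation = token.split("/")
        if relation == "O":  # token takes part in no relation
            continue
        parts = relation.split("-")  # ["NP", "SBJ", "1"] or ["VP", "1"]
        role = parts[1] if len(parts) == 3 else parts[0]
        relations[parts[-1]][role].append(word)
    # Join the words of each role into a single phrase.
    return {rid: {role: " ".join(words) for role, words in roles.items()}
            for rid, roles in relations.items()}


result = extract_relations(TAGGED)
print(result)
```

Tokens that carry several relations at once, or words that themselves contain a slash, would need extra handling that this sketch does not attempt.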

jeffsilverm avatar Oct 24 '16 23:10 jeffsilverm