TextBlob
add ability to parse relationships
TextBlob is wonderful but it doesn't seem to provide relationship extraction.
One way to handle this might be to have TextBlob's parser call pattern.en.parse() with the relations argument set to True.
E.g., pattern.en.parse() has the following options (http://www.clips.ua.ac.be/pages/pattern-en#parser):
parse(string,
      tokenize = True,     # Split punctuation marks from words?
      tags = True,         # Parse part-of-speech tags? (NN, JJ, ...)
      chunks = True,       # Parse chunks? (NP, VP, PNP, ...)
      relations = False,   # Parse chunk relations? (-SBJ, -OBJ, ...)
      lemmata = False,     # Parse lemmata? (ate => eat)
      encoding = 'utf-8',  # Input string encoding.
      tagset = None)       # Penn Treebank II (default) or UNIVERSAL.
If I'm not mistaken, if we could somehow pass relations=True, it would effectively add a new feature to TextBlob.
FWIW, although TextBlob's Parser.parse ostensibly accepts the full set of parameters:
from textblob.en import Parser
print Parser().parse.__doc__
Takes a string (sentences) and returns a tagged Unicode string (TaggedString).
Sentences in the output are separated by newlines.
With tokenize=True, punctuation is split from words and sentences are separated by \n.
With tags=True, part-of-speech tags are parsed (NN, VB, IN, ...).
With chunks=True, phrase chunk tags are parsed (NP, VP, PP, PNP, ...).
With relations=True, semantic role labels are parsed (SBJ, OBJ).
With lemmata=True, word lemmata are parsed.
Optional parameters are passed to
the tokenizer, tagger, chunker, labeler and lemmatizer.
In practice relations= is not implemented:
Parser().parse("The cat sat on the mat.", relations=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/wjt/.virtualenvs/fewerror/local/lib/python2.7/site-packages/textblob/_text.py", line 1248, in parse
s[i] = self.find_labels(s[i], **kwargs)
File "/home/wjt/.virtualenvs/fewerror/local/lib/python2.7/site-packages/textblob/_text.py", line 1208, in find_labels
return find_relations(tokens)
NameError: global name 'find_relations' is not defined
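The traceback is consistent with a function body that refers to a global name which was never defined or imported. Python only looks up global names when the line actually runs, so the module imports cleanly and the error shows up only on the relations=True path. A minimal sketch of that failure mode, using hypothetical stand-in functions (`find_labels` and `try_labels` here are illustrations, not TextBlob's actual source):

```python
def find_labels(tokens, relations=False):
    """Hypothetical stand-in for a labeler method; not TextBlob's code."""
    if relations:
        # `find_relations` is never defined or imported in this module, so
        # this line raises NameError, but only when it is actually executed.
        return find_relations(tokens)
    return tokens


def try_labels(tokens, **kwargs):
    """Return ('ok', result) or ('error', exception class name)."""
    try:
        return ("ok", find_labels(tokens, **kwargs))
    except NameError as exc:
        return ("error", type(exc).__name__)


tokens = ["The", "cat", "sat"]
default_result = try_labels(tokens)                    # default path works
relations_result = try_labels(tokens, relations=True)  # fails at call time
print(default_result)
print(relations_result)
```

This would also explain why the docstring can advertise relations= while the default call path keeps working.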
I see. Would y'all be interested in having that implementation contributed? If so, are there relevant parser tests to add to as well?
I worked on a similar issue when creating the German language extension for TextBlob. Unfortunately, I do not have the time to implement this for the English version right now. @jburb if you're interested in contributing, you could have a look at http://github.com/markuskiller/textblob-de.
I separated the pattern and textblob code base completely, making it compatible with the original pattern library on Python2 and using a vendorised (only minimally adapted) version on Python3 (see commits between 27 July and 4 Aug 2014):
https://github.com/markuskiller/textblob-de/commits/dev?page=4
https://github.com/markuskiller/textblob-de/commits/dev?page=3
This solution is not very efficient, as it creates a separate set of Tree(), Sentence(), Word(), ... objects, but it was the only way I managed to retain the full range of pattern's parser options.
I have created something of a kludgy workaround using the pattern package:
jeffs@jeff-desktop:~/skyset/NLP$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.version
2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609]
>>> import pattern
>>> from pattern.text import find_relations
>>> print(pattern.text.parse("I ate my pizza with a fork", relations=True))
I/PRP/B-NP/O/NP-SBJ-1 ate/VBD/B-VP/O/VP-1 my/PRP$/B-NP/O/NP-OBJ-1 pizza/NN/I-NP/O/NP-OBJ-1 with/IN/B-PP/B-PNP/O a/DT/B-NP/I-PNP/O fork/NN/I-NP/I-PNP/O
pattern does not work with python3 :-(.
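Once you have a tagged string like the one above, pulling the relations out takes only string handling, with no pattern import, so this part runs on Python 3. A rough sketch, assuming every token has exactly five slash-separated fields (word/POS/chunk/PNP/relation, as in that output) and that "O" means "no relation". The `group_by_relation` helper is made up for illustration and is not part of pattern or TextBlob:

```python
from collections import OrderedDict

# Tagged string copied from the pattern session in this thread.
TAGGED = ("I/PRP/B-NP/O/NP-SBJ-1 ate/VBD/B-VP/O/VP-1 "
          "my/PRP$/B-NP/O/NP-OBJ-1 pizza/NN/I-NP/O/NP-OBJ-1 "
          "with/IN/B-PP/B-PNP/O a/DT/B-NP/I-PNP/O fork/NN/I-NP/I-PNP/O")


def group_by_relation(tagged):
    """Group words by the last slash-separated field of each token.

    Assumption: five fields per token (word/POS/chunk/PNP/relation),
    with "O" marking tokens that carry no relation label.
    """
    groups = OrderedDict()
    for token in tagged.split():
        # Split from the right so only the last four slashes are separators.
        fields = token.rsplit("/", 4)
        if len(fields) != 5:
            continue  # skip tokens that do not match the assumed layout
        word, relation = fields[0], fields[4]
        if relation != "O":
            groups.setdefault(relation, []).append(word)
    return groups


relations = group_by_relation(TAGGED)
for label, words in relations.items():
    print(label, " ".join(words))
```

For the sentence above this groups "I" under NP-SBJ-1, "ate" under VP-1, and "my pizza" under NP-OBJ-1. The prepositional phrase "with a fork" is dropped because its relation field is "O".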