textblob-de
Strange tagging for some verb forms of "sein" and for "du"
I stumbled over some strange tagging and was wondering why it won't correctly recognize "bist" and "seid" as verb forms of "sein", though they are listed in the "de-verbs.txt" file. Also tagging the personal pronoun "du" as an adjective doesn't make much sense either.
>>> blob=TextBlob("Ich bin. Du bist. Er ist. Wir sind. Ihr seid. Sie sind.",
parser=PatternParser(pprint=True, lemmata=True))
>>> blob.parse()
WORD TAG CHUNK ROLE ID PNP LEMMA
Ich PRP NP - - - ich
bin VB VP - - - sein
. . - - - - .
Du JJ NP - - - du
bist NN NP ^ - - - bist
. . - - - - .
Er PRP NP - - - er
ist VB VP - - - sein
. . - - - - .
Wir PRP NP - - - wir
sind VB VP - - - sein
. . - - - - .
Ihr PRP$ NP - - - ihr
seid NN NP ^ - - - seid
. . - - - - .
Sie PRP NP - - - sie
sind VB VP - - - sein
. . - - - - .
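Until the tagger itself handles these forms, one possible stopgap is to post-process the tagger output. The following is only a minimal sketch under stated assumptions: `OVERRIDES` and `apply_overrides` are hypothetical names (not part of textblob-de or pattern), the tokens are assumed to be available as `(word, tag, lemma)` tuples, and the table covers only the unambiguous forms reported above. It does not repair the chunk column, and it deliberately leaves "Ihr" alone because that word is genuinely ambiguous between PRP and PRP$.

```python
# Hypothetical override table: lowercase word -> (tag, lemma).
# Only the unambiguous forms reported in this issue are listed.
OVERRIDES = {
    "du": ("PRP", "du"),
    "bist": ("VB", "sein"),
    "seid": ("VB", "sein"),
}


def apply_overrides(tokens):
    """Return a copy of (word, tag, lemma) tuples with known bad tags replaced."""
    fixed = []
    for word, tag, lemma in tokens:
        # Fall back to the tagger's own tag and lemma when no override exists.
        tag, lemma = OVERRIDES.get(word.lower(), (tag, lemma))
        fixed.append((word, tag, lemma))
    return fixed


tagged = [("Du", "JJ", "du"), ("bist", "NN", "bist"), (".", ".", ".")]
print(apply_overrides(tagged))
```

A lookup table like this only masks the symptom for a handful of words; it is no substitute for a fix in the underlying tagger.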
Thanks for the report. Unfortunately, this seems to be an issue in the pattern library, which textblob-de uses without changes to the source code other than making it Python 3 compatible. It would be great if you could report it directly to the pattern project at https://github.com/clips/pattern/issues
You could use the following test or provide a link to this issue for them to be able to verify the strange behaviour:
# Tested on Python2.7.8, 32bit, on Windows 8.1 (64bit)
# pattern.__version__
# '2.6'
In [1]: from pattern.de import parse, pprint
In [2]: pprint(parse("Ich bin. Du bist. Er ist. Wir sind. Ihr seid. Sie sind.", lemmata=True))
WORD TAG CHUNK ROLE ID PNP LEMMA
Ich PRP NP - - - ich
bin VB VP - - - sein
. . - - - - .
WORD TAG CHUNK ROLE ID PNP LEMMA
Du PRP NP - - - du
bist NN NP ^ - - - bist
. . - - - - .
WORD TAG CHUNK ROLE ID PNP LEMMA
Er PRP NP - - - er
ist VB VP - - - sein
. . - - - - .
WORD TAG CHUNK ROLE ID PNP LEMMA
Wir PRP NP - - - wir
sind VB VP - - - sein
. . - - - - .
WORD TAG CHUNK ROLE ID PNP LEMMA
Ihr PRP$ NP - - - ihr
seid NN NP ^ - - - seid
. . - - - - .
WORD TAG CHUNK ROLE ID PNP LEMMA
Sie PRP NP - - - sie
sind VB VP - - - sein
. . - - - - .
In [3]: pprint(parse("Ihr seid alle herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
WORD TAG CHUNK ROLE ID PNP LEMMA
Ihr PRP$ NP - - - ihr
seid NN NP ^ - - - seid
alle RB ADJP - - - alle
herzlich JJ ADJP ^ - - - herzlich
eingeladen VBN VP - - - einladen
zu IN PP - - PNP zu
meinem PRP$ NP - - PNP meinem
Geburtstagsfest NN NP ^ - - PNP geburtstagsfest
. . - - - - .
In [4]: pprint(parse("Du bist herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
WORD TAG CHUNK ROLE ID PNP LEMMA
Du PRP NP - - - du
bist NN NP ^ - - - bist
herzlich JJ ADJP - - - herzlich
eingeladen VBN VP - - - einladen
zu IN PP - - PNP zu
meinem PRP$ NP - - PNP meinem
Geburtstagsfest NN NP ^ - - PNP geburtstagsfest
. . - - - - .
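To make the check above repeatable, the printed table can be scanned for forms of "sein" that did not receive a verb tag. This is a rough sketch, not part of pattern or textblob-de: `SEIN_FORMS` and `find_mistagged` are hypothetical names, the list of verb forms is limited to the ones in this thread, and the input is assumed to be the whitespace-separated table text exactly as printed by `pprint` (word first, tag second, lemma last).

```python
# Present-tense forms of "sein" that appear in this thread (assumed, not exhaustive).
SEIN_FORMS = {"bin", "bist", "ist", "sind", "seid"}


def find_mistagged(table_text):
    """Return (word, tag, lemma) rows where a form of 'sein' is not tagged as a verb."""
    problems = []
    for line in table_text.splitlines():
        cols = line.split()
        # Skip blank lines and the repeated "WORD TAG CHUNK ..." header rows.
        if len(cols) < 3 or cols[0] == "WORD":
            continue
        # Word and tag are the first two columns; the lemma is always the last one,
        # even when the chunk column carries an extra "^" marker.
        word, tag, lemma = cols[0], cols[1], cols[-1]
        if word.lower() in SEIN_FORMS and not tag.startswith("VB"):
            problems.append((word, tag, lemma))
    return problems


sample = """\
WORD TAG CHUNK ROLE ID PNP LEMMA
Du PRP NP - - - du
bist NN NP ^ - - - bist
. . - - - - .
WORD TAG CHUNK ROLE ID PNP LEMMA
Er PRP NP - - - er
ist VB VP - - - sein
. . - - - - .
"""
print(find_mistagged(sample))  # [('bist', 'NN', 'bist')]
```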
Thanks for your fast reply. I'm using Python 3.4 64bit on Windows 8.1 and need to investigate further. (I'm halfway through the NLTK book and was thinking about starting such a project myself when I saw that you had already started a project for German. First of all, thanks for that. ^^ I'll still need to adjust it for German dialects anyway, as we use many different ones in our German chat room. Many smilies are also missing, at least for us.)
Thanks for further investigating the issue and contributing your results to the pattern project. They seem to be rather stretched for resources. What I like about the pattern implementation is that it is pure Python and its lemmatization is quite fast compared to other taggers. However, accuracy is a major problem. I've been working on textblob-rftagger for a while and the results are promising. It's not ready for public release yet, but if you contact me via email, I could invite you to the bitbucket repo (if you're interested).
By the way, do you know whether rftagger is open source?
@mk270 rftagger is open source and its source code is available under the following links:
project page: http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/
source code: http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/data/RFTagger.tar.gz
However, it is "freely available for education, research and other non-commercial purposes" only. If this is not a problem for your project, feel free to contact me via email so that I can invite you to the bitbucket repository of textblob-rftagger. It is fully functional and works on Windows, OS X and Linux. The only reason I'm holding it back is that I haven't had the time to sort out a sensible and secure way of distributing the included binaries, and I'm unsure about the licensing of these binaries.
So, "available for education, research and other non-commercial purposes" is fairly canonically NOT open source, see for instance http://opensource.org/osd-annotated section 6.
It's a shame. I presume, since they've chosen to exclude commercial use, that they're not going to be biddable.
Thanks for the link. I interpreted 'open source' as 'is the source code available/accessible' (i.e. can it be modified/tweaked, etc.), which it is.
For other projects that they released under a similarly restrictive license, they added:
"In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!" (Source: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). So I assume they might be willing to make a decision or offer a quote on a per-project basis, and that they want to know what the project is about if you intend to use their software commercially.
Yes, indeed, it's not remotely open source in that term's conventional acceptation - it's proprietary.
I am asking as it's a dependency of another project I'm interested in. Ah well.