textblob-de

Strange tagging for some verb forms of "sein" and for "du"

Open CJAnti opened this issue 10 years ago • 8 comments

I stumbled over some strange tagging and was wondering why "bist" and "seid" are not recognized as verb forms of "sein", even though they are listed in the "de-verbs.txt" file. Tagging the personal pronoun "du" as an adjective doesn't make much sense either.

>>> blob=TextBlob("Ich bin. Du bist. Er ist. Wir sind. Ihr seid. Sie sind.",
                  parser=PatternParser(pprint=True, lemmata=True))
>>> blob.parse()
          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA   

           Ich   PRP    NP      -      -      -      ich     
           bin   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .       
            Du   JJ     NP      -      -      -      du      
          bist   NN     NP ^    -      -      -      bist    
             .   .      -       -      -      -      .       
            Er   PRP    NP      -      -      -      er      
           ist   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .       
           Wir   PRP    NP      -      -      -      wir     
          sind   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .       
           Ihr   PRP$   NP      -      -      -      ihr     
          seid   NN     NP ^    -      -      -      seid    
             .   .      -       -      -      -      .       
           Sie   PRP    NP      -      -      -      sie     
          sind   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .     
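
Until the tagger itself is fixed, one possible workaround is to post-correct the rows after parsing. This is only a sketch under stated assumptions: `SEIN_FORMS` and `fix_tags` are hypothetical names (not part of textblob-de or pattern), and the input is assumed to be plain (word, tag, lemma) tuples copied from the table above rather than pattern's own data structures.

```python
# Hypothetical post-correction step; NOT part of textblob-de or pattern.
# Present-tense finite forms of "sein" that should be tagged as verbs.
SEIN_FORMS = {"bin", "bist", "ist", "sind", "seid"}


def fix_tags(rows):
    """Return a new list of (word, tag, lemma) tuples with known mis-tags overridden.

    `rows` is assumed to be an iterable of (word, tag, lemma) tuples,
    e.g. copied from the WORD/TAG/LEMMA columns of the parser output.
    """
    fixed = []
    for word, tag, lemma in rows:
        lower = word.lower()
        if lower in SEIN_FORMS:
            # Force the verb tag and the lemma "sein", as for "bin"/"ist"/"sind".
            tag, lemma = "VB", "sein"
        elif lower == "du":
            # "du" is a personal pronoun, never an adjective.
            tag = "PRP"
        fixed.append((word, tag, lemma))
    return fixed


if __name__ == "__main__":
    rows = [("Du", "JJ", "du"), ("bist", "NN", "bist"), (".", ".", ".")]
    for row in fix_tags(rows):
        print(row)
```

A lookup like this only patches the symptoms reported here; it does not touch the CHUNK column and does not explain why the entries in "de-verbs.txt" are ignored.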

CJAnti avatar Oct 25 '14 03:10 CJAnti

Thanks for the report. Unfortunately, this seems to be an issue in the pattern library, which textblob-de uses without any changes to the source code other than making it Python 3 compatible. It would be great if you could report it directly to the pattern project at https://github.com/clips/pattern/issues

You could use the following test or provide a link to this issue for them to be able to verify the strange behaviour:


# Tested on Python2.7.8, 32bit, on Windows 8.1 (64bit)

# pattern.__version__ 
# '2.6'

In [1]: from pattern.de import parse, pprint

In [2]: pprint(parse("Ich bin. Du bist. Er ist. Wir sind. Ihr seid. Sie sind.", lemmata=True))
          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Ich   PRP    NP      -      -      -      ich
           bin   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Du   PRP    NP      -      -      -      du
          bist   NN     NP ^    -      -      -      bist
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Er   PRP    NP      -      -      -      er
           ist   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Wir   PRP    NP      -      -      -      wir
          sind   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Ihr   PRP$   NP      -      -      -      ihr
          seid   NN     NP ^    -      -      -      seid
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Sie   PRP    NP      -      -      -      sie
          sind   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

In [3]: pprint(parse("Ihr seid alle herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
           WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA

            Ihr   PRP$   NP       -      -      -      ihr
           seid   NN     NP ^     -      -      -      seid
           alle   RB     ADJP     -      -      -      alle
       herzlich   JJ     ADJP ^   -      -      -      herzlich
     eingeladen   VBN    VP       -      -      -      einladen
             zu   IN     PP       -      -      PNP    zu
         meinem   PRP$   NP       -      -      PNP    meinem
Geburtstagsfest   NN     NP ^     -      -      PNP    geburtstagsfest
              .   .      -        -      -      -      .

In [4]: pprint(parse("Du bist herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
           WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

             Du   PRP    NP      -      -      -      du
           bist   NN     NP ^    -      -      -      bist
       herzlich   JJ     ADJP    -      -      -      herzlich
     eingeladen   VBN    VP      -      -      -      einladen
             zu   IN     PP      -      -      PNP    zu
         meinem   PRP$   NP      -      -      PNP    meinem
Geburtstagsfest   NN     NP ^    -      -      PNP    geburtstagsfest
              .   .      -       -      -      -      .
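
To make the output above easier to verify automatically, the printed table can be scanned for forms of "sein" that did not receive a verb tag. The following is a minimal sketch: `parse_table`, `mistagged_verbs` and `EXPECTED_VERBS` are hypothetical helper names, and the code works on the printed text only, so it does not need pattern to be installed.

```python
# Hypothetical helpers for checking pprint-style parser output; they only
# read the printed text and do not call pattern or textblob-de.
EXPECTED_VERBS = {"bin", "bist", "ist", "sind", "seid"}


def parse_table(text):
    """Turn pprint-style output into a list of dicts with word, tag and lemma."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        # Skip blank lines and the (possibly repeated) header row.
        if not cols or cols[0] == "WORD":
            continue
        # The chunk column may contain a space ("NP ^"), so only the first,
        # second and last columns are read by position.
        rows.append({"word": cols[0], "tag": cols[1], "lemma": cols[-1]})
    return rows


def mistagged_verbs(rows):
    """Return the words that are forms of "sein" but lack a VB* tag."""
    return [
        row["word"]
        for row in rows
        if row["word"].lower() in EXPECTED_VERBS and not row["tag"].startswith("VB")
    ]


SAMPLE = """\
          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Du   PRP    NP      -      -      -      du
          bist   NN     NP ^    -      -      -      bist
             .   .      -       -      -      -      .
"""

if __name__ == "__main__":
    print(mistagged_verbs(parse_table(SAMPLE)))  # prints ['bist']
```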

markuskiller avatar Oct 25 '14 16:10 markuskiller

Thanks for your fast reply. I'm using Python 3.4 64bit on Windows 8.1 and need to investigate further. (I'm halfway through the NLTK book and was thinking about starting such a project myself when I saw that you had already started a project for German. First of all, thanks for that. ^^ I'll still need to adjust it for German dialects anyway, as we are using many different ones in our German chat room. Also, many smilies are missing, at least for us.)

CJAnti avatar Oct 25 '14 17:10 CJAnti

Thanks for further investigating the issue and contributing your results to the pattern project. They seem to be rather stretched for resources. What I like about the pattern implementation is that it is pure Python and its lemmatization is quite fast compared to other taggers. However, accuracy is a major problem. I've been working on textblob-rftagger for a while and the results are promising. It's not ready for public release yet, but if you contact me via email, I could invite you to the bitbucket repo (if you're interested).

markuskiller avatar Nov 01 '14 08:11 markuskiller

By the way, do you know whether rftagger is open source?

mk270 avatar Jul 22 '15 17:07 mk270

@mk270 rftagger is open source and its source code is available under the following links:

project page: http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/

source code: http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/data/RFTagger.tar.gz

However, it is "freely available for education, research and other non-commercial purposes" only. If this is not a problem for your project, feel free to contact me via email so that I can invite you to the bitbucket repository of textblob-rftagger. It is fully functional and works on Windows, OS X and Linux. The only reasons I'm holding it back are that I haven't had the time to sort out a sensible and secure way of distributing the included binaries and that I'm unsure about the licensing of these binaries.

markuskiller avatar Jul 22 '15 18:07 markuskiller

So, "available for education, research and other non-commercial purposes" is fairly canonically NOT open source, see for instance http://opensource.org/osd-annotated section 6.

It's a shame. I presume, since they've chosen to exclude commercial use, that they're not going to be biddable.

mk270 avatar Jul 22 '15 19:07 mk270

Thanks for the link. I interpreted 'open source' as 'is the source code available/accessible' (i.e. can it be modified/tweaked, etc.), which it is.

On other projects they released under a similarly restrictive license, they added:

"In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!" (Source: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). So I assume that they might be willing to make a decision/offer/quote on a per-project basis and that they want to know what the project is about if you intend to use their software commercially.

markuskiller avatar Jul 22 '15 19:07 markuskiller

Yes, indeed, it's not remotely open source in that term's conventional acceptation - it's proprietary.

I am asking as it's a dependency of another project I'm interested in. Ah well.

mk270 avatar Jul 22 '15 19:07 mk270