
#1454 - German OpenNLP chunker model

Open aggarwalpiush opened this issue 4 years ago • 19 comments

This PR was created

  • to implement the German chunker model for OpenNLP (requested in issue #1454).
  • to fix existing bugs:
    • update the ixa en perceptron model
    • update the ixa es perceptron model
    • remove the ixa es pos maxent model

Kindly review.

aggarwalpiush avatar Mar 03 '20 00:03 aggarwalpiush

Can one of the admins verify this patch?

ukp-svc-jenkins avatar Mar 03 '20 00:03 ukp-svc-jenkins

@aggarwalpiush I'm looking into the PR - and also trying to add a few additional IXA models in the process.

Your chunker model in particular is causing problems, though, because of the SSL certificate used on your web server. I get this error when trying to download your model using the build script:

javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

See also: https://www.sslshopper.com/ssl-checker.html#hostname=https://www.ltl.uni-due.de/content/6-software/de-chunker-opennlp.bin
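
For reference, the failure can be reproduced outside the build script with a few lines of plain JDK code - a minimal sketch (the class name is made up; the URL is the model location quoted above). A "PKIX path building failed" error usually means the server does not send its intermediate certificate(s), so the default JDK truststore cannot build a chain to a trusted root:

import java.io.InputStream;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class CertCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL(
            "https://www.ltl.uni-due.de/content/6-software/de-chunker-opennlp.bin");
        HttpsURLConnection con = (HttpsURLConnection) url.openConnection();
        // Throws javax.net.ssl.SSLHandshakeException ("PKIX path building
        // failed") during the handshake if the served chain is incomplete.
        try (InputStream in = con.getInputStream()) {
            System.out.println("Handshake OK, HTTP " + con.getResponseCode());
        }
    }
}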

reckart avatar Apr 05 '20 10:04 reckart

I had a look at the IXA models - the models that come from morph-models-1.5.0 are not meant to be used with OpenNLP but rather with the IXA pipes from the IXA module we also have. So instead of updating the build scripts for the models from morph-models-1.5.0, we should actually drop them from the OpenNLP module. They are already included in the IXA module.

reckart avatar Apr 05 '20 15:04 reckart

Cf. https://github.com/dkpro/dkpro-core/issues/1465

reckart avatar Apr 05 '20 15:04 reckart

@aggarwalpiush I'm looking into the PR - and also trying to add a few additional IXA models in the process.

Your chunker model in particular is causing problems, though, because of the SSL certificate used on your web server. I get this error when trying to download your model using the build script:

javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

See also: https://www.sslshopper.com/ssl-checker.html#hostname=https://www.ltl.uni-due.de/content/6-software/de-chunker-opennlp.bin

@reckart I have added the missing certificates to the LTL web server. Could you please try downloading the models from the web server using the build script?

aggarwalpiush avatar Apr 08 '20 15:04 aggarwalpiush

Cf. #1465

I can see that in that issue the IXA models have already been removed from the build script. If everything works now, can we merge the changes?

aggarwalpiush avatar Apr 08 '20 15:04 aggarwalpiush

I have updated this PR with a couple of changes - please have a look.

I assume the POS tags used to train the model were from the STTS tagset?

Are the chunk tags part of the TIGER corpus? Where are they documented? I didn't find documentation of them in the syntax annotation guidelines.

reckart avatar Apr 15 '20 06:04 reckart

I still cannot upload the model because the maven-ant-tasks try to access Maven Central in the process and do so via http - but http is no longer supported, only https. We either need to figure out how to reconfigure them to use https, or switch to a newer Ant Maven task, because the one we currently use is deprecated.

reckart avatar Apr 15 '20 06:04 reckart

I have updated this PR with a couple of changes - please have a look.

@reckart the changes look good to me

I assume the POS tags used to train the model were from the STTS tagset?

Are the chunk tags part of the TIGER corpus? Where are they documented? I didn't find documentation of them in the syntax annotation guidelines.

@mariebexte could you please provide these details?

aggarwalpiush avatar Apr 15 '20 08:04 aggarwalpiush

I still cannot upload the model because the maven-ant-tasks try to access Maven Central in the process and do so via http - but http is no longer supported, only https. We either need to figure out how to reconfigure them to use https, or switch to a newer Ant Maven task, because the one we currently use is deprecated.

I don't have read access to the issue management for the Apache Ant tasks, but I believe this issue was resolved in bug MANTTASKS-11 as of release 2.0.7. Can we check whether this release is the right version to solve the issue?

aggarwalpiush avatar Apr 15 '20 09:04 aggarwalpiush

I have seen MANTTASKS-11, but I haven't yet been able to look into how to actually configure an alternative URL for Maven Central. And then I noticed that the tasks we use are outdated anyway and that https://maven.apache.org/resolver-ant-tasks/ is the new replacement - so I wasn't sure whether investigating a fix of the setup with the old tasks is worth it.

Do you want to investigate?

reckart avatar Apr 15 '20 09:04 reckart

I assume the POS tags used to train the model were from the STTS tagset? Are the chunk tags part of the TIGER corpus? Where are they documented? I didn't find documentation of them in the syntax annotation guidelines.

@mariebexte could you please provide these details?

The POS tags are STTS.

As for the chunk tags, these are part of the TIGER corpus. The PDF containing their documentation comes with the corpus download, but it can also be found online here.

mariebexte avatar Apr 15 '20 10:04 mariebexte

@mariebexte I didn't read the full documentation - I only searched for "chunk" and that was the only sentence I found:

Foreign-language quotations are annotated flatly as chunks (CH); the individual components receive the label UC ("unit component").

It seems to me that the chunks are some kind of projection of the phrase categories to the word level that you did yourself?

reckart avatar Apr 15 '20 10:04 reckart

It seems to me that the chunks are some kind of projection of the phrase categories to the word level that you did yourself?

Yes, you're right. We used NPs, VPs and PPs for the respective chunks.

This meant giving each token a B-[NP|VP|PP] (beginning of chunk), I-[NP|VP|PP] (continuation of chunk), or O (not part of a chunk) annotation, derived from the NP, VP, and PP annotations in TIGER. So, if TIGER annotated two tokens A and B as an NP, we would annotate A as the beginning of the chunk (B-NP) and B as I-NP.
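
To make the projection concrete, here is a minimal sketch (a hypothetical helper, not the actual conversion code used for training): given non-overlapping NP/VP/PP token spans taken from TIGER, it emits one B-/I-/O tag per token.

import java.util.Arrays;

public class BioProjection {

    // A phrase over token indices [begin, end), e.g. an NP from TIGER.
    record Phrase(int begin, int end, String label) {}

    // Project non-overlapping phrase spans onto per-token BIO tags.
    static String[] toBio(int numTokens, Phrase[] phrases) {
        String[] tags = new String[numTokens];
        Arrays.fill(tags, "O"); // tokens outside any phrase stay O
        for (Phrase p : phrases) {
            tags[p.begin()] = "B-" + p.label(); // first token opens the chunk
            for (int i = p.begin() + 1; i < p.end(); i++) {
                tags[i] = "I-" + p.label(); // following tokens continue it
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // "Wir brauchen kein einfaches Beispiel ." with an NP over tokens 2-4
        Phrase[] phrases = {
            new Phrase(0, 1, "NP"), new Phrase(1, 2, "VP"), new Phrase(2, 5, "NP")
        };
        System.out.println(Arrays.toString(toBio(6, phrases)));
        // -> [B-NP, B-VP, B-NP, I-NP, I-NP, O]
    }
}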

mariebexte avatar Apr 15 '20 11:04 mariebexte

@mariebexte I added a unit test with your model. Here are the results:

Text: Wir brauchen ein sehr kompliziertes Beispiel , welches möglichst viele Konstituenten und Dependenzen beinhaltet .

                "[  0,  3]NC(NP) (Wir)",
                "[  4, 12]VC(VP) (brauchen)",
                "[ 13, 16]NC(NP) (ein)",
                "[ 36, 44]NC(NP) (Beispiel)",
                "[ 47, 54]NC(NP) (welches)",
                "[ 55, 64]VC(VP) (möglichst)",
                "[ 65, 70]NC(NP) (viele)",
                "[ 71, 84]NC(NP) (Konstituenten)",
                "[ 89,100]NC(NP) (Dependenzen)",
                "[101,111]VC(VP) (beinhaltet)"

It would seem as if something went wrong with the BIO encoding during training, because the chunks all appear to be single-word chunks. Also, not all words are included in a chunk? Normally, a chunk consisting of multiple words should be returned as a larger span by the OpenNLP chunker. For comparison, here is an example from another model (en, perceptron-ixa):

Text: We need a very complicated example sentence, which contains as many constituents and dependencies as possible.


                "[  0,  2]NC(NP) (We)",
                "[  3,  7]VC(VP) (need)",
                "[  8, 43]NC(NP) (a very complicated example sentence)",
                "[ 45, 50]NC(NP) (which)",
                "[ 51, 59]VC(VP) (contains)",
                "[ 60, 62]O(SBAR) (as)",
                "[ 63, 97]NC(NP) (many constituents and dependencies)",
                "[ 98,100]PC(PP) (as)",
                "[101,109]ADJC(ADJP) (possible)"

reckart avatar Apr 15 '20 11:04 reckart

It would seem as if something went wrong with the BIO encoding during training, because the chunks all appear to be single-word chunks. Also, not all words are included in a chunk?

When I chunk the same sentence on the command line (tagging with opennlp POSTagger de-pos-maxent and then chunking with opennlp ChunkerME and the model), tokens that are not part of a chunk are also returned:

[NP Wir_PPER ]
[VP brauchen_VVFIN ]
[NP ein_ART ]
sehr_ADV
kompliziertes_ADJA
[NP Beispiel_NN ]
,_$,
[NP welches_PRELS ]
[VP möglichst_VVFIN ]
[NP viele_PIAT ]
[NP Konstituenten_NN ]
und_KON
[NP Dependenzen_NN ]
[VP beinhaltet_VVFIN ]
._$.

I agree that it is not desirable to end up with this many single-word chunks, so I'll have to dig into TIGER to see whether that is caused by how it was annotated or whether something went wrong with our BIO tags. In general, the model is capable of returning multi-word chunks:

Text: Wir brauchen kein einfaches Beispiel .
[NP Wir_PPER ] 
[VP brauchen_VVFIN ] 
[NP kein_PIAT einfaches_ADJA Beispiel_NN ] 
._$.

mariebexte avatar Apr 15 '20 22:04 mariebexte

Do you want to provide an updated model or should it be merged as it is?

reckart avatar Apr 26 '20 20:04 reckart

Sorry for not getting back to you earlier.

I am afraid the results we discussed are due to how phrases are annotated in TIGER, hence I won't be able to provide an updated model.

mariebexte avatar Apr 27 '20 10:04 mariebexte

Jenkins, can you test this please?

reckart avatar Apr 27 '20 18:04 reckart