Tagging errors for verbal copulas

Open muchang opened this issue 4 years ago • 2 comments

Describe the bug

For these two sentences:

The first challenge that we have before we can do any kind of analysis of these interstellar dust particles is to find them.
But it is that a word can have just any vocal sound .

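The word "is" in both sentences should be tagged as AUX, since both are verbal copulas. According to the Universal Dependencies guidelines for VERB (https://universaldependencies.org/u/pos/VERB.html) and AUX (https://universaldependencies.org/u/pos/AUX.html), VERB does not cover AUX: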

Note that the VERB tag covers main verbs (content verbs) but it does not cover auxiliary verbs and verbal copulas (in the narrow sense), for which there is the AUX tag.

However, Stanza tags "is" as VERB.

To Reproduce

$ python3 test1.py 
2021-06-26 15:11:56 INFO: Loading these models for language: en (English):
========================
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
========================

2021-06-26 15:11:56 INFO: Use device: cpu
2021-06-26 15:11:56 INFO: Loading: tokenize
2021-06-26 15:11:56 INFO: Loading: pos
2021-06-26 15:11:56 INFO: Done loading processors!
The None DET
first None ADJ
challenge None NOUN
that None PRON
we None PRON
have None VERB
before None SCONJ
we None PRON
can None AUX
do None VERB
any None DET
kind None NOUN
of None ADP
analysis None NOUN
of None ADP
these None DET
interstellar None ADJ
dust None NOUN
particles None NOUN
is None VERB
to None PART
find None VERB
them None PRON
. None PUNCT

$ cat test1.py 
import stanza

#stanza.download('en')
nlp = stanza.Pipeline('en',processors='tokenize,pos',tokenize_pretokenized=True)
doc = nlp("The first challenge that we have before we can do any kind of analysis of these interstellar dust particles is to find them .")

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

$ python3 test2.py 
2021-06-26 15:13:09 INFO: Loading these models for language: en (English):
========================
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
========================

2021-06-26 15:13:09 INFO: Use device: cpu
2021-06-26 15:13:09 INFO: Loading: tokenize
2021-06-26 15:13:09 INFO: Loading: pos
2021-06-26 15:13:09 INFO: Done loading processors!
But None CCONJ
it None PRON
is None VERB
that None SCONJ
a None DET
word None NOUN
can None AUX
have None VERB
just None ADV
any None DET
vocal None ADJ
sound None NOUN
. None PUNCT

$ cat test2.py 
import stanza

#stanza.download('en')
nlp = stanza.Pipeline('en',processors='tokenize,pos',tokenize_pretokenized=True)
doc = nlp("But it is that a word can have just any vocal sound .")

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

Expected behavior

"is" should be tagged as AUX in both sentences.

Environment (please complete the following information):

  • OS: MacOS
  • Python version: Python 3.9.4
  • Stanza version: 68aa42653d656f6131ec14837d5f99927ab17d02/1.2.1

Additional context

I noticed that the UD_English-GUM corpus (https://github.com/UniversalDependencies/UD_English-GUM) contains quite a few incorrect labels (nearly 300 items) that tag the verbal copula "be" as VERB, which might be the root cause. Do you think so?

muchang avatar Jun 26 '21 08:06 muchang

It looks like there are quite a few in EWT as well. EWT has 284, GUM has 86. In GUM, "there is" seems to be a common use of is_VBZ. In EWT there are many more is_VBZ constructions, and many of them appear incorrect. I'll file an issue there.
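For anyone who wants to reproduce counts like these, here is a minimal sketch of one way to tally the UPOS tags given to a word form in a CoNLL-U treebank file. It is not the script used for the numbers above. The function name `count_upos_for_form` and the tiny inline sample are hypothetical; the only thing taken as given is the standard 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, ...).

```python
from collections import Counter


def count_upos_for_form(conllu_text, form="is"):
    """Tally the UPOS tags given to `form` (case-insensitive) in CoNLL-U text.

    Hypothetical helper for illustration; not part of Stanza or the UD tools.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, tok_form, upos = cols[0], cols[1], cols[3]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in tok_id or "." in tok_id:
            continue
        if tok_form.lower() == form:
            counts[upos] += 1
    return counts


# Tiny made-up sample: one "is" tagged AUX, one tagged VERB.
sample = "\n".join([
    "# text = It is here .",
    "\t".join(["1", "It", "it", "PRON", "PRP", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "is", "be", "AUX", "VBZ", "_", "3", "cop", "_", "_"]),
    "\t".join(["3", "here", "here", "ADV", "RB", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", ".", "_", "3", "punct", "_", "_"]),
    "",
    "# text = There is hope .",
    "\t".join(["1", "There", "there", "PRON", "EX", "_", "2", "expl", "_", "_"]),
    "\t".join(["2", "is", "be", "VERB", "VBZ", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "hope", "hope", "NOUN", "NN", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", ".", "_", "2", "punct", "_", "_"]),
])

is_counts = count_upos_for_form(sample, "is")
print(dict(is_counts))
```

To run it on a real treebank, read the training file (for example with `open(path, encoding="utf-8").read()`, where `path` points at a local copy of the `.conllu` file) and pass the text to the function. The `VERB` entry of the result is the number to compare.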

AngledLuffa avatar Jun 26 '21 23:06 AngledLuffa

Thanks, John. Sounds good.

muchang avatar Jun 27 '21 04:06 muchang

Somewhere along the way, the data improvements to the original treebanks and/or the addition of the charlm to the POS tagger resulted in the latest models producing AUX instead of VERB for both of those examples.
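A minimal sketch of how the two examples could be re-checked against whatever models are installed locally. `upos_of` and `tag_with_stanza` are hypothetical helper names. The pipeline arguments are copied from the test scripts in the report, and `word.upos` is the attribute Stanza uses for the universal POS tag. The result depends on the installed model version, so no output is claimed here.

```python
def upos_of(tagged, form):
    """Return the UPOS tag of every occurrence of `form` in (text, upos) pairs."""
    return [upos for text, upos in tagged if text.lower() == form]


def tag_with_stanza(sentence):
    """Tag a pre-tokenized, space-separated sentence with Stanza.

    Requires `pip install stanza` and a downloaded English model;
    the import is kept local so the helper above works without it.
    """
    import stanza

    nlp = stanza.Pipeline('en', processors='tokenize,pos',
                          tokenize_pretokenized=True)
    doc = nlp(sentence)
    return [(word.text, word.upos)
            for sent in doc.sentences for word in sent.words]


# Tags copied from the test2.py output in the original report (Stanza 1.2.1).
reported = [("But", "CCONJ"), ("it", "PRON"), ("is", "VERB"),
            ("that", "SCONJ"), ("a", "DET"), ("word", "NOUN"),
            ("can", "AUX"), ("have", "VERB"), ("just", "ADV"),
            ("any", "DET"), ("vocal", "ADJ"), ("sound", "NOUN"),
            (".", "PUNCT")]
print(upos_of(reported, "is"))
```

With Stanza installed, `upos_of(tag_with_stanza("But it is that a word can have just any vocal sound ."), "is")` gives the tag from the current model for comparison with the reported one.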

AngledLuffa avatar Oct 03 '23 07:10 AngledLuffa