stanza
POS tagging's unexpected result on ADJ word
Describe the bug
For this sentence,
He also designed furniture and houses for the wealthy.
wealthy is an adjective that exceptionally heads a nominal phrase, so it should be tagged as ADJ according to the Universal Dependencies guidelines:
On the other hand, adjectives that exceptionally head a nominal phrase (as in the sick, the healthy) are still tagged ADJ.
To Reproduce
$ python3 test.py
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.2.1.json: 139kB [00:00, 688kB/s]
2021-06-25 11:32:45 INFO: Downloading default packages for language: en (English)...
2021-06-25 11:32:54 INFO: Finished downloading models and saved to /Users/stanza_resources.
2021-06-25 11:32:54 INFO: Loading these models for language: en (English):
========================
| Processor | Package |
------------------------
| tokenize | combined |
| pos | combined |
========================
2021-06-25 11:32:54 INFO: Use device: cpu
2021-06-25 11:32:54 INFO: Loading: tokenize
2021-06-25 11:32:54 INFO: Loading: pos
2021-06-25 11:32:56 INFO: Done loading processors!
He None PRON
also None ADV
designed None VERB
furniture None NOUN
and None CCONJ
houses None NOUN
for None ADP
the None DET
wealthy None NOUN
. None PUNCT
$ cat test.py
import stanza
stanza.download('en')
nlp = stanza.Pipeline('en',processors='tokenize,pos')
doc = nlp('He also designed furniture and houses for the wealthy.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)
Expected behavior
wealthy should be tagged as ADJ.
Environment (please complete the following information):
- OS: MacOS
- Python version: Python 3.9.4
- Stanza version: 68aa42653d656f6131ec14837d5f99927ab17d02/1.2.1
The primary problem is that there are no examples of "wealthy" in such a context in the training data. However, there is an example of healthy/sick in which "sick" is incorrectly tagged as NOUN. I'll file an issue on the UD GitHub.
# sent_id = newsgroup-groups.google.com_herpesnation_c74170a0fcfdc880_ENG_20051125_075200-0012
# text = When the healthy treat the sick with scorn and intolerance it brings us all down.
1 When when SCONJ WRB PronType=Int 4 mark 4:mark _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
3 healthy healthy ADJ JJ Degree=Pos 4 nsubj 4:nsubj _
4 treat treat VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 12 advcl 12:advcl:when _
5 the the DET DT Definite=Def|PronType=Art 6 det 6:det _
6 sick sick NOUN NN Number=Sing 4 obj 4:obj _
7 with with ADP IN _ 8 case 8:case _
8 scorn scorn NOUN NN Number=Sing 4 obl 4:obl:with _
9 and and CCONJ CC _ 10 cc 10:cc _
10 intolerance intolerance NOUN NN Number=Sing 8 conj 4:obl:with|8:conj:and _
11 it it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 12 nsubj 12:nsubj _
12 brings bring VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _
13 us we PRON PRP Case=Acc|Number=Plur|Person=1|PronType=Prs 12 obj 12:obj _
14 all all DET DT _ 13 det 13:det _
15 down down ADV RB _ 12 advmod 12:advmod SpaceAfter=No
16 . . PUNCT . _ 12 punct 12:punct _
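A quick way to check how a word is tagged across a treebank is to scan the CoNLL-U file and tally the UPOS column for that word. The sketch below is only an illustration: the function name upos_counts and the inline sample are made up for this example, and it assumes the standard tab-separated, 10-column CoNLL-U layout (the excerpt above shows spaces only because of how it was pasted).

```python
from collections import Counter


def upos_counts(conllu_text, form):
    """Count the UPOS tags assigned to `form` (case-insensitive) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines and comment lines such as "# sent_id = ..."
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A token line has 10 tab-separated columns: ID, FORM, LEMMA, UPOS, ...
        if len(cols) != 10:
            continue
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1")
        if not cols[0].isdigit():
            continue
        if cols[1].lower() == form.lower():
            counts[cols[3]] += 1
    return counts


# Hypothetical sample built from three rows of the EWT sentence above.
sample = "\n".join([
    "# text = When the healthy treat the sick with scorn and intolerance it brings us all down.",
    "3\thealthy\thealthy\tADJ\tJJ\tDegree=Pos\t4\tnsubj\t4:nsubj\t_",
    "4\ttreat\ttreat\tVERB\tVBP\tMood=Ind|Tense=Pres|VerbForm=Fin\t12\tadvcl\t12:advcl:when\t_",
    "6\tsick\tsick\tNOUN\tNN\tNumber=Sing\t4\tobj\t4:obj\t_",
])

print(upos_counts(sample, "sick"))
print(upos_counts(sample, "wealthy"))
```

Running the same function over each training treebank file (read with open(path, encoding="utf-8").read()) would show whether a word such as "wealthy" appears at all, and with which tags.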
Thanks, John.
BTW, although healthy/sick is labelled incorrectly in that example, the POS tagging works correctly if I change "wealthy" to "sick" in this sentence:
$ python3 test.py
2021-06-25 14:30:34 INFO: Loading these models for language: en (English):
========================
| Processor | Package |
------------------------
| tokenize | combined |
| pos | combined |
========================
2021-06-25 14:30:34 INFO: Use device: cpu
2021-06-25 14:30:34 INFO: Loading: tokenize
2021-06-25 14:30:34 INFO: Loading: pos
2021-06-25 14:30:34 INFO: Done loading processors!
He None PRON
also None ADV
designed None VERB
furniture None NOUN
and None CCONJ
houses None NOUN
for None ADP
the None DET
sick None ADJ
. None PUNCT
$ cat test.py
import stanza
nlp = stanza.Pipeline('en',processors='tokenize,pos')
doc = nlp('He also designed furniture and houses for the sick.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)
Which corpora is Stanza trained on besides UD_EWT? Perhaps healthy/sick is labelled correctly in the other corpora.
Yeah, good call. Here's the problem, in GUM:
# sent_id = GUM_voyage_merida-18
# s_type = decl
# text = The wealthy constructed the grand Pasejo Montejo avenue north of the old town, inspired by the Champs-Élysées in Paris.
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det Discourse=joint:25->24|Entity=(person-50
2 wealthy wealthy NOUN NNS Number=Plur 3 nsubj 3:nsubj Entity=person-50)
3 constructed construct VERB VBD Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root 0:root _
4 the the DET DT Definite=Def|PronType=Art 8 det 8:det Entity=(place-51-Paseo_de_Montejo
5 grand grand ADJ JJ Degree=Pos 8 amod 8:amod _
6 Pasejo Pasejo PROPN NNP Number=Sing 8 compound 8:compound _
7 Montejo Montejo PROPN NNP Number=Sing 6 flat 6:flat Entity=(person-32-Francisco_de_Montejo_the_Younger)
8 avenue avenue NOUN NN Number=Sing 3 obj 3:obj Entity=place-51-Paseo_de_Montejo)
9 north north ADV RB Degree=Pos 3 advmod 3:advmod _
10 of of ADP IN _ 13 case 13:case _
11 the the DET DT Definite=Def|PronType=Art 13 det 13:det Bridge=place-1-Mérida%2C_Yucatán<place-52-Mérida%2C_Yucatán|Entity=(place-52-Mérida%2C_Yucatán
12 old old ADJ JJ Degree=Pos 13 amod 13:amod _
13 town town NOUN NN Number=Sing 9 nmod 9:nmod:of Entity=place-52-Mérida%2C_Yucatán)|SpaceAfter=No
14 , , PUNCT , _ 15 punct 15:punct _
15 inspired inspire VERB VBN Tense=Past|VerbForm=Part 13 acl 13:acl Discourse=elaboration:26->25
16 by by ADP IN _ 18 case 18:case _
17 the the DET DT Definite=Def|PronType=Art 18 det 18:det Entity=(place-53-Champs-Élysées
18 Champs-Élysées Champs-Élysées PROPN NNP Number=Sing 15 obl 15:obl:by _
19 in in ADP IN _ 20 case 20:case _
20 Paris Paris PROPN NNP Number=Sing 15 obl 15:obl:in Entity=(place-54-Paris)place-53-Champs-Élysées)|SpaceAfter=No
21 . . PUNCT . _ 3 punct 3:punct _
Cool, John!
Unfortunately, when I retrained the model with the updated data, it still didn't get the correct answer. One possibility is to add more training data to improve the efficacy of the models, but it will be a little while before we do so.
Thanks, John. Nice to know. It may be due to the fact that there are no examples of "wealthy" in such a context in the training data as you said. Do you mind telling us the list of the corpora that Stanza trained on? We could help check out whether other corpora have the same issues and clean them up.
EWT, GUM, PUD, and Pronouns. I don't see any other examples of "wealthy" in any of those.
I see, thank you!
The current models for 1.4.0 tag wealthy, healthy, sick, and poor correctly.
... and I think I can explain why it tags wealthy correctly now. We added GUMReddit to the list of training inputs, bringing the total number of instances of wealthy in our training data to 7, and by default the tagger fine-tunes words when they have 7 or more instances in the training data.
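To make the threshold effect concrete, here is a minimal sketch of a frequency cutoff. It is not Stanza's actual code: the names MIN_COUNT and words_meeting_threshold are hypothetical, lowercasing before counting is an assumption, and the count of 6 below is only illustrative (the thread states the new total, 7, but not the earlier one).

```python
from collections import Counter

# Hypothetical constant name; 7 is the default cutoff mentioned above.
MIN_COUNT = 7


def words_meeting_threshold(training_words, min_count=MIN_COUNT):
    """Return the set of words that occur at least `min_count` times."""
    # Assumption: counting is case-insensitive.
    counts = Counter(w.lower() for w in training_words)
    return {w for w, c in counts.items() if c >= min_count}


# Illustrative data: "wealthy" sits one instance below the cutoff.
words = ["wealthy"] * 6 + ["sick"] * 9

print(sorted(words_meeting_threshold(words)))                # ['sick']
print(sorted(words_meeting_threshold(words + ["wealthy"])))  # ['sick', 'wealthy']
```

With a hard cutoff like this, a single additional training sentence can move a word from below the threshold to above it, which matches the behavior change described above.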
Thanks for your update. It's interesting!