[QUESTION] Getting a tagset for a stanza model

Open k-sap opened this issue 3 years ago • 11 comments

I'd like to, preferably programmatically as in Spacy, get tagsets for all processors for a given model/language in Stanza.

E.g. all NER tags, all dependency relations. (I know that NER tags are included in the documentation.) I need to know what the possible labels returned by Stanza are, and to be able to update it easily.

Is it possible to get such information for a specific Stanza model?

k-sap avatar Jun 28 '22 13:06 k-sap

Programmatically, it is easiest to get the tags from the NER. Once the pipeline is created, you can do this:

pipe.processors['ner'].get_known_tags()

Getting the relations from depparse is possible as well. I'll add a function to make this a bit easier to access.

pipe.processors['depparse'].vocab['deprel']._unit2id.keys()

Just to be clear, though, you cannot "update" any of those models easily... you can always retrain them with new labels or whatever, though.

AngledLuffa avatar Jun 28 '22 16:06 AngledLuffa

Many thanks!

To sum up, all the vocabs I want to extract:

'ner', pipe.processors['ner'].get_known_tags()
'deprels', list(pipe.processors['depparse'].vocab['deprel']._unit2id.keys())
'upos', list(pipe.processors['pos'].vocab['upos']._unit2id.keys())
'xpos', list(pipe.processors['pos'].vocab['xpos']._unit2id.items())
'feats', list(pipe.processors['pos'].vocab['feats']._unit2id.items())

I also see that I need to omit the meaningless tags - <PAD>, <UNK>, <EMPTY>.
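A minimal sketch of how the flat lookups above could be wrapped up, with the placeholder entries dropped. It relies only on the accessors quoted in this thread; `real_tags`, `collect_tagsets`, and `SPECIAL_TAGS` are hypothetical names, not part of Stanza, and the stand-in pipeline at the end exists only so the snippet runs without downloading a model.

```python
from types import SimpleNamespace

# Placeholder entries named in this thread. Treating them as the full list
# is an assumption; pass a different `special` if a vocab has more.
SPECIAL_TAGS = ("<PAD>", "<UNK>", "<EMPTY>")


def real_tags(units, special=SPECIAL_TAGS):
    """Return the labels in `units`, in order, without placeholder entries."""
    return [u for u in units if u not in special]


def collect_tagsets(pipe, special=SPECIAL_TAGS):
    """Gather flat tagsets from a pipeline-like object.

    Processors that were not loaded are skipped. xpos and feats are left
    out on purpose because their vocabs can be nested for some languages.
    """
    tagsets = {}
    procs = pipe.processors
    if "ner" in procs:
        tagsets["ner"] = real_tags(procs["ner"].get_known_tags(), special)
    if "depparse" in procs:
        deprel = procs["depparse"].vocab["deprel"]._unit2id
        tagsets["deprels"] = real_tags(deprel.keys(), special)
    if "pos" in procs:
        upos = procs["pos"].vocab["upos"]._unit2id
        tagsets["upos"] = real_tags(upos.keys(), special)
    return tagsets


# Stand-in for a real pipeline (hypothetical data, only a POS processor).
fake_upos = SimpleNamespace(
    _unit2id={"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "NOUN": 3, "VERB": 4}
)
fake_pipe = SimpleNamespace(
    processors={"pos": SimpleNamespace(vocab={"upos": fake_upos})}
)
print(collect_tagsets(fake_pipe))  # {'upos': ['NOUN', 'VERB']}
```

With a real pipeline, the call would presumably be `collect_tagsets(pipe)` after `pipe = stanza.Pipeline(...)`.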

k-sap avatar Jul 01 '22 15:07 k-sap

Thank you for extrapolating the other three on your own; this has to be one of the least painful support experiences I've had recently :)

  • I have added convenience methods to the dev branch for pos_processor as well: https://github.com/stanfordnlp/stanza/commit/fd90257ab8def962e4622e500d5bde77a964a057
  • There is also a <ROOT> "meaningless tag". You may simply want to eliminate everything from VOCAB_PREFIX, as in the change I just made.

AngledLuffa avatar Jul 02 '22 00:07 AngledLuffa

Thank you! For me, the code above is enough. However, I see two issues related to the auxiliary methods you added.

  • get_known_feats does not return values for specific morph features
  • get_known_xpos does not behave well e.g. for Polish because of the dictionary structure:
nlp = stanza.Pipeline('pl'); nlp.processors['pos'].vocab['xpos']._unit2id
{
0: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'subst': 4, 'prep': 5, 'fin': 6, 'conj': 7, 'interp': 8, 'adj': 9, 'num': 10, 'pcon': 11, 'ppron3': 12, 'adv': 13, 'pact': 14, 'part': 15, 'siebie': 16, 'ppas': 17, 'praet': 18, 'comp': 19, 'adja': 20, 'ger': 21, 'inf': 22, 'pred': 23, 'adjp': 24, 'aglt': 25, 'impt': 26, 'pant': 27, 'frag': 28, 'ppron12': 29, 'bedzie': 30, 'winien': 31, 'imps': 32, 'brev': 33, 'interj': 34, 'depr': 35, 'dig': 36, 'adjc': 37, 'emo': 38, 'ign': 39, 'romandig': 40}, 
1: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'sg': 4, 'loc': 5, 'inst': 6, 'pl': 7, 'acc': 8, 'gen': 9, 'imperf': 10, 'pos': 11, 'dat': 12, 'com': 13, 'perf': 14, 'pun': 15, 'sup': 16, 'npun': 17, 'nom': 18, 'wok': 19},
2: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'nom': 4, 'nwok': 5, 'loc': 6, 'inst': 7, 'ter': 8, 'acc': 9, 'gen': 10, 'wok': 11, 'f': 12, 'dat': 13, 'n': 14, 'm1': 15, 'm3': 16, 'pri': 17, 'sec': 18, 'm2': 19, 'voc': 20}, 
3: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'm1': 4, 'f': 5, 'm3': 6, 'imperf': 7, 'n': 8, 'm2': 9, 'perf': 10}, 
4: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'pos': 4, 'ncol': 5, 'rec': 6, 'ter': 7, 'imperf': 8, 'col': 9, 'congr': 10, 'perf': 11, 'pt': 12, 'com': 13, 'wok': 14, 'nwok': 15, 'pri': 16, 'sec': 17, 'sup': 18, 'nagl': 19, 'agl': 20}, 
5: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'akc': 4, 'aff': 5, 'col': 6, 'ncol': 7, 'nakc': 8, 'neg': 9}, 
6: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'praep': 4, 'npraep': 5}
}

I can work around both cases.

k-sap avatar Jul 05 '22 17:07 k-sap

get_known_feats does not return values for specific morph features

What would you like to see? Instead of

['Abbr', 'Case', 'Definite', 'Degree', 'ExtPos', 'Foreign', 'Gender', 'Mood', 'NumForm', 'NumType', 'Number', 'Person', 'Polarity', 'Poss', 'PronType', 'Reflex', 'Style', 'Tense', 'Typo', 'VerbForm', 'Voice']

you would want items like this?

Number=Sing

get_known_xpos does not behave well e.g. for Polish because of the dictionary structure:

Sorry. Polish is not a language I work with much, and POS is not a model I work with much, so hopefully this oversight is understandable.

What would be useful in this case? I am not sure putting each of the maps into one big bucket would be valuable.

AngledLuffa avatar Jul 05 '22 18:07 AngledLuffa

What would you like to see?

I added my version in #1073

What would be useful in this case? I am not sure putting each of the maps into one big bucket would be valuable.

For Polish I only need the first mapping, although that is not a general solution.

The problem is that get_known_xpos now returns different kinds of results for different languages (rather than the actual tagset for each language):

xposes = dict()
for lang in ['en', 'fr', 'de', 'pl']:
    nlp = stanza.Pipeline(lang)
    xposes[lang] = nlp.processors['pos'].get_known_xpos()

results in:

{'en': ['NN', 'IN', 'DT', 'NNP', 'JJ', 'PRP', '.', 'RB', 'NNS', ',', 'VB', 'CC', 'VBD', 'VBP', 'VBZ', 'VBN', 'CD', 'VBG', 'TO', 'PRP$', 'MD', ':', '-RRB-', '-LRB-', 'WDT', '``', "''", 'WRB', 'UH', 'HYPH', 'RP', 'WP', 'POS', 'NNPS', 'JJR', 'JJS', 'RBR', 'EX', 'SYM', 'FW', 'NFP', 'PDT', 'ADD', 'GW', '$', 'RBS', 'LS', 'AFX', 'WP$', 'XX'],
'fr': ['Gender=Fem|Number=Plur', 'Gender=Masc|Number=Sing'],
'de': ['NN', 'ART', 'APPR', 'NE', 'ADJA', '$.', '$,', 'VVFIN', 'ADV', '$(', 'VAFIN', 'KON', 'CARD', 'ADJD', 'PPER', 'VVPP', 'PPOSAT', 'VVINF', 'PRELS', 'KOKOM', 'KOUS', 'PRF', 'PTKVZ', 'PAV', 'PIAT', 'VMFIN', 'PIS', 'PDAT', 'PTKZU', 'PTKNEG', 'FM', 'PDS', 'VAINF', 'TRUNC', 'PWAV', 'KOUI', 'VVIZU', 'XY', 'VAPP', 'PWS', 'PRELAT', 'VMINF', 'PTKA', 'APZR', 'VVIMP', 'APPO', 'PWAT', 'APPRART', 'ITJ', 'PTKANT', 'PPOSS'], 
'pl': [0, 1, 2, 3, 4, 5, 6]}

German and English look good; French and Polish don't give tagsets. I understand why it happens for Polish. Integer results are fine.

However, for French I think that Stanza does not return xpos at all (dataset). I wonder where ['Gender=Fem|Number=Plur', 'Gender=Masc|Number=Sing'] comes from. I doubt it's a meaningful result.

k-sap avatar Jul 06 '22 17:07 k-sap

For the French GSD model, there are no xpos tags in the dataset, so that is normal behavior. The features are labeled in that dataset and are predicted using the output from the upos tagger.

Thanks for the change regarding the features! I have integrated that.

What about returning a list of sets for the nested tags? You can get the first set in that case. I suppose that brings up an interface question of deciding if a non-nested set of xpos tags should have a list of 1 set or just be the set itself.
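The "list of sets" idea could look roughly like the sketch below, which settles the interface question by always returning a list, with a single set for a flat vocab. It works on plain dicts shaped like the `_unit2id` dump earlier in this thread; `xpos_as_sets` is a hypothetical helper, not a Stanza method, and the flat English mapping is an illustrative guess at the shape rather than real model data.

```python
# The four placeholder entries visible in the Polish dump in this thread.
SPECIAL_TAGS = frozenset({"<PAD>", "<UNK>", "<EMPTY>", "<ROOT>"})


def xpos_as_sets(unit2id, special=SPECIAL_TAGS):
    """Return xpos labels as a list of sets, one set per tag position.

    A flat mapping (label -> id) yields a list with one set; a nested
    mapping (position -> {label -> id}) yields one set per position.
    `special` must be a set or frozenset.
    """
    nested = all(isinstance(v, dict) for v in unit2id.values())
    if unit2id and nested:
        slots = [unit2id[k] for k in sorted(unit2id)]
    else:
        slots = [unit2id]
    return [set(slot) - special for slot in slots]


# Abbreviated copy of two positions from the Polish dump.
polish = {
    0: {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "subst": 4, "prep": 5},
    6: {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "praep": 4, "npraep": 5},
}
# Hypothetical flat mapping, for illustration only.
english = {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "NN": 4, "IN": 5}

print(xpos_as_sets(polish)[0])  # first position: subst and prep (set order may vary)
print(xpos_as_sets(english))    # a list holding one set with NN and IN
```

Callers that only care about the main tag, as in the Polish case, would take element 0 regardless of whether the vocab is nested.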

AngledLuffa avatar Jul 07 '22 00:07 AngledLuffa

What about returning a list of sets for the nested tags? You can get the first set in that case. I suppose that brings up an interface question of deciding if a non-nested set of xpos tags should have a list of 1 set or just be the set itself.

I think it's enough the way it is now. At least for me, I have everything I needed. When the returned xpos tags are integers I know they are just nested tags. It's fine.

However, I still think that the behaviour for French:

stanza.Pipeline('fr').processors['pos'].get_known_xpos()

is undesirable.

k-sap avatar Aug 05 '22 08:08 k-sap

That is a data issue:

https://github.com/UniversalDependencies/UD_French-GSD/issues/14

AngledLuffa avatar Aug 05 '22 21:08 AngledLuffa

Hopefully something like this will make the composite vocabs (PL, for example) a bit more useful:

https://github.com/stanfordnlp/stanza/commit/36b84db71f19e37b36119e2ec63f89d1e509acb0

AngledLuffa avatar Aug 06 '22 02:08 AngledLuffa

I retrained the POS model now that the data issue is fixed in GSD. It no longer has the stray features as POS tags. You will need the dev branch to access this.

AngledLuffa avatar Aug 07 '22 05:08 AngledLuffa

What about the constituency parser? How can I get all available tags for the constituency parser?

MHDBST avatar Sep 08 '22 15:09 MHDBST

Easy peasy. It's on the dev branch now.

AngledLuffa avatar Sep 08 '22 20:09 AngledLuffa

This is now released.

AngledLuffa avatar Sep 30 '22 05:09 AngledLuffa