[QUESTION] Getting a tagset for a stanza model
I'd like to get the tagsets for all processors of a given model/language in Stanza, preferably programmatically, as in spaCy.
E.g. all NER tags, all dependency relations. (I know that NER tags are included in the documentation.) I need to know which labels Stanza can return, and to be able to refresh that list easily.
Is it possible to get such information for a specific Stanza model?
Programmatically, it is easiest to get the tags from the NER. Once the pipeline is created, you can do this:
pipe.processors['ner'].get_known_tags()
The depparse is possible. I'll add a function to make it a bit easier to access this.
pipe.processors['depparse'].vocab['deprel']._unit2id.keys()
Just to be clear, though, you cannot "update" any of those models easily... you can always retrain them with new labels or whatever, though.
Many thanks!
To sum up, all the vocabs I want to extract:
'ner', pipe.processors['ner'].get_known_tags()
'deprels', list(pipe.processors['depparse'].vocab['deprel']._unit2id.keys())
'upos', list(pipe.processors['pos'].vocab['upos']._unit2id.keys())
'xpos', list(pipe.processors['pos'].vocab['xpos']._unit2id.items())
'feats', list(pipe.processors['pos'].vocab['feats']._unit2id.items())
I also see that I need to omit the meaningless tags - <PAD>, <UNK>, <EMPTY>.
Thank you for extrapolating the other three on your own; this has to be one of the least painful support experiences I've had recently :)
- I have added convenience methods to the dev branch for pos_processor as well: https://github.com/stanfordnlp/stanza/commit/fd90257ab8def962e4622e500d5bde77a964a057
- There is also a <ROOT> "meaningless tag". You may simply want to eliminate everything from VOCAB_PREFIX, as in the change I just made.
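For anyone filtering by hand, here is a minimal sketch of dropping the special entries from a flat tag-to-id mapping. It does not import Stanza: `SPECIAL_TAGS` is a local, assumed copy of the four entries named in this thread (the ones VOCAB_PREFIX is said to cover), and `real_tags` is a hypothetical helper name, not part of the library.

```python
# Assumed local copy of the special entries named in this thread;
# in Stanza itself the corresponding list is referred to as VOCAB_PREFIX.
SPECIAL_TAGS = ("<PAD>", "<UNK>", "<EMPTY>", "<ROOT>")


def real_tags(unit2id):
    """Return the tags of a flat tag -> id mapping, minus the special entries."""
    return [tag for tag in unit2id if tag not in SPECIAL_TAGS]


# Toy mapping in the same shape as the `_unit2id` dicts discussed here.
sample = {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "nsubj": 4, "obj": 5}
print(real_tags(sample))  # ['nsubj', 'obj']
```

With a real pipeline, the same helper could presumably be applied to something like pipe.processors['depparse'].vocab['deprel']._unit2id, as in the snippets above.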
Thank you! For me, the code above is enough. However, I see two issues related to the auxiliary methods you added.
- get_known_feats does not return values for specific morph features
- get_known_xpos does not behave well e.g. for Polish because of the dictionary structure:
nlp = stanza.Pipeline('pl'); nlp.processors['pos'].vocab['xpos']._unit2id
{
0: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'subst': 4, 'prep': 5, 'fin': 6, 'conj': 7, 'interp': 8, 'adj': 9, 'num': 10, 'pcon': 11, 'ppron3': 12, 'adv': 13, 'pact': 14, 'part': 15, 'siebie': 16, 'ppas': 17, 'praet': 18, 'comp': 19, 'adja': 20, 'ger': 21, 'inf': 22, 'pred': 23, 'adjp': 24, 'aglt': 25, 'impt': 26, 'pant': 27, 'frag': 28, 'ppron12': 29, 'bedzie': 30, 'winien': 31, 'imps': 32, 'brev': 33, 'interj': 34, 'depr': 35, 'dig': 36, 'adjc': 37, 'emo': 38, 'ign': 39, 'romandig': 40},
1: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'sg': 4, 'loc': 5, 'inst': 6, 'pl': 7, 'acc': 8, 'gen': 9, 'imperf': 10, 'pos': 11, 'dat': 12, 'com': 13, 'perf': 14, 'pun': 15, 'sup': 16, 'npun': 17, 'nom': 18, 'wok': 19},
2: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'nom': 4, 'nwok': 5, 'loc': 6, 'inst': 7, 'ter': 8, 'acc': 9, 'gen': 10, 'wok': 11, 'f': 12, 'dat': 13, 'n': 14, 'm1': 15, 'm3': 16, 'pri': 17, 'sec': 18, 'm2': 19, 'voc': 20},
3: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'm1': 4, 'f': 5, 'm3': 6, 'imperf': 7, 'n': 8, 'm2': 9, 'perf': 10},
4: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'pos': 4, 'ncol': 5, 'rec': 6, 'ter': 7, 'imperf': 8, 'col': 9, 'congr': 10, 'perf': 11, 'pt': 12, 'com': 13, 'wok': 14, 'nwok': 15, 'pri': 16, 'sec': 17, 'sup': 18, 'nagl': 19, 'agl': 20},
5: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'akc': 4, 'aff': 5, 'col': 6, 'ncol': 7, 'nakc': 8, 'neg': 9},
6: {'<PAD>': 0, '<UNK>': 1, '<EMPTY>': 2, '<ROOT>': 3, 'praep': 4, 'npraep': 5}
}
I can work around both cases.
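One way to cope with the nested structure, sketched on an abbreviated copy of the Polish output above (positions 0 and 6 only, with the tag lists truncated). `tags_by_position` is a hypothetical helper, and reading position 0 as the main part-of-speech class is my interpretation of the data, not something the library documents here.

```python
SPECIAL_TAGS = {"<PAD>", "<UNK>", "<EMPTY>", "<ROOT>"}


def tags_by_position(nested_unit2id):
    """Map each position index to its tags, with special entries removed."""
    return {
        position: [tag for tag in mapping if tag not in SPECIAL_TAGS]
        for position, mapping in nested_unit2id.items()
    }


# Abbreviated copy of the nested Polish xpos mapping shown above.
polish_xpos = {
    0: {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "subst": 4, "prep": 5, "fin": 6},
    6: {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "praep": 4, "npraep": 5},
}

by_position = tags_by_position(polish_xpos)
main_tags = by_position[0]  # position 0 appears to hold the main tag class
print(main_tags)  # ['subst', 'prep', 'fin']
```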
get_known_feats does not return values for specific morph features
What would you like to see? Instead of
['Abbr', 'Case', 'Definite', 'Degree', 'ExtPos', 'Foreign', 'Gender', 'Mood', 'NumForm', 'NumType', 'Number', 'Person', 'Polarity', 'Poss', 'PronType', 'Reflex', 'Style', 'Tense', 'Typo', 'VerbForm', 'Voice']
you would want items like this?
Number=Sing
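To make the Feature=Value format concrete, here is a small sketch that expands a feature-to-values mapping into such strings. The input layout (one inner value-to-id dict per feature) and the sample data are assumptions for illustration only; the real feats vocabulary may be laid out differently, and `feature_pairs` is a hypothetical name.

```python
SPECIAL_TAGS = {"<PAD>", "<UNK>", "<EMPTY>", "<ROOT>"}


def feature_pairs(feats_unit2id):
    """Expand {feature: {value: id}} into sorted 'Feature=Value' strings."""
    return sorted(
        f"{feature}={value}"
        for feature, values in feats_unit2id.items()
        for value in values
        if value not in SPECIAL_TAGS
    )


# Hypothetical toy data; not taken from any real model.
sample_feats = {
    "Number": {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "Sing": 4, "Plur": 5},
    "Case": {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "Nom": 4},
}
print(feature_pairs(sample_feats))  # ['Case=Nom', 'Number=Plur', 'Number=Sing']
```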
get_known_xpos does not behave well e.g. for Polish because of the dictionary structure:
Sorry. Polish is not a language I work with much, and POS is not a model I work with much, so hopefully this oversight is understandable.
What would be useful in this case? I am not sure putting each of the maps into one big bucket would be valuable.
What would you like to see?
I added my version in #1073
What would be useful in this case? I am not sure putting each of the maps into one big bucket would be valuable.
For Polish I will need only the first mapping, although that is not a general solution.
The problem is that get_known_xpos now returns different kinds of results for different languages (instead of the specific tagset for each language):
import stanza

# Collect the known xpos tags for each language's default POS model.
xposes = dict()
for lang in ['en', 'fr', 'de', 'pl']:
    nlp = stanza.Pipeline(lang)
    xposes[lang] = nlp.processors['pos'].get_known_xpos()
results in:
{'en': ['NN', 'IN', 'DT', 'NNP', 'JJ', 'PRP', '.', 'RB', 'NNS', ',', 'VB', 'CC', 'VBD', 'VBP', 'VBZ', 'VBN', 'CD', 'VBG', 'TO', 'PRP$', 'MD', ':', '-RRB-', '-LRB-', 'WDT', '``', "''", 'WRB', 'UH', 'HYPH', 'RP', 'WP', 'POS', 'NNPS', 'JJR', 'JJS', 'RBR', 'EX', 'SYM', 'FW', 'NFP', 'PDT', 'ADD', 'GW', '$', 'RBS', 'LS', 'AFX', 'WP$', 'XX'],
'fr': ['Gender=Fem|Number=Plur', 'Gender=Masc|Number=Sing'],
'de': ['NN', 'ART', 'APPR', 'NE', 'ADJA', '$.', '$,', 'VVFIN', 'ADV', '$(', 'VAFIN', 'KON', 'CARD', 'ADJD', 'PPER', 'VVPP', 'PPOSAT', 'VVINF', 'PRELS', 'KOKOM', 'KOUS', 'PRF', 'PTKVZ', 'PAV', 'PIAT', 'VMFIN', 'PIS', 'PDAT', 'PTKZU', 'PTKNEG', 'FM', 'PDS', 'VAINF', 'TRUNC', 'PWAV', 'KOUI', 'VVIZU', 'XY', 'VAPP', 'PWS', 'PRELAT', 'VMINF', 'PTKA', 'APZR', 'VVIMP', 'APPO', 'PWAT', 'APPRART', 'ITJ', 'PTKANT', 'PPOSS'],
'pl': [0, 1, 2, 3, 4, 5, 6]}
German and English look good; French and Polish don't give tagsets. I understand why it happens for Polish. Integer results are fine.
However, for French I think that Stanza does not return xpos at all (dataset). I wonder where ['Gender=Fem|Number=Plur', 'Gender=Masc|Number=Sing'] comes from. I doubt it's a meaningful result.
For the French GSD model, there are no xpos tags in the dataset, so that is normal behavior. The features are labeled in that dataset and are predicted using the output from the upos tagger.
Thanks for the change regarding the features! I have integrated that.
What about returning a list of sets for the nested tags? You can get the first set in that case. I suppose that brings up an interface question of deciding if a non-nested set of xpos tags should have a list of 1 set or just be the set itself.
I think it's enough the way it is now. At least for me, I have everything I needed. When the returned xpos tags are integers I know they are just nested tags. It's fine.
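For reference, the list-of-sets idea raised above could look roughly like this. It is only a sketch of one possible interface, not what the library does: `as_list_of_sets` is a hypothetical helper, and it treats a mapping whose keys are all integers as nested, which mirrors the "integers mean nested tags" observation in this thread.

```python
def as_list_of_sets(unit2id):
    """Normalise a flat or nested tag mapping to a list of tag sets."""
    special = {"<PAD>", "<UNK>", "<EMPTY>", "<ROOT>"}
    # Nested vocabularies are keyed by integer position, flat ones by tag string.
    is_nested = bool(unit2id) and all(isinstance(key, int) for key in unit2id)
    mappings = [unit2id[key] for key in sorted(unit2id)] if is_nested else [unit2id]
    return [set(mapping) - special for mapping in mappings]


# Toy data in the two shapes seen in this thread.
flat = {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "NN": 4, "IN": 5}
nested = {
    0: {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "subst": 4},
    1: {"<PAD>": 0, "<UNK>": 1, "<EMPTY>": 2, "<ROOT>": 3, "sg": 4, "pl": 5},
}
flat_sets = as_list_of_sets(flat)      # a list holding one set
nested_sets = as_list_of_sets(nested)  # one set per position
```

Always returning a list keeps the calling code identical for both cases, at the cost of an extra [0] for the flat tagsets.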
However, I still think that behaviour for French:
stanza.Pipeline('fr').processors['pos'].get_known_xpos()
is undesirable.
That is a data issue:
https://github.com/UniversalDependencies/UD_French-GSD/issues/14
Hopefully something like this will make the composite vocabs (PL for example) a bit more useful
https://github.com/stanfordnlp/stanza/commit/36b84db71f19e37b36119e2ec63f89d1e509acb0
I retrained the POS model now that the data issue is fixed in GSD. It no longer has the stray features as POS tags. You will need the dev branch to access this.
What about the constituency parser? How can I get all available tags for the constituency parser?
Easy peasy. It's on the dev branch now
This is now released.