
POS tags generated from en_core_web_sm differ from the Universal POS tag set.

Open David-hg opened this issue 1 year ago • 1 comments

How to reproduce the behaviour

I'm trying to get the part of speech of the tokens in some sentences to use in an ML model. According to the documentation, the possible values should be the same as in the Universal POS tag set. Because of model restrictions, POS must be encoded as integers, so I checked which values can be produced. After processing some sentences with the en_core_web_sm model, I found a tag that isn't in the list: 'SPACE'. I got it when processing the sentence 'L-53 Now, we don't want to spend too much time on each one.'. This contradicts the docs. Is there a complete list of all the POS values that this model can produce?

With the code below the problem can be reproduced.

import spacy
nlp = spacy.load("en_core_web_sm")
sentence_processed = nlp('L-53 Now, we don\'t want to spend too much time on each one.')
print([x.pos_ for x in sentence_processed])

Your Environment

  • spaCy version: 3.4.0
  • Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.13
  • Pipelines: en_core_web_sm (3.4.0)

David-hg avatar Jul 28 '22 12:07 David-hg

All the possible tags are in spacy.parts_of_speech.IDS.

You're right that this isn't documented well on the token attributes page. Since spacy handles any kind of input text and includes token representations for all characters in the text including whitespace, and since UD doesn't have any specification for this, spacy has always had the extra POS SPACE. It's usually just for space tokens like \t or \n\n. The best match in UPOS is probably X, but since users often want to distinguish space tokens from non-space tokens, spacy has used SPACE.
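For the integer encoding asked about in the question, one option is to write the label set out by hand: the 17 UPOS tags plus spaCy's extra SPACE. This is only a minimal sketch. The names UPOS_TAGS, POS_TO_ID and encode_pos are made up for illustration, and falling back to X for unknown labels is an assumption, not spaCy behaviour.

```python
# The 17 Universal POS tags, followed by spaCy's extra SPACE label.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]
LABELS = UPOS_TAGS + ["SPACE"]

# Stable label -> integer mapping (hypothetical helper, not part of spaCy).
POS_TO_ID = {label: i for i, label in enumerate(LABELS)}


def encode_pos(pos_labels):
    """Map POS strings to integers; unknown labels fall back to X (assumption)."""
    fallback = POS_TO_ID["X"]
    return [POS_TO_ID.get(label, fallback) for label in pos_labels]


print(encode_pos(["SPACE", "ADV", "PUNCT"]))
```

With a processed Doc, the input would be something like [t.pos_ for t in doc].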

All the POS labels that the en_core_web_* models can assign are part of the rules in the attribute_ruler component. You can see all the rules with nlp.get_pipe("attribute_ruler").patterns and look at the POS values in the attrs.
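As a rough sketch of that lookup, the helper below collects the POS values from a list of rules shaped like the ones the attribute_ruler returns. The function name pos_values is hypothetical, and the two example rules are the SPACE rules quoted in this thread, standing in for the full pattern list.

```python
def pos_values(patterns):
    """Collect every POS value assigned by a list of attribute_ruler-style rules."""
    return sorted(
        {rule["attrs"]["POS"] for rule in patterns if "POS" in rule.get("attrs", {})}
    )


# Two rules copied from this thread; a real model has many more.
example_rules = [
    {"patterns": [[{"TAG": "_SP"}]],
     "attrs": {"POS": "SPACE", "MORPH": "_"}, "index": 0},
    {"patterns": [[{"IS_SPACE": True}]],
     "attrs": {"TAG": "_SP", "POS": "SPACE", "MORPH": "_"}, "index": 0},
]

print(pos_values(example_rules))
```

With the pipeline loaded, pos_values(nlp.get_pipe("attribute_ruler").patterns) should give the labels for that model.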

This is actually an interesting case where the tagger has made an error. In v3.4.0 we added more general whitespace augmentation to all the models, so it's possible for the tagger to predict a space tag for any token. However, this is kind of surprising and I'll see if there's a more systematic error going on.

This is easy to update if you'd like, since it's just a rule-based conversion from tag->pos. There are two rules in the attribute_ruler component that lead to SPACE:

{'patterns': [[{'TAG': '_SP'}]], 'attrs': {'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}

{'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}

If you change SPACE to X, then the tags should all be valid UPOS tags.
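One way to make that change is to rewrite the rule dicts before handing them back to the component. The sketch below only does the rewriting; replace_pos is a hypothetical helper name, and the two rules are the ones shown above.

```python
import copy


def replace_pos(patterns, old="SPACE", new="X"):
    """Return a copy of the rules with every POS value `old` replaced by `new`."""
    updated = copy.deepcopy(patterns)  # leave the original rules untouched
    for rule in updated:
        if rule.get("attrs", {}).get("POS") == old:
            rule["attrs"]["POS"] = new
    return updated


space_rules = [
    {"patterns": [[{"TAG": "_SP"}]],
     "attrs": {"POS": "SPACE", "MORPH": "_"}, "index": 0},
    {"patterns": [[{"IS_SPACE": True}]],
     "attrs": {"TAG": "_SP", "POS": "SPACE", "MORPH": "_"}, "index": 0},
]

for rule in replace_pos(space_rules):
    print(rule["attrs"])
```

To apply this to a loaded pipeline, the rewritten list would have to replace the existing rules, for example by resetting the component with its clear() method and re-adding the rules with add_patterns(). That step is not shown here and has not been checked against this model version.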

I think it would be useful for us to update the rules in the future so that they use X instead of SPACE for errors like this, so rules more like:

{'patterns': [[{'TAG': '_SP', 'IS_SPACE': True}]], 'attrs': {'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}

{'patterns': [[{'TAG': '_SP', 'IS_SPACE': False}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}

{'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}
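The net effect of these three rules on a single token can be sketched as a small function. This is an illustration of the intended outcome only, not how the attribute_ruler is implemented; proposed_pos and its parameters are hypothetical names.

```python
def proposed_pos(tag, is_space, current_pos):
    """POS a token would end up with under the three proposed rules (sketch)."""
    if is_space:
        # Real whitespace tokens keep the extra SPACE label.
        return "SPACE"
    if tag == "_SP":
        # The tagger predicted a space tag for a non-space token: use X.
        return "X"
    # The proposed rules do not touch any other token.
    return current_pos


print(proposed_pos("_SP", True, "SPACE"))   # whitespace token
print(proposed_pos("_SP", False, "SPACE"))  # tagger error on a non-space token
print(proposed_pos("RB", False, "ADV"))     # ordinary token
```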

adrianeboyd avatar Aug 02 '22 07:08 adrianeboyd

I've added rules to convert SPACE -> X for non-space tokens that we will tentatively plan to use in the v3.5.0 model releases.

adrianeboyd avatar Sep 12 '22 12:09 adrianeboyd

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] avatar Sep 20 '22 00:09 github-actions[bot]

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Oct 21 '22 00:10 github-actions[bot]