
Add way to lower in rule_lemmatizer

Open jademlc opened this issue 3 years ago • 6 comments

The special lemmatization will lowercase the whole text, even proper nouns

Description

Added a function to define a special_lemmatization that returns text.lower() even when the token is detected as a proper noun. This change is useful when using spaCy Matchers with noisy texts containing a lot of uppercase tokens. This was discussed in https://github.com/explosion/spaCy/discussions/11051. Using the special lemmatizer, the lemma of 'CAT' will be 'cat' instead of 'CAT'.

Types of change

This change is an enhancement of the lemmatizer.

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [x] I ran the tests, and all new and existing tests passed.
  • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

jademlc avatar Sep 07 '22 14:09 jademlc

This could be very useful! Thank you. I hope it will get merged soon.

databill86 avatar Sep 08 '22 06:09 databill86

Very interesting, and should be merged after passing all tests.

hetpin avatar Sep 08 '22 09:09 hetpin

There are a number of ways to approach this, so it's useful to hear from everyone who is interested in this!

Can I ask some questions about what kind of output you're looking for?

Currently, you have situations like this where tagging inconsistencies between common and proper nouns make it hard to match on lemmas:

cat/NOUN/Number=Sing -> cat
cats/NOUN/Number=Plur -> cat
Cat/NOUN/Number=Sing -> cat
Cat/PROPN/Number=Sing -> Cat
Cats/NOUN/Number=Plur -> cat
Cats/PROPN/Number=Plur -> Cats
CAT/NOUN/Number=Sing -> cat
CAT/PROPN/Number=Sing -> CAT
CATS/NOUN/Number=Plur -> cat
CATS/PROPN/Number=Plur -> CATS
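
As a small sketch, a few of these rows can be reproduced with hand-built docs, so the POS and morphology are exactly the values in question (this assumes en_core_web_sm is installed):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")

# build docs by hand so the POS/morph match the rows above exactly
for word, pos, morph in [
    ("Cat", "NOUN", "Number=Sing"),
    ("Cat", "PROPN", "Number=Sing"),
    ("CATS", "PROPN", "Number=Plur"),
]:
    doc = Doc(nlp.vocab, words=[word], pos=[pos], morphs=[morph])
    lemmatizer(doc)
    print(f"{word}/{pos}/{morph} -> {doc[0].lemma_}")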

This proposal would lowercase the uppercase lemmas for proper nouns but not make any other changes, so you'd end up with:

Cat/PROPN/Number=Sing -> cat
Cats/PROPN/Number=Plur -> cats
CAT/PROPN/Number=Sing -> cat
CATS/PROPN/Number=Plur -> cats

That would still mean you'd be looking at matching both cat and cats as lemmas in some cases.

I could imagine options that both lowercase the lemmas and treat proper nouns as common nouns to get cat in all cases, but it could be that this isn't what anyone is actually looking for? There are obviously tricky cases like "Cats are soft. Cats is a musical.", since the tagger is unlikely to be very accurate there.

I'm interested to hear more details about the practical, hands-on problems you're running into with matching on lemmas.

adrianeboyd avatar Sep 08 '22 15:09 adrianeboyd

Thanks for your answer.

We've proposed this solution to decrease the number of potential lemmas we would need to put in our Matcher patterns, so that instead of multiple forms we would only need to add two. We are dealing with a lot of noisy data that contains a lot of uppercase text even where it isn't needed, and we need Matchers that work on it with a minimum of patterns. That's why treating proper nouns as common nouns would be particularly useful for us. If that is possible, we would love to hear how to do it. We are aware that this could cause issues like the one you described, but we would rather have those issues than fail to match something we should.
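
To make the motivation concrete, here is an illustrative sketch of the kind of Matcher pattern involved (the text and pattern are made up for this example, and it assumes en_core_web_sm is installed): with the default lemmatizer, proper-noun lemmas keep their original casing, so matching noisy uppercase text on lemmas means listing several forms.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# with the default lemmatizer, tokens tagged PROPN keep their casing in the
# lemma, so noisy uppercase text needs extra lemma forms in the pattern
matcher.add("CAT", [[{"LEMMA": {"IN": ["cat", "cats", "Cat", "Cats", "CAT", "CATS"]}}]])
# with lowercased lemmas, two forms would be enough:
# matcher.add("CAT", [[{"LEMMA": {"IN": ["cat", "cats"]}}]])

doc = nlp("I SAW THE CAT AND THE CATS")
print([doc[start:end].text for match_id, start, end in matcher(doc)])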

jademlc avatar Sep 09 '22 08:09 jademlc

If you want to treat proper nouns as nouns and you don't care whether you preserve the original POS tags in the doc, you can modify the POS before lemmatization by adding rules in the attribute ruler.

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
ruler = nlp.get_pipe("attribute_ruler")
patterns = [[{"TAG": {"IN": ["NNP", "NNPS"]}}]]
attrs = {"POS": "NOUN"}
ruler.add(patterns=patterns, attrs=attrs)
doc = Doc(nlp.vocab, words=["cats"], tags=["NNPS"])
print([(t.tag_, t.morph, t.pos_, t.lemma_) for t in nlp(doc)])

To be honest, the singular/plural distinction in the trained models for all-caps tokens is not great, so I couldn't come up with a good short example that the model actually tags as NNPS to show that this works; that's why I created a doc by hand above. But by converting PROPN to NOUN you should at least get lowercase lemmas everywhere, even if they include both cat and cats.

A kind of brute force plural treatment for PROPN tokens that end in s could look like this:

nlp = spacy.load("en_core_web_sm")
ruler = nlp.get_pipe("attribute_ruler")
# mark all proper nouns that end in s as plural common nouns
patterns = [[{"LOWER": {"REGEX": "s$"}, "TAG": {"IN": ["NNP", "NNPS"]}}]]
attrs = {"POS": "NOUN", "MORPH": "Number=Plur"}
ruler.add(patterns=patterns, attrs=attrs)
# mark any remaining proper nouns as common nouns
patterns = [[{"TAG": {"IN": ["NNP", "NNPS"]}}]]
attrs = {"POS": "NOUN"}
ruler.add(patterns=patterns, attrs=attrs)
doc = Doc(nlp.vocab, words=["cats"], tags=["NNPS"])
print([(t.tag_, t.morph, t.pos_, t.lemma_) for t in nlp(doc)])

adrianeboyd avatar Sep 13 '22 08:09 adrianeboyd

Thanks for the different options.

We tested them, but unfortunately they won't be the best option for our use case, as we need the POS tag information for proper nouns. We thought it would be possible to have something in the code that treats proper nouns the same as common nouns without changing the POS tag.

In conclusion, the best solution for us is to change the lemmatizer so that proper nouns are lowercased, which lets us catch them with our Matcher patterns.

jademlc avatar Sep 20 '22 10:09 jademlc

Thanks for the details! In that case, a custom lemmatizer is still an option, but another quick solution is to add a small custom component after the lemmatizer that just lowercases all lemmas:

import spacy
from spacy import Language
from spacy.tokens import Doc


@Language.component("lowercase_lemmas")
def lowercase_lemmas(doc: Doc) -> Doc:
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc


nlp = spacy.load("en_core_web_sm")
print([(t.pos_, t.lemma_) for t in nlp("ABBR Abbr abbr")])
nlp.add_pipe("lowercase_lemmas")
print([(t.pos_, t.lemma_) for t in nlp("ABBR Abbr abbr")])

We discussed this proposal internally and decided that we'd prefer not to add this as an additional lemmatizer mode in the core library. There are just so many possible options and preferences here that it becomes unwieldy if we try to support all of them in the core library. Our suggestion is to rely on custom components, like a custom lemmatizer or the small custom component above, for tasks like this.

If you do choose to implement a custom lemmatizer mode, I think you'd also need to add the tables for the new mode to get_lookups_config. (Sorry, the API is indeed a bit clunky, but I was trying to make it possible to add new modes without touching the initialization code or the to/from_bytes/disk methods.)
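
For anyone who does go that route, here is a rough sketch of the shape this could take, with a hypothetical "lower_rule" mode that reuses the same lookup tables as the built-in "rule" mode. The class and mode names are made up, the factory registration needed to plug it into a pipeline is omitted, and for English you would probably subclass spacy.lang.en.lemmatizer.EnglishLemmatizer instead so that is_base_form keeps working.

from typing import List, Tuple

from spacy.pipeline import Lemmatizer
from spacy.tokens import Token


class LowercasingLemmatizer(Lemmatizer):
    @classmethod
    def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
        if mode == "lower_rule":
            # hypothetical mode: reuse the tables of the built-in "rule" mode
            return (["lemma_rules"], ["lemma_exc", "lemma_index"])
        return super().get_lookups_config(mode)

    def lower_rule_lemmatize(self, token: Token) -> List[str]:
        # apply the standard rules, then lowercase the result, so a PROPN
        # lemma like "CAT" comes out as "cat"
        return [lemma.lower() for lemma in self.rule_lemmatize(token)]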

Thanks again for the PR! We also have a detailed lemmatizer FAQ coming soon so the customization options here will be easier to figure out.

adrianeboyd avatar Sep 29 '22 12:09 adrianeboyd

The lemmatizer FAQ we mentioned before is now available: https://github.com/explosion/spaCy/discussions/11685

We think the options described there should be a good starting point for most customizations, but if you still can't figure out how something works, feel free to open a discussion about how we can improve things.

polm avatar Oct 21 '22 04:10 polm