spaCy
German adjectives ending in `-e` are not lemmatized using the lookup lemmatizer
How to reproduce the behaviour
```python
import spacy

nlp = spacy.load('de')

s1 = 'Der schöne Garten'
doc = nlp(s1)
[(t, t.lemma_) for t in doc]
# >> [(Der, 'der'), (schöne, 'schöne'), (Garten, 'Garten')]

s2 = 'Ein schöner Garten'
doc = nlp(s2)
[(t, t.lemma_) for t in doc]
# >> [(Ein, 'Ein'), (schöner, 'schön'), (Garten, 'Garten')]
```
My Environment
- spaCy version: 2.2.2
- Platform: Linux-5.0.0-25-generic-x86_64-with-LinuxMint-19.2-tina
- Python version: 3.6.7
- Models: de
Reason
As far as I can see, all forms of German adjectives ending in `-e` in `spacy-lookups-data/spacy_lookups_data/data/de_lemma_lookup.json` are capitalized, e.g.:
```json
"Dekorative": "dekorativ",
"Weiße": "Weiß",
"Schöne": "Schönes",
```
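As a quick illustration (using a hypothetical two-entry table, not the real file), a case-sensitive lookup like this misses the lower-cased inflected forms entirely:

```python
# Hypothetical excerpt: only capitalized adjective forms are present,
# mirroring the kind of entries quoted above.
lemma_lookup = {
    "Dekorative": "dekorativ",
    "Schöne": "schön",
}

def lookup_lemma(token_text):
    # A lookup lemmatizer returns the surface form unchanged when the
    # exact, case-sensitive key is missing from the table.
    return lemma_lookup.get(token_text, token_text)

print(lookup_lemma("Schöne"))  # capitalized key present: 'schön'
print(lookup_lemma("schöne"))  # lower-cased form missing: stays 'schöne'
```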
The lookup tables, while sometimes better than nothing, are pretty terrible. They don't take any context into account and are very unpredictable / brittle. Many adjectives ending in -e are there, so it's all kind of strange. I'd recommend an alternate lemmatizer for German for now, see #2668 for some suggestions.
Hi @adrianeboyd, I've started on some tests today for a rule-based lemmatizer and would like to propose a PR soon. Will we still maintain the lookup table afterwards? Do the lookup tables have precedence over the rule-based lemmatization? Or would all words that are already covered by a rule be removed from the lookup table to make it smaller?
A PR for this would be great! You might want to get in touch with Guadalupe Romero (@guadi1994), who has started working on this for Spanish and German.
The rule-based lemmatizer requires tags from the tagger, so the lookup table is used as a backup when no tags are available. The rules should have precedence over the table, and I think that if there are rules, the lookup table is not used at all, but I might be mistaken.
Since it's used as a backup, it would probably make sense to fix some of the really weird closed-class errors in the table, like "er" -> "ich". (We do have plans to add statistical models for morphology and lemmatization, which could hopefully replace all of this, but it's all still in progress.)
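The precedence described here can be sketched roughly as follows; this is illustrative pseudologic only, not spaCy's actual implementation, and the rule set and table below are hypothetical:

```python
def lemmatize(token_text, tag=None, rules=None, lookup=None):
    """Illustrative precedence only: suffix rules need a POS tag and
    win when one applies; the lookup table is the no-tag fallback."""
    rules = rules or {}
    lookup = lookup or {}
    if tag is not None:
        # Rule-based path: try tag-specific suffix rewrites first.
        for old_suffix, new_suffix in rules.get(tag, []):
            if token_text.endswith(old_suffix):
                return token_text[: len(token_text) - len(old_suffix)] + new_suffix
    # No usable rule (or no tag): fall back to the lookup table,
    # and failing that, to the surface form itself.
    return lookup.get(token_text, token_text)

# Hypothetical rule set and table for demonstration.
adj_rules = {"ADJA": [("er", ""), ("e", "")]}
table = {"Schöne": "schön"}

print(lemmatize("schöner", tag="ADJA", rules=adj_rules, lookup=table))  # 'schön'
print(lemmatize("schöner", rules=adj_rules, lookup=table))  # no tag: 'schöner'
```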
You are right, the lookup is ignored as soon as there are rules. That means I can't start with a few rules and enhance them gradually; I'd have to develop the full rule set and add all exceptions (and especially for the nouns, there will be many) to the exceptions list. I'd also have to write an extra lemmatizing method, because the standard lemmatize method changes all nouns to lower case, which won't work for German. I won't be able to do that in the near future, but I'll try to fix the worst errors in the lookup table.
I'd like to take care of this issue if someone can let me know the following:
- Is this still an issue?
- Where is the file referenced in the initial comment?

Best regards
I don't think this has been addressed yet. The data is in this repo if you want to have a look at it.
Let me also link in this more recent issue about German lemmas: https://github.com/explosion/spaCy/issues/9799
Okay, I can't find the issue mentioned in this thread, at least. The file has also been updated since this issue was opened.
Is there any way to confirm whether this issue is still up to date? It appears that it can be closed, but I can't tell for sure.
Okay, I think that given https://explosion.ai/blog/edit-tree-lemmatizer we could close this task; at the least, additional work would not make much sense if lookup tables can be avoided.
Yes, we're hoping to be able to include the edit tree lemmatizer in an upcoming release (probably v3.3). There are still cases where a lookup table can make sense, so we don't necessarily want to abandon all related issues. For most users, additional work on the lookup table wouldn't make sense right now.
Sorry for my late reply. I had not continued with the rule-based lemmatizer for German because I was informed that ML lemmatizers were coming soon. If anybody is interested, here are the rules, although a lot of exceptions are still missing:
https://github.com/SuzanaK/spacy-lookups-data/commit/0ee4083a1609f1dd96ee41907c1d398c09dd52f3
Testing with the latest spaCy release in a new venv, this may have been fixed:
Setup:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download de_core_news_sm
```
Test:

```python
import spacy

nlp = spacy.load('de_core_news_sm')

def print_toks(sentence):
    print(f"\n{sentence}:")
    doc = nlp(sentence)
    print([(t, t.lemma_) for t in doc])

print_toks('Der schöne Garten')
print_toks('Ein schöner Garten')
```
Gives:

```
Der schöne Garten:
[(Der, 'der'), (schöne, 'schön'), (Garten, 'Garten')]

Ein schöner Garten:
[(Ein, 'ein'), (schöner, 'schön'), (Garten, 'Garten')]
```
- spaCy version: 3.6.1
- Platform: Apple M2 Pro, macOS Ventura 13.4.1 (22F82)
- Python version: 3.11.3
- Models: de_core_news_sm
If anyone wants to test this out with other sentences, a better script is included in issue 10953, or you can drop the sentences here (marking the words you want to check with "**" before and after, e.g. "Der **schöne** Garten"). 👋
spacy v3.3+ switches a number of languages to the trainable edit tree lemmatizer, so the default lemmatizer output will be different than what was discussed in the original post.
In general, some forms will be better than the lookup lemmatizer (probably most adjectives) and some will be worse (2nd person verbs that are rare in the training data). You may need to evaluate both for your task to see which is more suitable, or still consider third-party lemmatizers.
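If it helps, such an evaluation could look roughly like the sketch below; the gold pairs are made up for illustration, and in practice you would take them from a treebank or annotated data in your domain:

```python
def lemma_accuracy(lemmatize, gold_pairs):
    # Fraction of (surface form, gold lemma) pairs the lemmatizer gets right.
    correct = sum(1 for form, gold in gold_pairs if lemmatize(form) == gold)
    return correct / len(gold_pairs)

# Made-up gold pairs for illustration only.
gold = [("schöne", "schön"), ("gehst", "gehen"), ("Garten", "Garten")]

# Stand-in "lemmatizer" that leaves every form unchanged; swap in
# calls to the pipelines you actually want to compare.
identity = lambda form: form
print(lemma_accuracy(identity, gold))  # only 'Garten' matches: 1/3
```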
The German lookup tables in spacy-lookups-data haven't been improved (they're still kind of terrible), but to clarify this issue I'll update the issue title.