German adjectives ending in `-e` are not lemmatized by the lookup lemmatizer

Open SuzanaK opened this issue 6 years ago • 14 comments

How to reproduce the behaviour

import spacy

nlp = spacy.load('de')

s1 = 'Der schöne Garten'
doc = nlp(s1)
print([(t, t.lemma_) for t in doc])
# >> [(Der, 'der'), (schöne, 'schöne'), (Garten, 'Garten')]

s2 = 'Ein schöner Garten'
doc = nlp(s2)
print([(t, t.lemma_) for t in doc])
# >> [(Ein, 'Ein'), (schöner, 'schön'), (Garten, 'Garten')]

My Environment

  • spaCy version: 2.2.2
  • Platform: Linux-5.0.0-25-generic-x86_64-with-LinuxMint-19.2-tina
  • Python version: 3.6.7
  • Models: de

Reason

As far as I can see, all forms of German adjectives ending in -e in spacy-lookups-data/spacy_lookups_data/data/de_lemma_lookup.json are capitalized, e.g.:

"Dekorative": "dekorativ",
"Weiße": "Weiß",
"Schöne": "Schönes",
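
A minimal sketch of why this would produce the output above, assuming the lookup lemmatizer does an exact, case-sensitive dictionary match and returns the token unchanged on a miss. The table is a toy excerpt: the three capitalized entries are the ones quoted above, while the `Der` and `schöner` entries are assumptions inferred from the reproduced output, not checked against the real file.

```python
# Toy excerpt for illustration only, not the real de_lemma_lookup.json.
TOY_LOOKUP = {
    "Dekorative": "dekorativ",
    "Weiße": "Weiß",
    "Schöne": "Schönes",
    "Der": "der",        # assumed entry, inferred from the output above
    "schöner": "schön",  # assumed entry, inferred from the output above
}


def lookup_lemma(token, table):
    """Exact, case-sensitive lookup; fall back to the token itself."""
    return table.get(token, token)


def lemmatize(sentence, table=TOY_LOOKUP):
    """Whitespace-tokenize and look up each token independently."""
    return [(tok, lookup_lemma(tok, table)) for tok in sentence.split()]


# Lowercase "schöne" misses because only "Schöne" is a key.
print(lemmatize("Der schöne Garten"))
print(lemmatize("Ein schöner Garten"))
```

With these assumptions, lowercase `schöne` is returned unchanged because only the capitalized key exists, while `schöner` is found.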

SuzanaK avatar Nov 11 '19 10:11 SuzanaK

The lookup tables, while sometimes better than nothing, are pretty terrible. They don't take any context into account and are very unpredictable / brittle. Many adjectives ending in -e are there, so it's all kind of strange. I'd recommend an alternate lemmatizer for German for now, see #2668 for some suggestions.

adrianeboyd avatar Nov 11 '19 14:11 adrianeboyd

Hi @adrianeboyd, I started some tests today for a rule-based lemmatizer and would like to propose a PR soon. Will the lookup table still be maintained afterwards? Does the lookup table take precedence over rule-based lemmatization? Or would all words that are already covered by a rule be removed from the lookup table to make it smaller?

SuzanaK avatar Nov 11 '19 16:11 SuzanaK

A PR for this would be great! You might want to get in touch with Guadalupe Romero (@guadi1994), who has started working on this for Spanish and German.

The rule-based lemmatizer requires tags from the tagger, so the lookup table serves as a backup when no tags are available. The rules should have precedence over the table, and I think that if there are rules, the lookup table is not used at all, but I might be mistaken.

Since it's used as a backup, it would probably make sense to fix some of the really weird closed-class errors in the table, like "er" -> "ich". (We do have plans to add statistical models for morphology and lemmatization, which could hopefully replace all of this, but it's all still in progress.)

adrianeboyd avatar Nov 12 '19 10:11 adrianeboyd

You are right, the lookup is ignored as soon as there are rules. That means I can't start with a few rules and improve them gradually; I have to develop a full rule set and add all exceptions (and especially for nouns there will be many) to the exceptions list. I'd also have to write a separate lemmatizing method, because the standard lemmatize method would change all nouns to lower case, which won't work for German. I won't be able to do that in the near future, but I'll try to fix the worst errors in the lookup table.
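
To make the constraints above concrete, here is a rough sketch of what a rule-based step with an exceptions list and noun-case handling could look like. Everything in it is hypothetical: the suffix rules, the exception entry, and the function name are made up for illustration and are neither spaCy's lemmatizer API nor the rules from any PR.

```python
# Hypothetical adjective suffix rules, tried in order (longest first).
ADJ_SUFFIXES = ["er", "es", "en", "em", "e"]

# Hypothetical exceptions list, keyed by coarse part-of-speech tag.
EXCEPTIONS = {
    "ADJ": {"teure": "teuer"},
}


def rule_lemma(token, pos):
    """Return a lemma for `token` given its coarse POS tag."""
    # Exceptions are checked first so irregular forms bypass the rules.
    exceptions = EXCEPTIONS.get(pos, {})
    if token in exceptions:
        return exceptions[token]
    # German nouns are capitalized, so they must not be lowercased.
    if pos == "NOUN":
        return token
    form = token.lower()
    if pos == "ADJ":
        for suffix in ADJ_SUFFIXES:
            # Keep at least three characters of stem to avoid over-stripping.
            if form.endswith(suffix) and len(form) > len(suffix) + 2:
                return form[: -len(suffix)]
    return form


for word, tag in [("Der", "DET"), ("schöne", "ADJ"), ("Garten", "NOUN")]:
    print(word, "->", rule_lemma(word, tag))
```

Without a lookup fallback, any form the suffix rules get wrong has to be listed in the exceptions, which is why the exceptions list grows quickly.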

SuzanaK avatar Nov 18 '19 08:11 SuzanaK

Could someone let me know the following?

Is this still an issue? And where is the file referenced in the initial comment?

If so, I'd like to take care of this issue.

Best regards

EBoiSha avatar Jan 03 '22 20:01 EBoiSha

I don't think this has been addressed yet. The data is in this repo if you want to have a look at it.

polm avatar Jan 04 '22 04:01 polm

Let me also link in this more recent issue about German lemmas: https://github.com/explosion/spaCy/issues/9799

polm avatar Jan 04 '22 05:01 polm

Okay, I can't find the problem described in this thread in the data. The file has also been updated since this issue was opened.

Is there any way to confirm whether this issue is still current? It looks like it can be closed, but I can't tell for sure.

EBoiSha avatar Jan 18 '22 21:01 EBoiSha

Okay, I think that given https://explosion.ai/blog/edit-tree-lemmatizer we could close this task, or at least that additional work would not make much sense if lookup tables can be avoided.

EBoiSha avatar Jan 18 '22 22:01 EBoiSha

Yes, we're hoping to be able to include the edit tree lemmatizer in an upcoming release (probably v3.3). There are still cases where a lookup table can make sense, so we don't necessarily want to abandon all related issues. For most users, additional work on the lookup table wouldn't make sense right now.

adrianeboyd avatar Jan 31 '22 08:01 adrianeboyd

Sorry for my late reply. I did not continue working on the rule-based lemmatizer for German because I was informed that ML lemmatizers are coming soon. If anybody is interested, here are the rules, but a lot of exceptions are still missing:

https://github.com/SuzanaK/spacy-lookups-data/commit/0ee4083a1609f1dd96ee41907c1d398c09dd52f3

SuzanaK avatar Feb 08 '22 13:02 SuzanaK

Testing with the latest spaCy release in a new venv, this may have been fixed:

Setup:

python3 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download de_core_news_sm

Test:

import spacy
nlp = spacy.load('de_core_news_sm')

def print_toks(sentence):
    print(f"\n{sentence}:")
    doc = nlp(sentence)
    print([(t, t.lemma_) for t in doc])
    
print_toks('Der schöne Garten')
print_toks('Ein schöner Garten')

Gives

Der schöne Garten:
[(Der, 'der'), (schöne, 'schön'), (Garten, 'Garten')]

Ein schöner Garten:
[(Ein, 'ein'), (schöner, 'schön'), (Garten, 'Garten')]

  • spaCy version: spacy==3.6.1
  • Platform: Apple M2 Pro Ventura 13.4.1 (22F82)
  • Python version: 3.11.3
  • Models: de_core_news_sm

jzohrab avatar Oct 01 '23 17:10 jzohrab

If anyone wants to test this out with other sentences, a better script is included in issue 10953, or you can drop the sentences here (marking the words you want to check with "**" before and after, e.g. "Der **schöne** Garten"). 👋
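
For anyone collecting sentences in that format, a small sketch of how the "**" markers could be parsed into a plain sentence plus the list of words to check. This is not the script from issue 10953; the function name and return shape are assumptions made for illustration.

```python
import re

# Non-greedy match of anything wrapped in a pair of "**" markers.
MARKED = re.compile(r"\*\*(.+?)\*\*")


def split_marked(sentence):
    """Return (sentence without markers, list of marked words)."""
    words = MARKED.findall(sentence)
    plain = MARKED.sub(r"\1", sentence)
    return plain, words


plain, words = split_marked("Der **schöne** Garten")
print(plain)
print(words)
```

The plain sentence can then be passed to the pipeline, and only the tokens whose text appears in the marked list need to be inspected.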

jzohrab avatar Oct 01 '23 21:10 jzohrab

spaCy v3.3+ switches a number of languages to the trainable edit tree lemmatizer, so the default lemmatizer output will be different from what was discussed in the original post.

In general, some forms will be lemmatized better than with the lookup lemmatizer (probably most adjectives) and some will be worse (2nd person verbs that are rare in the training data). You may need to evaluate both for your task to see which is more suitable, or still consider third-party lemmatizers.
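
One way to run such a comparison is to score each lemmatizer against a small gold list from your own data. The sketch below is an assumption-laden illustration: the gold pairs and the two stand-in lemmatizers are invented toy data, not real output from either spaCy lemmatizer, and `evaluate` is a hypothetical helper name.

```python
def evaluate(lemmatize, gold):
    """Score a word -> lemma callable against (word, expected_lemma) pairs.

    Returns (accuracy, misses), where each miss is
    (word, predicted_lemma, expected_lemma).
    """
    misses = []
    for word, expected in gold:
        predicted = lemmatize(word)
        if predicted != expected:
            misses.append((word, predicted, expected))
    accuracy = (len(gold) - len(misses)) / len(gold) if gold else 0.0
    return accuracy, misses


# Invented gold pairs; replace with examples from your own task.
GOLD = [
    ("schöne", "schön"),
    ("schöner", "schön"),
    ("Garten", "Garten"),
    ("gehst", "gehen"),
]

# Toy stand-ins for the two lemmatizers being compared.
TABLE_A = {"schöner": "schön", "gehst": "gehen"}
TABLE_B = {"schöne": "schön", "schöner": "schön"}


def lemmatizer_a(word):
    return TABLE_A.get(word, word)


def lemmatizer_b(word):
    return TABLE_B.get(word, word)


for name, fn in [("A", lemmatizer_a), ("B", lemmatizer_b)]:
    accuracy, misses = evaluate(fn, GOLD)
    print(name, accuracy, misses)
```

In practice the two callables would wrap whichever pipelines are being compared. Looking at the misses rather than only the accuracy matters here, since two lemmatizers can score the same while failing on different word classes.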

The German lookup tables in spacy-lookups-data haven't been improved (they're still kind of terrible), but to clarify this issue I'll update the issue title.

adrianeboyd avatar Oct 02 '23 09:10 adrianeboyd