Russian pos tagging/lemmatization/morphological analysis fails with diacritics
It seems that while spaCy supports tokenization with diacritics, lemmatization, morphological analysis, and POS tagging don't work correctly when they are used.
How to reproduce the behaviour
import ru_core_news_lg
nlp = ru_core_news_lg.load()
doc = nlp('Я ви́жу му́жа и жену́')
print(doc[-1].pos_)    # PROPN (incorrect; it's just a noun)
print(doc[-1].lemma_)  # жену́ (incorrect; should be жена)
print(doc[-1].morph)   # nothing is printed, which is obviously incorrect
If the text is changed to remove the diacritics, all is well:
import re

from spacy.lang.char_classes import COMBINING_DIACRITICS

diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
doc = nlp(diacritics_re.sub('', 'Я ви́жу му́жа и жену́'))
print(doc[-1].pos_)    # NOUN
print(doc[-1].lemma_)  # жена
print(doc[-1].morph)   # Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing
pymorphy3/pymorphy2 doesn't handle diacritics
It seems pymorphy3/pymorphy2 doesn't handle diacritics, so perhaps diacritics should be stripped before parse is called:
import re
from spacy.lang.char_classes import COMBINING_DIACRITICS

diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
text = diacritics_re.sub('', token.text)  # token: the Token handed to the lemmatizer
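For a quick check of the underlying pymorphy behaviour (a sketch, assuming pymorphy3 is installed; pymorphy2 has the same API):

import pymorphy3

morph = pymorphy3.MorphAnalyzer()
# the stressed form isn't in the dictionary, so pymorphy falls back to guessing
print(morph.parse('жену́')[0])
# the bare form gets a proper dictionary analysis with the lemma жена
print(morph.parse('жену')[0])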
Thanks for the note, we'll take a look!
The suggestion for the lemmatizer is included in #12554.
For the poor tagging, etc. from the statistical models on tokens with diacritics, I think the best option would be to configure custom NORM, PREFIX, and SUFFIX features for ru and uk that strip the diacritics. If you want to try this out with the current spaCy release (v3.5), you can use a custom language to customize these methods, which are called lex_attr_getters in the language defaults, similar to this:
https://spacy.io/usage/linguistic-features#language-subclass
The defaults would be extended similar to this:
https://github.com/explosion/spaCy/blob/8e6a3d58d8fa092eede0fe323441b2aaa3c2042e/spacy/lang/ru/__init__.py#L13-L23
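For reference, the linked defaults are structured roughly like this (paraphrased from the pinned commit, so treat it as a sketch rather than the exact source):

from spacy.lang.ru.lex_attrs import LEX_ATTRS
from spacy.lang.ru.stop_words import STOP_WORDS
from spacy.lang.ru.tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from spacy.language import BaseDefaults, Language

class RussianDefaults(BaseDefaults):
    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
    lex_attr_getters = LEX_ATTRS
    stop_words = STOP_WORDS

class Russian(Language):
    lang = "ru"
    Defaults = RussianDefaults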
Wonderful! Thank you for the quick PR and suggestions.
I'm a noob when it comes to spaCy. I'm using it to generate tags on Anki flashcards to study Russian. But, if I understand you correctly, the model I use would need to be trained with diacritics. Is that correct (e.g. ru_core_news_lg will not work)?
I ask because I tried making a custom language and the results were still unsatisfactory (even with a patch similar to #12554).
import re

import spacy
import ru_core_news_lg
from spacy import attrs
from spacy.lang.char_classes import COMBINING_DIACRITICS
from spacy.lang.ru import Russian

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

def norm(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())

def prefix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[0]

def suffix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[-3:]

# copy the stock getters so the shared module-level dict isn't mutated
ATTR_GETTERS = dict(spacy.lang.ru.LEX_ATTRS)
ATTR_GETTERS.update({
    attrs.NORM: norm,
    attrs.PREFIX: prefix,
    attrs.SUFFIX: suffix,
})

class CustomRussianDefaults(Russian.Defaults):
    lex_attr_getters = ATTR_GETTERS

@spacy.registry.languages("custom_ru")
class CustomRussian(Russian):
    lang = "custom_ru"
    Defaults = CustomRussianDefaults

nlp = ru_core_news_lg.load()
# omitted the patching of _pymorphy_lemmatize
nlp.lang = 'custom_ru'  # set after load, so the loaded vocab still uses the stock getters
Test
>>> nlp('Я ви́жу му́жа и жену́')[-1].morph
Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing
>>> nlp('Я вижу мужа и жену')[-1].morph
Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing
The Animacy for жену́ is Inan with diacritics, which is incorrect.
The language and language defaults really need to be set before the pipeline is loaded at all, but you can test this a bit by modifying the pipeline on-the-fly instead. (A few things may already be cached, so it might not work 100%.)
nlp = spacy.load("ru_core_news_lg")
nlp.vocab.lex_attr_getters.update(...)
A cleaner version would basically make a copy of ru_core_news_lg where [nlp.lang] is edited to custom_ru. But with the above you should be able to test most things out. And keep in mind that the statistical models will still make mistakes, especially for ambiguous cases.
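Filling in the update(...) call with the same diacritics-stripping getters from the snippet above (my guess at what the maintainer intended), the on-the-fly test would look something like this; as noted, already-cached lexemes may keep their old values:

import re

import spacy
from spacy import attrs
from spacy.lang.char_classes import COMBINING_DIACRITICS

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

def strip(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())

nlp = spacy.load("ru_core_news_lg")
# overwrite the getters on the already-loaded vocab
nlp.vocab.lex_attr_getters.update({
    attrs.NORM: strip,
    attrs.PREFIX: lambda s: strip(s)[:1],
    attrs.SUFFIX: lambda s: strip(s)[-3:],
})

print(nlp('Я ви́жу му́жа и жену́')[-1].morph)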
I had the same problem and discovered at least a workaround: one can create two docs, one with the original stressed text and one with the diacritics removed. That way you can iterate through the docs in parallel, taking the correct (stressed) text from doc 1 and the grammatical information from doc 2.
It's half as fast, but it does work.
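A minimal sketch of that two-doc approach (assuming the stressed and stripped texts tokenize identically, so the tokens align one to one):

import re

import spacy
from spacy.lang.char_classes import COMBINING_DIACRITICS

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

nlp = spacy.load("ru_core_news_lg")
text = 'Я ви́жу му́жа и жену́'
doc_stressed = nlp(text)                      # keeps the stress marks
doc_plain = nlp(DIACRITICS_RE.sub('', text))  # analyses correctly

# surface form from the stressed doc, analysis from the plain doc
for stressed, plain in zip(doc_stressed, doc_plain):
    print(stressed.text, plain.lemma_, plain.pos_, plain.morph, sep='\t')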