udpipe

Reverse Lemmatisation?

Open blhills opened this issue 5 years ago • 5 comments

Hey Jan, thanks for the awesome work. I've been using the R package to handle lemmatisation on media corpora for multiple Central and Eastern European languages. However, I am wondering if there is a way to essentially reverse the process.

so I can run this:

library(udpipe)

udmodel <- udpipe_download_model(language = "croatian")

x <- udpipe(x = "izbori izbore izbora izborima", object = udmodel)

x

  doc_id paragraph_id sentence_id                      sentence start end term_id token_id    token lemma upos  xpos
1   doc1            1           1 izbori izbore izbora izborima     1   6       1        1   izbori izbor VERB Vmr3s
2   doc1            1           1 izbori izbore izbora izborima     8  13       2        2   izbore izbor NOUN Ncmpa
3   doc1            1           1 izbori izbore izbora izborima    15  20       3        3   izbora izbor NOUN Ncmpg
4   doc1            1           1 izbori izbore izbora izborima    22  29       4        4 izborima izbor NOUN Ncmpd

but what I would like is a way I can do something like

x <- udpipe(x = "izbor", object = udmodel)

and have it return the list of "izbori, izbore, izbora, izborima"

Is this possible?

blhills, Nov 26 '20 15:11

Hello blhills, no, it is currently not possible in the API to generate all inflected forms of a lemma. The lemma rules are in the C++ code but buried deep behind the general API. Maybe we can ask this in the MorphoDiTa GitHub repository.

jwijffels, Nov 26 '20 15:11

@foxik is there a part of the MorphoDiTa C++ API which allows for generating all possible inflected forms of a lemma, or can this be easily accessed through the UDPipe C++ API?

jwijffels, Nov 26 '20 16:11

MorphoDiTa offers such functionality (https://ufal.mff.cuni.cz/morphodita/api-reference#morpho_generate), but it needs a morphological dictionary (which we have only for Czech, Slovak and English). I.e., UDPipe models do not have any idea of "valid forms for a given lemma" -- they are designed only for analysis using rules like "remove -ed" (and let the tagger choose a valid result); for generation, these rules create a lot of invalid forms for a given lemma...

foxik, Nov 26 '20 17:11
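
To illustrate the over-generation problem described above, here is a toy sketch in R. The suffix list is a made-up assumption for illustration only and is not UDPipe's actual rule set; `generate` is a hypothetical helper, not part of udpipe or MorphoDiTa. It simply runs "remove -ed"-style analysis rules in the opposite direction.

```r
# Hypothetical analysis rules: "strip this suffix to get a lemma candidate".
# These are illustrative only, not the rules stored in a UDPipe model.
suffixes <- c("ed", "s", "ing")

# Running the rules in reverse: attach every suffix to the lemma.
# Nothing here knows which results are real words.
generate <- function(lemma) paste0(lemma, suffixes)

generate("walk")  # "walked" "walks" "walking"  (all valid)
generate("go")    # "goed" "gos" "going"        (two invalid forms)
```

Without a morphological dictionary there is nothing to tell the valid candidates from the invalid ones, which is the gap MorphoDiTa's dictionary fills for the languages it covers.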

Thank you Milan.

@blhills I think the easiest approach is to run the lemmatisation on your corpus of news articles and keep the resulting token/lemma combinations.

jwijffels, Nov 27 '20 14:11
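
A minimal sketch of that suggestion in base R, assuming the annotation result has `token` and `lemma` columns as in the output shown earlier in the thread. The data frame below is typed in by hand from that output so the example runs without downloading a model; in practice it would come from `udpipe(x = ..., object = udmodel)` on the full corpus. `forms_of` is a hypothetical helper, not a udpipe function.

```r
# Stand-in for the udpipe() result on a corpus (values copied from the thread).
x <- data.frame(
  token = c("izbori", "izbore", "izbora", "izborima"),
  lemma = c("izbor", "izbor", "izbor", "izbor"),
  upos  = c("VERB", "NOUN", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Keep each distinct lemma/token pair once, lower-cased so that
# sentence-initial capitals do not create duplicate entries.
pairs <- unique(data.frame(
  lemma = tolower(x$lemma),
  token = tolower(x$token),
  stringsAsFactors = FALSE
))

# Named list: lemma -> character vector of forms observed in the corpus.
lemma_forms <- split(pairs$token, pairs$lemma)

# Hypothetical helper: return the observed forms of a lemma,
# or an empty character vector if the lemma never occurred.
forms_of <- function(lemma, lookup = lemma_forms) {
  forms <- lookup[[tolower(lemma)]]
  if (is.null(forms)) character(0) else forms
}

forms_of("izbor")
```

The lookup only contains forms that actually occur in the corpus, and it inherits any tagging or lemmatisation mistakes made by the model, so a larger corpus gives better coverage but not a guaranteed complete paradigm.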

Hmm yeah, so just build out my own dataset of lemmas + inflections and call that dataset when I want to find the appropriate words.

It's one of those things that logically seemed pretty simple, so I thought perhaps I had overlooked a way of doing it in the package.

Anyway, thanks for the help and the great package. It is one of the best tools I have found for the work I'm doing.

blhills, Nov 27 '20 16:11