GreynirEngine
Common names interpreted as verbs
I ran the 100 most common first names in Iceland through greynir.parse. No female names are interpreted as verbs, but a few male ones are (listed below as name → lemma, where "so" is the verb category). See this gist for the code.
https://gist.github.com/jokull/2c1048bbc845feb46c717ac7c77e0cc5
- Einar → eina so
- Árni → árna so
- Helgi → helga so
- Ragnar → ragna so
- Óskar → óska so
- Birgir → birgja so
- Brynjar → brynja so
- Rúnar → rúna so
- Ómar → óma so
- Reynir → reyna so
- Garðar → garða so
- Steinar → steina so
If I call greynir.parse_single(f'Forstjórinn heitir {name}'), that gives Greynir enough context to work out that the name refers to a person. There is one exception: for the name Örn, "Örn." including the period becomes a single terminal:
>>> sentence = greynir.parse_single('Forstjórinn heitir Örn.')
>>> sentence.terminals
[Terminal(text='Forstjórinn', lemma='forstjóri', category='no', variants=['et', 'gr', 'kk', 'nf'], index=0),
 Terminal(text='heitir', lemma='heita', category='so', variants=['1', 'nf', 'et', 'fh', 'gm', 'nt', 'p3'], index=1),
 Terminal(text='Örn.', lemma='Örn.', category='no', variants=['et', 'hk', 'nf'], index=2)]
This is not a surprise, really, as Greynir has a preference for recognizing sentences (with verbs) rather than noun phrases, if both are possible. But for this use case, I would recommend using parse_noun_phrase() instead of parse_single() - that would always give preference to names instead of verbs. Would this solve your problem?
Having said that, the Örn. case is clearly a bug ;-)
Could this be related to "örn." being an abbreviation recognised by the tokenizer?
I'm parsing search queries. I've resorted to just searching both the lemma and the original query.
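For anyone with the same use case, the workaround of searching both forms could look roughly like the sketch below. Note that search_terms is a hypothetical helper and not part of Greynir. It assumes the lemmas have already been extracted, for example from the lemma field of the terminals shown earlier, and are passed in as plain strings.

```python
from typing import Iterable, List


def search_terms(query: str, lemmas: Iterable[str]) -> List[str]:
    """Return the original query followed by any lemmas that differ from it.

    Hypothetical helper (not a Greynir API): duplicates are dropped using a
    case-insensitive comparison, and the original query always comes first.
    """
    terms: List[str] = []
    seen = set()
    for candidate in [query, *lemmas]:
        cleaned = candidate.strip()
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            terms.append(cleaned)
    return terms


# "eina" stands in for a lemma the parser might return for the query "Einar".
print(search_terms("Einar", ["eina"]))
```

Keeping the original query first means an exact match on the name still wins even when the parser has lemmatized it as a verb.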