anki-morphs
Add option to skip cards that have the same lemma
Describe the bug
Marking a word as known does not prevent a new inflection of the same word from appearing.
AnkiMorphs treats each inflection of a root word as a new word and prioritizes it as such. Other inflections of the word are not suspended or marked as known.
Recalculating and changing the settings do not change this behavior.
Steps to reproduce the behavior
- use subs2srs library as morph source
- use ko_core_news_sm with spaCy to generate a frequency list from a corpus, or use collection frequency
- use ko_core_news_sm/md/lg morphemizer in note filters
- choose setting for "am-unknowns field shows morph lemmas"
- check suspend new cards with only known morphs
- recalc
- start reviews and mark morphs as known
- watch the same "lemma" come up for review each time it appears as a different inflection in the sentence.
Expected behavior
I expect the morphemizer to distill a word to something like a lemma (spaCy isn't capable of doing this properly with its Korean models, but that may or may not be a separate issue). I'd expect AnkiMorphs to show me new words and bury variations of the same word. Just as if I were learning English, I don't need a separate card for walk, walking, walked, will walk, might walk, want to walk, and so on for every single word.
Currently, it treats each inflection as a new word, so it behaves no differently than if the text were simply split on spaces.
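To make the expected behavior concrete, here is a minimal sketch of the skip rule I have in mind. It is not AnkiMorphs code: `Morph`, `unknown_lemmas`, and `should_skip` are hypothetical names, and it assumes each morph carries a correct lemma.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morph:
    """Hypothetical morph record, not the AnkiMorphs data model."""
    lemma: str       # dictionary form, e.g. "walk"
    inflection: str  # surface form as it appears on the card, e.g. "walked"


def unknown_lemmas(card_morphs, known_lemmas):
    """Return the lemmas on a card that have not been marked as known."""
    return {m.lemma for m in card_morphs} - set(known_lemmas)


def should_skip(card_morphs, known_lemmas):
    """Skip (bury/suspend) a card when every lemma on it is already known."""
    return not unknown_lemmas(card_morphs, known_lemmas)


known = {"I", "walk", "to", "school"}

# "I walked to school": only a new inflection of a known lemma.
card_a = [Morph("I", "I"), Morph("walk", "walked"),
          Morph("to", "to"), Morph("school", "school")]

# "I ran to school": contains a lemma that is actually new.
card_b = [Morph("I", "I"), Morph("run", "ran"),
          Morph("to", "to"), Morph("school", "school")]

print(should_skip(card_a, known))  # True
print(should_skip(card_b, known))  # False
```

Comparing on `lemma` rather than `inflection` is the whole point: "walked" never counts as a new word once "walk" is known.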
My setup
- Operating System: Windows 11
- Anki Version: 23.12.1 (1a1d4d54)
- AnkiMorphs Version: 2.1.0
Additional context
spaCy has three Korean models: ko_core_news_sm, ko_core_news_md, and ko_core_news_lg. They all work the same way for this purpose.
The website states that it lemmatizes Korean, but this isn't technically true. The lemma_ value returned by spaCy looks like this, with the raw word on the left and the "lemma" value on the right:
('준비했죠', '준비+하+었+죠')
('위해서', '위하+어서')
('먹을', '먹+ㄹ')
The "lemma" isn't a lemma at all, but rather a breakdown of each word part. The left-most part is only the stem, which isn't the dictionary form of the word. The verb for "to eat" is 먹다, not 먹; on its own, 먹 is a rare noun for an ink stick used for making writing ink.
A proper lemma value for these would look like this, placing them in their dictionary form:
('준비했죠', '준비하다')
('위해서', '위하다')
('먹을', '먹다')
To explain with a single word, this is what spaCy produces:
('먹다', '먹+다')
('먹었어', '먹+었+어')
('먹는데', '먹+는+데')
Lemmatized properly, it would look like this, with all three being inflections of the same word:
('먹다', '먹다')
('먹었어', '먹다')
('먹는데', '먹다')
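As a rough illustration of the gap between the two outputs, the pairs above can be reproduced with a small heuristic over spaCy's "+"-separated string. This is only a sketch, not how spaCy or AnkiMorphs works: `dictionary_form` and `DERIVATIONAL_SUFFIXES` are names I made up, the suffix set is an assumption, and the rule only makes sense for verbs and adjectives (a noun would wrongly get 다 appended, so a real version would need to check the part of speech first).

```python
# Assumption: suffixes that turn a noun root into a verb/adjective stem
# (e.g. 준비 + 하). This set is illustrative and not exhaustive.
DERIVATIONAL_SUFFIXES = {"하", "되"}


def dictionary_form(spacy_lemma: str) -> str:
    """Heuristically turn a 'stem+ending+...' string into a dictionary form.

    Only valid for verbs and adjectives; the caller must filter by part of speech.
    """
    parts = spacy_lemma.split("+")
    stem = parts[0]
    # Keep a derivational suffix attached to the root: 준비 + 하 -> 준비하
    if len(parts) > 1 and parts[1] in DERIVATIONAL_SUFFIXES:
        stem += parts[1]
    # Korean dictionary forms of verbs/adjectives end in 다.
    return stem + "다"


for raw, lemma in [
    ("준비했죠", "준비+하+었+죠"),
    ("위해서", "위하+어서"),
    ("먹을", "먹+ㄹ"),
    ("먹다", "먹+다"),
    ("먹었어", "먹+었+어"),
    ("먹는데", "먹+는+데"),
]:
    print((raw, dictionary_form(lemma)))
```

For the examples in this issue it yields 준비하다, 위하다, and 먹다, which is the grouping I would expect the morphemizer to provide.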
As you can see from the frequency list generated by AnkiMorphs, a word like 괜찮다 takes up 1034 slots on the frequency list.
The value of using a morphemizer instead of splitting on spaces is almost entirely lost.
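For illustration, this is roughly what I would expect the frequency list to do instead: sum the counts of all inflections under one lemma so the word occupies a single slot. The sketch below is not AnkiMorphs code; `collapse_by_lemma` is a hypothetical helper, and the counts are made-up numbers, not taken from my actual frequency list.

```python
from collections import Counter


def collapse_by_lemma(inflection_counts, lemma_of):
    """Sum per-inflection counts into one entry per lemma.

    Inflections without a known lemma are kept under their own surface form.
    """
    totals = Counter()
    for inflection, count in inflection_counts.items():
        totals[lemma_of.get(inflection, inflection)] += count
    return totals


# Made-up counts, for illustration only.
inflection_counts = {"먹다": 1, "먹었어": 3, "먹는데": 2}

# Proper lemmas for the inflections shown earlier in this issue.
lemma_of = {"먹다": "먹다", "먹었어": "먹다", "먹는데": "먹다"}

collapsed = collapse_by_lemma(inflection_counts, lemma_of)
print(len(inflection_counts), "slots before,", len(collapsed), "slot after")
print(collapsed.most_common())
```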