
Investigate finding "similar terms" for parent suggestions

Open jzohrab opened this issue 1 year ago • 4 comments

Currently, Lute asks the user to type in characters to find the parent match. There may be a way to pre-calculate likely parents for a term.

LWT uses an algorithm ostensibly based on http://www.catalysoft.com/articles/StrikeAMatch.html, but I don't know how accurate LWT's code is. That algorithm has a possibly buggy Python implementation in https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings. There are also libraries out there that have this and other algos we could try.
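For reference, here is a minimal sketch of the letter-pair comparison described in the Strike a Match article: each string is broken into adjacent character pairs, and the score is twice the number of shared pairs divided by the total number of pairs. This is not LWT's code or the Stack Overflow version; the function names `letter_pairs` and `similarity` are made up for illustration.

```python
def letter_pairs(text):
    """Return the adjacent character pairs of every whitespace-separated word."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def similarity(a, b):
    """Score two strings from 0.0 to 1.0 by their shared character pairs."""
    pairs_a = letter_pairs(a)
    pairs_b = letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        # Neither string has any pair (e.g. single characters), so nothing to compare.
        return 0.0
    matches = 0
    for pair in pairs_a:
        if pair in pairs_b:
            matches += 1
            # Remove the matched pair so a repeated pair is not counted twice.
            pairs_b.remove(pair)
    return 2.0 * matches / total


print(similarity("France", "French"))
print(similarity("Healed", "Sealed"))
```

Note that single-character terms produce no pairs at all, which may matter for character-based languages.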

I'm not sure how well this works with things like Japanese (char-based), or with accents -- should accents be "normalized" out of words? Does that even work for languages like Thai or Armenian? I don't know.
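One possible way to "normalize out" accents, sketched with Python's standard `unicodedata` module: decompose each character (NFD) and drop the combining marks. The helper name `strip_accents` is hypothetical. This looks reasonable for Latin-script accents, but it is only an assumption that it is the right approach in general. In Thai, for example, some vowel signs and tone marks are themselves combining marks, so this would strip part of the word.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters and drop non-spacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("café"))   # Latin accent is removed
print(strip_accents("հայ"))    # Armenian letters have no marks to strip
print(strip_accents("กิน"))     # Thai vowel sign is a combining mark and gets dropped
```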

Then the algorithm just needs a simple speed check -- e.g., if a user (like me) has 100K+ terms in a db, does it return the highest-scoring matches reasonably quickly?
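A rough sketch of what such a speed check could look like: generate 100K synthetic terms, rank them all against one term with a pair-based score, and time the scan. Everything here is an assumption for illustration (random ASCII terms instead of a real Lute db, a brute-force scan, and the hypothetical helpers `top_matches` and `random_terms`), so the timing is only a ballpark.

```python
import heapq
import random
import string
import time


def letter_pairs(text):
    """Return the adjacent character pairs of a single term."""
    word = text.upper()
    return [word[i:i + 2] for i in range(len(word) - 1)]


def similarity(a, b):
    """Pair-based score from 0.0 to 1.0 (twice the shared pairs over total pairs)."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0
    matches = 0
    for pair in pairs_a:
        if pair in pairs_b:
            matches += 1
            pairs_b.remove(pair)
    return 2.0 * matches / total


def top_matches(term, candidates, n=5):
    """Brute-force scan: score every candidate and keep the n best."""
    return heapq.nlargest(n, candidates, key=lambda c: similarity(term, c))


def random_terms(count, seed=0):
    """Build synthetic lowercase terms of 3 to 10 letters (stand-in for real data)."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
        for _ in range(count)
    ]


if __name__ == "__main__":
    terms = random_terms(100_000)
    start = time.perf_counter()
    best = top_matches("walking", terms)
    elapsed = time.perf_counter() - start
    print(f"{len(terms)} terms scanned in {elapsed:.3f}s; best: {best}")
```

If the brute-force scan turns out too slow, pre-computing the pairs per term would be the obvious next thing to try.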

jzohrab avatar Dec 23 '23 14:12 jzohrab

Speaking for Korean, if the first two characters are the same, there's a reasonable likelihood of a parent/child relationship.
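A minimal sketch of that heuristic, assuming terms are stored as precomposed Hangul syllables (so one Python character is one syllable block). The function name `prefix_candidates` and the `prefix_len` parameter are hypothetical.

```python
def prefix_candidates(term, known_terms, prefix_len=2):
    """Return known terms that share the first prefix_len characters with term."""
    if len(term) < prefix_len:
        # Too short to compare, so suggest nothing.
        return []
    prefix = term[:prefix_len]
    return [t for t in known_terms if t != term and t.startswith(prefix)]


# "공부했다" and "공부하다" share the first two syllables; "공원" shares only one.
print(prefix_candidates("공부했다", ["공부하다", "공원", "학교"]))
```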

jamesdeluk avatar Feb 08 '24 19:02 jamesdeluk

There is a browser add-on that tries to find similar terms and set a default value for it and mark it differently: https://github.com/geajack/Wordology

There could be a plug-in in Lute that tries to find the lemma of a term, too.

GrimPixel avatar Feb 12 '24 03:02 GrimPixel

Wordology looks super, thanks for the link :-)

Yes, something like a mapping of words, either pre-computed or with a lemma lookup, is close to the idea. The csv import is also a way of specifying a bulk mapping, and it's not a terrible way to do it.

jzohrab avatar Feb 12 '24 04:02 jzohrab

Try this: https://github.com/adbar/simplemma

GrimPixel avatar Feb 14 '24 23:02 GrimPixel