semantle-he

Apostrophes in words make them distinct when they aren't

Open alonscheuer opened this issue 3 years ago • 2 comments

A recognizable word with an apostrophe (or several apostrophes) added anywhere in it is treated as a distinct word, but gets the same closeness value as the word without the apostrophes.

For example, all of the following words were accepted as distinct words, and they all had the exact same closeness value: צבע צבע' 'צבע צ'בע צב'ע צב'ע' צ''''בע

More correct behavior would probably be to either reject those words or not count them as distinct from the original.

alonscheuer, Mar 10 '22 08:03

Thanks! It seems that gensim.corpora.wikicorpus, which we use to sanitize the w2v input, strips apostrophes by default. That may be avoidable, but it requires some investigation. To allow words like ז'בוטינסקי, we delete the apostrophes on the server side.

A possible solution to this bug (suggested by @Iddoyadlin) is to delete the apostrophe on the client side, at least until we figure out how to sanitize the data correctly.
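A minimal sketch of that idea, written in Python for illustration only. The function names and the set of apostrophe-like characters are assumptions and are not taken from the semantle-he code; in particular, treating the Hebrew geresh (U+05F3) and the typographic apostrophe (U+2019) the same way as the ASCII apostrophe is a guess about what users might type.

```python
from __future__ import annotations

# Assumption: characters to treat as an "apostrophe" in a guess.
# ASCII apostrophe, right single quotation mark, Hebrew geresh.
APOSTROPHE_LIKE = "'\u2019\u05f3"

# Translation table that deletes every apostrophe-like character.
_STRIP_TABLE = str.maketrans("", "", APOSTROPHE_LIKE)


def normalize_guess(word: str) -> str:
    """Trim whitespace and remove apostrophe-like characters (hypothetical helper)."""
    return word.strip().translate(_STRIP_TABLE)


def add_guess(seen: set[str], word: str) -> bool:
    """Record a guess by its normalized form.

    Returns False when the guess is empty after normalization or when the
    same normalized word was already guessed, so it is not counted twice.
    """
    key = normalize_guess(word)
    if not key or key in seen:
        return False
    seen.add(key)
    return True
```

With this, every apostrophe variant of a word collapses to the same key before the guess is recorded, so only the first one would count as a new guess. It does not address whether the model itself should keep apostrophes, which is the part that still needs investigation.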

ishefi, Mar 10 '22 09:03

I'm not sure if this is related, but ג'ירף and ג'ירפה are not recognized.

Itamarb01, Apr 16 '22 12:04