semantle-he
Apostrophes in words make them distinct when they aren't
A recognizable word with an apostrophe (or several apostrophes) added anywhere is treated as a distinct word, yet it gets the same closeness value as the word without the apostrophes.
For example, all of the following words were accepted as distinct words, and they all had the exact same closeness value: צבע צבע' 'צבע צ'בע צב'ע צב'ע' צ''''בע
The more correct behavior would probably be either to reject those words or to not count them as distinct from the original.
Thanks!
It seems that gensim.corpora.wikicorpus, which we use to sanitize the w2v input, strips apostrophes by default. This may be avoidable, but it requires some investigation. To allow words like ז'בוטינסקי, we delete the apostrophes on the server side.
A possible solution to this bug (suggested by @Iddoyadlin) is to delete the apostrophe on the client side, at least until we figure out how to sanitize the data correctly.
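A rough sketch of that idea, written in Python only to illustrate the logic; the real client code would need the same logic ported to its own language. The names `normalize_guess`, `add_guess`, and `APOSTROPHES` are hypothetical and not taken from the semantle-he code base, and treating the Hebrew geresh and the typographic apostrophe as equivalent to the ASCII apostrophe is an assumption:

```python
# Hypothetical helpers, not part of semantle-he.
# Assumption: these three characters should all count as "an apostrophe".
APOSTROPHES = ("'", "\u2019", "\u05f3")  # ASCII apostrophe, right single quote, Hebrew geresh


def normalize_guess(word: str) -> str:
    """Strip apostrophe-like characters so all variants map to one key."""
    for ch in APOSTROPHES:
        word = word.replace(ch, "")
    return word.strip()


def add_guess(seen: set, word: str) -> bool:
    """Record a guess; return False if its normalized form was already guessed."""
    key = normalize_guess(word)
    if not key or key in seen:
        return False
    seen.add(key)
    return True


# The variants from the report collapse into a single guess.
seen = set()
results = [add_guess(seen, w) for w in ["צבע", "צבע'", "'צבע", "צ'בע", "צ''''בע"]]
```

With this in place only the first variant would be counted as a new guess. The trade-off is that two words that differ only by an apostrophe would also be merged, which matches what the server already does when it deletes apostrophes.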
I'm not sure if this is related, but ג'ירף and ג'ירפה are not recognized.