semantle-he

Dealing with plene/deficient spelling

Open Elinor10 opened this issue 3 years ago • 2 comments

A couple of days ago the solution was "דעה". I guessed "דיעה" and it got only 996/1000 (66.54).

  • The same word in plene spelling (כתיב מלא) and in deficient spelling (כתיב חסר) should generate the same similarity ranking.
  • I thought of 2 possible solutions for this: (1) Standardize the guessed words by converting all plene spellings to deficient spelling, or the other way around (just like the English version of the game automatically converts all words to lower case and British spelling to American). (2) Reject one form of spelling (plene or deficient).
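A minimal sketch of both options, assuming a lookup table of spelling variants exists. The table `PLENE_TO_DEFICIENT` and the functions `normalize_guess` and `is_rejected` are hypothetical names, not part of the game's code, and the only entry is the pair from this report:

```python
# Hypothetical table mapping plene spellings to a canonical deficient form.
# The single entry is the example from this issue; a real table would need
# a curated source, which is not identified here.
PLENE_TO_DEFICIENT = {
    "דיעה": "דעה",
}


def normalize_guess(guess: str) -> str:
    """Option 1: return the canonical spelling, or the guess unchanged."""
    word = guess.strip()
    return PLENE_TO_DEFICIENT.get(word, word)


def is_rejected(guess: str) -> bool:
    """Option 2: report whether the guess uses the non-canonical spelling."""
    return guess.strip() in PLENE_TO_DEFICIENT
```

With option 1, both "דיעה" and "דעה" would be scored as "דעה". With option 2, "דיעה" would be refused before scoring.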

Elinor10, Mar 21 '22 02:03

This is indeed an issue - but it should be addressed during the training of the model. In the English version it makes sense to turn British spelling into American spelling, since the former is much less frequent in the dataset. The fact that דיעה and דעה were that close suggests, IMO, that they are used with similar frequency in Hebrew Wikipedia - so turning one into the other might result in weird behavior in some cases. For the same reason, I don't think it's a good idea to reject either of the forms.

(and btw, the "turning to lower case" is optional in the English version - because, for example, "nice" and "Nice" mean different things)

The main issue here is identifying plene/deficient spelling pairs. Once that is solved, it makes sense to normalize them before training.
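One rough way to surface candidate pairs for review is to group vocabulary words that become identical once vav and yod are dropped after the first letter. This is only a heuristic sketch, not a known solution: `skeleton` and `candidate_groups` are hypothetical names, and the approach will also group unrelated words in which vav or yod is a root consonant, so its output would need manual checking.

```python
from collections import defaultdict

MATRES = "וי"  # vav and yod, the letters that plene spelling typically adds


def skeleton(word: str) -> str:
    """Drop vav/yod after the first letter to get a rough skeleton."""
    if not word:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in MATRES)


def candidate_groups(vocabulary):
    """Group words sharing a skeleton; keep groups with 2+ members."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[skeleton(word)].add(word)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Run over the model's vocabulary, this would put "דיעה" and "דעה" in the same group, since both reduce to "דעה".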

ishefi, Mar 21 '22 08:03

I also think it's a good idea, but I'm unfamiliar with any packages that implement a plene-to-deficient (or deficient-to-plene) transformation. Pointing out such a package is a prerequisite for solving this issue.

Iddoyadlin, Mar 21 '22 10:03

Closing, as there is no clear path to a solution.

Iddoyadlin, Nov 27 '22 08:11