
Fix splitting algorithm for unknown characters

Open timminator opened this issue 6 months ago • 0 comments

This PR solves issue #10 by reducing the penalty term and by no longer allowing the algorithm to choose a "one-time" payment strategy. A detailed explanation can be found in the issue thread.

The penalty term of 25 was not chosen arbitrarily: it is based on the maximum cost that can be assigned to a word. This maximum cost can be calculated like this:

$$\text{Cost}_{\text{max}} = \ln(N \cdot \ln(N))$$

If you set the maximum cost to 25 and solve the equation for N, the number of dictionary entries, you find that a dictionary would need more than roughly 3.3 billion entries before a known word's cost could exceed the penalty term that an unknown word gets. A penalty term of 20 corresponds to about 28 million entries, but 25 is definitely on the safe side and works perfectly fine with the algorithm.
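
The numbers above can be reproduced with a short sketch. It only uses the maximum-cost formula from this description; the function names `max_word_cost` and `dictionary_size_for_cost` are hypothetical helpers for illustration and are not part of wordninja. Since the equation has no closed-form solution in elementary functions, the sketch solves it numerically by bisection.

```python
import math


def max_word_cost(n: float) -> float:
    """Maximum cost for a dictionary with n entries: ln(n * ln(n))."""
    return math.log(n * math.log(n))


def dictionary_size_for_cost(cost: float) -> float:
    """Solve ln(N * ln(N)) = cost for N by bisection.

    Assumes cost >= 1, so that the root lies between 2 and e**cost.
    """
    target = math.exp(cost)      # equivalent equation: N * ln(N) = target
    lo, hi = 2.0, target         # N * ln(N) is increasing for N > 1
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.log(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for penalty in (20, 25):
        n = dictionary_size_for_cost(penalty)
        print(f"penalty {penalty}: about {n:,.0f} entries")
```

Running it for penalties of 20 and 25 should give values close to the 28 million and 3.3 billion entries mentioned above.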

This took me quite some time to figure out, so I would appreciate it if this could be merged.

Fix #10

timminator, Jun 23 '25 21:06