CODEC-308: change NYSIIS encoding to not remove the first character i…
With the current implementation of NYSIIS, it is possible to incorrectly remove the first character from the encoding.
According to the algorithm, the first character of the string should be the first character of the encoding; the remaining rules are then applied to the rest of the string, transforming or removing characters. The implementation in commons-codec passes the entire string into the transcodeRemaining method, which works for the most part, and afterwards checks only that there is at least 1 character before removing the final 'A' or 'S'.
The problem is that a word like "ASH" ends up with a single final character, "A". Similarly, "SSH" leaves "S". The current logic then removes that character and returns a blank string, when it should still return at least the first letter of the original string.
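To make the intended guard concrete, here is a minimal sketch. It is not the commons-codec code: `NysiisTailSketch` and `trimTail` are hypothetical names, and the helper covers only the final 'S' / 'A' removal, assuming it receives the key after the other rules have already run.

```java
// Hypothetical illustration only; not the commons-codec implementation.
final class NysiisTailSketch {

    // Removes a trailing 'S', then a trailing 'A', but never the last
    // remaining character, so the result keeps at least one letter.
    static String trimTail(final String key) {
        final StringBuilder sb = new StringBuilder(key);
        // Drop a final 'S' only when more than one character remains.
        if (sb.length() > 1 && sb.charAt(sb.length() - 1) == 'S') {
            sb.deleteCharAt(sb.length() - 1);
        }
        // Drop a final 'A' only when more than one character remains.
        if (sb.length() > 1 && sb.charAt(sb.length() - 1) == 'A') {
            sb.deleteCharAt(sb.length() - 1);
        }
        return sb.toString();
    }
}
```

With this guard, a key that has already been reduced to a single "A" or "S" is returned unchanged instead of becoming a blank string.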
@Ben-Waters Not directly related, but do you have any thoughts on https://github.com/apache/commons-codec/pull/36?
Hmm, I'm no expert on this, since it isn't NYSIIS but a different algorithm, but I can take a look.
I just re-read the comments. It seems that:
- We are not sure if the current code implements the "plain" or original algorithm.
- We don't have access to, or cannot find, the paper for the original algorithm.
- If we implement the plain original algorithm, then we can add a new class for the newer "modified" algorithm.
- If we do not implement the plain original algorithm, then we need to talk about that.
Help needed.