commons-codec

CODEC-308: change NYSIIS encoding to not remove the first character i…

Open Ben-Waters opened this issue 2 years ago • 3 comments

With the current implementation of NYSIIS, it is possible to incorrectly remove the first character from the encoding.

According to the algorithm, the first character of the string should be the first character of the encoding; the remaining rules are then applied to the rest of the string, and characters are removed. The implementation in commons-codec passes the entire string into the transcodeRemaining method, which works for the most part, and afterwards checks that there is at least 1 character before removing the final 'A' or 'S'.

The problem is that a word like "ASH" ends up with a single final character of "A". Similarly, "SSH" ends up with "S". The logic currently removes that last character and returns a blank string, when it should still return at least the first letter of the original string.
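
To make the expected behaviour concrete, here is a minimal sketch of the guard described above. It is not the commons-codec code: the class `NysiisTailSketch` and the helper `stripTail` are hypothetical names, and the helper assumes it receives the key after all other transcoding rules have run. It only shows the final-character step, where a trailing 'S' or 'A' is dropped only while more than one character remains.

```java
public class NysiisTailSketch {

    // Hypothetical helper (not part of commons-codec): applies only the
    // "remove final S, then final A" step to an already transcoded key.
    static String stripTail(String key) {
        StringBuilder sb = new StringBuilder(key);
        // Drop a trailing 'S' only if that leaves at least one character.
        if (sb.length() > 1 && sb.charAt(sb.length() - 1) == 'S') {
            sb.deleteCharAt(sb.length() - 1);
        }
        // Re-check the length before dropping a trailing 'A', so the
        // first character of the key is never removed.
        if (sb.length() > 1 && sb.charAt(sb.length() - 1) == 'A') {
            sb.deleteCharAt(sb.length() - 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTail("A"));   // single 'A' is kept
        System.out.println(stripTail("S"));   // single 'S' is kept
        System.out.println(stripTail("BAS")); // trailing 'S' and 'A' are dropped
    }
}
```

With a guard like this, a key that has been reduced to a single "A" or "S" is returned unchanged instead of becoming a blank string.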

Ben-Waters avatar Jun 26 '23 22:06 Ben-Waters

@Ben-Waters Not directly related, but do you have any thoughts on https://github.com/apache/commons-codec/pull/36?

garydgregory avatar Jun 27 '23 12:06 garydgregory

@Ben-Waters Not directly related, but do you have any thoughts on #36?

Hmmm, I'm no expert on this, since it isn't NYSIIS but some other algorithm, but I can take a look.

Ben-Waters avatar Jun 28 '23 03:06 Ben-Waters

I just re-read the comments. It seems like:

  • We are not sure if the current code implements the "plain" or original algorithm.
  • We don't have access to, or cannot find, the paper for the original algorithm.
  • If we implement the plain original algorithm, then we can add a new class for the newer "modified" algorithm.
  • If we do not implement the plain original algorithm, then we need to talk about that.

Help needed.

garydgregory avatar Aug 11 '23 23:08 garydgregory