Inconsistent comparison of Korean syllables vs individual jamo

Open rs-sac opened this issue 8 months ago • 2 comments

During some experimentation, I ran the following:

let locale = locale!("ko").into();
let mut options = CollatorOptions::default();
options.strength = Some(Strength::Primary);
let collator = Collator::try_new(locale, options).unwrap();
println!("이 - {:?}", collator.compare("이", "ㅇㅣ"));
println!("일 - {:?}", collator.compare("일", "ㅇㅣㄹ"));
println!("읽 - {:?}", collator.compare("읽", "ㅇㅣㄹㄱ"));

The answers were: equal, greater, greater.

I don't know what is intended, whether the strings should compare equal or not, whether precomposed syllables should be separated from jamo or not, but I am surprised that those three answers are not all the same, at least at primary strength.

It seems that syllables compare as equal to the individual jamo if there are no batchim (bottom consonants), but not if there are any batchim.

rs-sac, May 15 '25 16:05

Hangul syllables are supposed to compare equal with the corresponding conjoining jamo, and the individual jamo here aren't conjoining jamo.

I agree that it's surprising that non-conjoining and conjoining jamo apparently aren't primary-equal, and I don't know enough of the background to say why or how intentional that is.
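
To make the distinction between precomposed syllables, conjoining jamo, and the standalone jamo from the report concrete, here is a minimal sketch using only the Rust standard library. It applies the arithmetic Hangul syllable decomposition from the Unicode Standard and prints the resulting code points next to those of the standalone (Hangul Compatibility Jamo block) letters. The helper names `decompose_syllable` and `code_points` are made up for illustration and are not part of ICU4X; the sketch assumes the strings in the report use compatibility jamo, and it says nothing about how ICU4X assigns collation weights.

```rust
// Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant (choseong)
const V_BASE: u32 = 0x1161; // first vowel (jungseong)
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant (jongseong)
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = 19 * N_COUNT; // 11172

/// Hypothetical helper: splits a precomposed Hangul syllable into conjoining jamo.
/// Returns `None` for characters outside the precomposed syllable range.
fn decompose_syllable(c: char) -> Option<Vec<char>> {
    let s_index = (c as u32).checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None;
    }
    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = T_BASE + s_index % T_COUNT;
    let mut jamo = vec![char::from_u32(l)?, char::from_u32(v)?];
    // A trailing index of zero means the syllable has no batchim.
    if t != T_BASE {
        jamo.push(char::from_u32(t)?);
    }
    Some(jamo)
}

/// Hypothetical helper: formats each character of `s` as `U+XXXX`.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // The three syllables from the report: U+C774, U+C77C, U+C77D.
    for syllable in ['\u{C774}', '\u{C77C}', '\u{C77D}'] {
        let jamo: String = decompose_syllable(syllable)
            .expect("input is a precomposed Hangul syllable")
            .into_iter()
            .collect();
        println!("{} -> conjoining jamo {:?}", syllable, code_points(&jamo));
    }
    // Standalone letters from the Hangul Compatibility Jamo block (U+3131..).
    println!("compatibility jamo {:?}", code_points("\u{3147}\u{3163}\u{3139}"));
}
```

The conjoining sequence for U+C77C ends in the trailing consonant U+11AF, while the standalone letter U+3139 is a different character in a different block. In the Unicode Character Database, U+3139 has a compatibility decomposition to the leading consonant U+1105 rather than to U+11AF, which may be relevant to the batchim cases, though that is not confirmed here.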

hsivonen, May 16 '25 14:05

The UCA spec discusses multiple methods of handling conjoining jamo, and I'm not sure which one ICU4C and ICU4X use.

hsivonen, May 16 '25 14:05