[Bug]: `disassembleHangul` is incorrect for double vowels and double consonants
Bug description
In the official docs, `disassembleHangul` is described as "한글 문자열을 글자별로 초성/중성/종성 단위로 완전히 분리하여" (in English, "completely separates a Hangul string, character by character, into onset/nucleus/coda units"), which is not what the function actually does.
- Double consonants (e.g., ㄳ, ㄽ) should be treated as a single unit. Currently they aren't.
- Double vowels (e.g., ㅐ, ㅘ) should be treated as a single unit. Currently this holds only for some of them (ㅐ stays intact, ㅘ does not).
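For reference, the behavior described in the two bullets above can be sketched with plain Unicode arithmetic. This is only an illustration of the expected output, not the library's implementation: `disassembleToJamo` and the three tables are hypothetical names introduced here. It relies on the standard layout of the Hangul Syllables block (U+AC00 to U+D7A3), where each syllable is encoded as `0xAC00 + (onset * 21 + nucleus) * 28 + coda`.

```typescript
// Hypothetical helper (not part of the library): strict onset/nucleus/coda split.
// Tables follow the Unicode ordering of the Hangul Syllables block.
const ONSETS = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'.split(''); // 19 onsets
const NUCLEI = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'.split(''); // 21 nuclei
// Index 0 means "no coda", followed by the 27 possible codas.
const CODAS = [''].concat('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ'.split(''));

function disassembleToJamo(text: string): string {
  return text
    .split('')
    .map((ch) => {
      const offset = ch.charCodeAt(0) - 0xac00;
      // Anything outside the 11,172 precomposed syllables passes through unchanged.
      if (offset < 0 || offset > 11171) return ch;
      const onset = ONSETS[Math.floor(offset / (21 * 28))];
      const nucleus = NUCLEI[Math.floor(offset / 28) % 21];
      const coda = CODAS[offset % 28];
      return onset + nucleus + coda;
    })
    .join('');
}

console.log(disassembleToJamo('과')); // 'ㄱㅘ' (double vowel kept as one nucleus)
console.log(disassembleToJamo('값')); // 'ㄱㅏㅄ' (double consonant kept as one coda)
```

With this approach every syllable yields at most three jamo, which is what the docs' wording suggests.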
Expected behavior
h.disassembleHangul('개') // I think this should be 'ㄱㅐ', and it is.
h.disassembleHangul('과') // I think this should be 'ㄱㅘ', but it isn't; it's 'ㄱㅗㅏ'.
To Reproduce
Possible Solution
Skipped, because I'm unsure whether this is intentional or not.
etc.
Here are the full test cases. I think every assertion should pass.
// double consonant 1 (ㄲ, ㄸ, ㅃ, ㅆ, ㅉ)
// onset
h.disassembleHangul('까') == 'ㄲㅏ'
h.disassembleHangul('따') == 'ㄸㅏ'
h.disassembleHangul('빠') == 'ㅃㅏ'
h.disassembleHangul('싸') == 'ㅆㅏ'
h.disassembleHangul('짜') == 'ㅉㅏ'
// coda
h.disassembleHangul('갂') == 'ㄱㅏㄲ'
h.disassembleHangul('갔') == 'ㄱㅏㅆ'
// double consonant 2 (ㄳ, ㄵ, ㄶ, ㄺ, ㄻ, ㄼ, ㄽ, ㄾ, ㄿ, ㅀ, ㅄ)
// coda
h.disassembleHangul('갃') == 'ㄱㅏㄳ' // false
h.disassembleHangul('갅') == 'ㄱㅏㄵ' // false
h.disassembleHangul('갆') == 'ㄱㅏㄶ' // false
h.disassembleHangul('갉') == 'ㄱㅏㄺ' // false
h.disassembleHangul('갊') == 'ㄱㅏㄻ' // false
h.disassembleHangul('갋') == 'ㄱㅏㄼ' // false
h.disassembleHangul('갌') == 'ㄱㅏㄽ' // false
h.disassembleHangul('갍') == 'ㄱㅏㄾ' // false
h.disassembleHangul('갎') == 'ㄱㅏㄿ' // false
h.disassembleHangul('갏') == 'ㄱㅏㅀ' // false
h.disassembleHangul('값') == 'ㄱㅏㅄ' // false
// single vowel (ㅏ, ㅑ, ㅓ, ㅕ, ㅗ, ㅛ, ㅜ, ㅠ, ㅡ, ㅣ)
// nucleus
h.disassembleHangul('가') == 'ㄱㅏ'
h.disassembleHangul('갸') == 'ㄱㅑ'
h.disassembleHangul('거') == 'ㄱㅓ'
h.disassembleHangul('겨') == 'ㄱㅕ'
h.disassembleHangul('고') == 'ㄱㅗ'
h.disassembleHangul('교') == 'ㄱㅛ'
h.disassembleHangul('구') == 'ㄱㅜ'
h.disassembleHangul('규') == 'ㄱㅠ'
h.disassembleHangul('그') == 'ㄱㅡ'
h.disassembleHangul('기') == 'ㄱㅣ'
// double vowel (ㅐ, ㅒ, ㅔ, ㅖ, ㅘ, ㅙ, ㅚ, ㅝ, ㅞ, ㅟ, ㅢ)
// nucleus
h.disassembleHangul('개') == 'ㄱㅐ'
h.disassembleHangul('걔') == 'ㄱㅒ'
h.disassembleHangul('게') == 'ㄱㅔ'
h.disassembleHangul('계') == 'ㄱㅖ'
h.disassembleHangul('과') == 'ㄱㅘ' // false
h.disassembleHangul('괘') == 'ㄱㅙ' // false
h.disassembleHangul('괴') == 'ㄱㅚ' // false
h.disassembleHangul('궈') == 'ㄱㅝ' // false
h.disassembleHangul('궤') == 'ㄱㅞ' // false
h.disassembleHangul('귀') == 'ㄱㅟ' // false
h.disassembleHangul('긔') == 'ㄱㅢ' // false
I guess this is closely related to the common Korean keyboard layout, which makes sense in some ways.
I just wanted to point out that the inconsistency adds cognitive load for users.
One likely use is reassembling text that was previously disassembled, so we might need an option that controls how double consonants are handled.
export function assembleHangul(words: string[]) {
  // Flatten the input into individual jamo, then fold them back into syllables pairwise.
  const disassembled = disassembleHangul(words.join('')).split('');
  return disassembled.reduce(binaryAssembleHangul);
}
assembleHangul(['값', 'ㅣ ', '너무', '빘 ', 'ㅏ']) // splitting the double coda is useful here, to produce "갑시 너무 비싸"
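One way such an option could work is as a separate pass that splits compound jamo into their components only when asked. The sketch below is an assumption, not the library's code: `KEYSTROKES` and `splitCompoundJamo` are hypothetical names, and the mapping is inferred from the failing cases listed above (ㅘ, ㅙ, ㅚ, ㅝ, ㅞ, ㅟ, ㅢ and the eleven compound codas), which the issue suggests but does not confirm is the library's exact table.

```typescript
// Hypothetical mapping from compound jamo to their components.
// ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅐ, ㅒ, ㅔ, ㅖ are deliberately absent, matching the passing cases above.
const KEYSTROKES: Record<string, string> = {
  ㅘ: 'ㅗㅏ', ㅙ: 'ㅗㅐ', ㅚ: 'ㅗㅣ', ㅝ: 'ㅜㅓ', ㅞ: 'ㅜㅔ', ㅟ: 'ㅜㅣ', ㅢ: 'ㅡㅣ',
  ㄳ: 'ㄱㅅ', ㄵ: 'ㄴㅈ', ㄶ: 'ㄴㅎ', ㄺ: 'ㄹㄱ', ㄻ: 'ㄹㅁ', ㄼ: 'ㄹㅂ',
  ㄽ: 'ㄹㅅ', ㄾ: 'ㄹㅌ', ㄿ: 'ㄹㅍ', ㅀ: 'ㄹㅎ', ㅄ: 'ㅂㅅ',
};

// Hypothetical helper: expands compound jamo in an already-disassembled string.
function splitCompoundJamo(jamo: string): string {
  return jamo
    .split('')
    .map((j) => KEYSTROKES[j] ?? j) // leave anything not in the table untouched
    .join('');
}

console.log(splitCompoundJamo('ㄱㅘ')); // 'ㄱㅗㅏ'
console.log(splitCompoundJamo('ㄱㅏㅄ')); // 'ㄱㅏㅂㅅ'
```

Keeping the split as an opt-in step would let the default output match the documented onset/nucleus/coda behavior, while reassembly use cases like the one above could still request the fully split form.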
Thank you for sharing your thoughts. I'll keep the issue closed since there has been no further discussion. If you'd like to discuss it further, please feel free to reopen the issue.