confusable_homoglyphs Confusables for ㅋ vs. ᄏ

Confusables for ㅋ vs. ᄏ

Open ariutta opened this issue 6 years ago • 5 comments

I'm confused as to why I'm getting different results for ㅋ vs. ᄏ. The Unicode site gives the original plus 2 additional homoglyphs for ㅋ:

ㅋ ᄏ ᆿ

But the confusable_homoglyphs package yields just one additional homoglyph initially. I only get the other one when I look for homoglyphs of that previous result:

from confusable_homoglyphs import confusables
khieukh1s = confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh1s[0]['homoglyphs']))
# >> {'ᄏ'}
khieukh2s = confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh2s[0]['homoglyphs']))
# >> {'ㅋ', 'ᆿ'}

Is this expected behavior?

(Somewhat related to this issue.)

May 18 '18 06:05 ariutta

@ariutta Sorry for the late answer. I updated the Unicode data files and released the result as 3.2.0. Could you please check that it now behaves as expected?

Aug 31 '18 14:08 vhf

Hi @vhf, thanks for checking on this, and no worries about the delay!

I tried version 3.2.0, and I think Case 1 fails but Case 2 passes.

Case 1

Input: ㅋ (U+314B : HANGUL LETTER KHIEUKH)

Expected Output: {'ᄏ', 'ᆿ'}

U+110F : HANGUL CHOSEONG KHIEUKH {K}
U+11BF : HANGUL JONGSEONG KHIEUKH {K}

Actual Output: {'ᄏ'}

U+110F : HANGUL CHOSEONG KHIEUKH {K}

Code

from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))

Case 2

Input: ᄏ (U+110F : HANGUL CHOSEONG KHIEUKH {K})

Expected Output: {'ᆿ','ㅋ'}

U+11BF : HANGUL JONGSEONG KHIEUKH {K}
U+314B : HANGUL LETTER KHIEUKH

Actual Output: {'ᆿ', 'ㅋ'}

U+11BF : HANGUL JONGSEONG KHIEUKH {K}
U+314B : HANGUL LETTER KHIEUKH

Code

from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))

Aug 31 '18 18:08 ariutta

Thanks! I'll take a closer look later. For now, here's what the Unicode data file says:

314B ;	110F ;	MA	# ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH	# 
11BF ;	110F ;	MA	# ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH	#

Aug 31 '18 18:08 vhf

I can confirm your two cases: 1 fails, 2 passes. The data files here confirm that this is correct; what might not be correct is my interpretation of the spec: http://www.unicode.org/reports/tr39/#Confusable_Detection

From:

314B ;	110F ;	MA	# ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH	# 
11BF ;	110F ;	MA	# ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH	#

I infer that

HANGUL CHOSEONG KHIEUKH can be confused with:
- HANGUL LETTER KHIEUKH
- HANGUL JONGSEONG KHIEUKH
HANGUL LETTER KHIEUKH can be confused with:
- HANGUL CHOSEONG KHIEUKH
HANGUL JONGSEONG KHIEUKH can be confused with:
- HANGUL CHOSEONG KHIEUKH
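
The inference above can be reproduced with a minimal sketch that reads only the two quoted data lines and links each source to its target and back, with no transitive step. The names `DATA_LINES`, `parse_pair` and `direct_confusables` are made up for illustration and are not part of confusable_homoglyphs.

```python
from collections import defaultdict

# The two data lines quoted above: source ; target ; type # comment
DATA_LINES = [
    "314B ;\t110F ;\tMA\t# ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH\t#",
    "11BF ;\t110F ;\tMA\t# ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH\t#",
]


def parse_pair(line):
    """Return the (source, target) strings encoded in one data line."""
    # Drop the trailing comment, then split the remaining fields on ';'.
    fields = line.split("#", 1)[0].split(";")
    source_hex, target_hex = fields[0].strip(), fields[1].strip()
    # A field may hold several space-separated code points.
    source = "".join(chr(int(cp, 16)) for cp in source_hex.split())
    target = "".join(chr(int(cp, 16)) for cp in target_hex.split())
    return source, target


def direct_confusables(lines):
    """Link each source to its target and back; nothing is propagated further."""
    table = defaultdict(set)
    for line in lines:
        source, target = parse_pair(line)
        table[source].add(target)
        table[target].add(source)
    return dict(table)


direct = direct_confusables(DATA_LINES)
for char, others in direct.items():
    print("U+%04X" % ord(char), sorted("U+%04X" % ord(o) for o in others))
```

Under this reading, U+314B and U+11BF are each linked only to U+110F and never to each other, which matches the Case 1 result.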

@ariutta Can you see the issue here? What am I missing from the spec?

I guess something is incorrect here: https://github.com/vhf/confusable_homoglyphs/blob/master/confusable_homoglyphs/cli.py#L70 but the spec, like any spec, isn't that easy to understand. :)

Some code I played with

import unittest
from pprint import pprint

from confusable_homoglyphs import confusables


# The class name is a placeholder; only the test methods were posted.
class TestHangulKhieukh(unittest.TestCase):

    def test_confusable_with_a(self):
        # Case 1: currently yields only HANGUL CHOSEONG KHIEUKH
        HANGUL_LETTER_KHIEUKH = u'ㅋ'
        result = confusables.is_confusable(HANGUL_LETTER_KHIEUKH, preferred_aliases=[], greedy=True)
        pprint(result)
        pprint(set(map(lambda x: x['c'], result[0]['homoglyphs'])))

    def test_confusable_with_b(self):
        # Look up the jongseong form itself (the posted version re-queried 'ㅋ' here)
        HANGUL_JONGSEONG_KHIEUKH = u'ᆿ'
        result = confusables.is_confusable(HANGUL_JONGSEONG_KHIEUKH, preferred_aliases=[], greedy=True)
        pprint(result)
        pprint(set(map(lambda x: x['c'], result[0]['homoglyphs'])))

    def test_confusable_with_c(self):
        ## this one passes and should still pass
        HANGUL_CHOSEONG_KHIEUKH = u'ᄏ'
        confusable_with = confusables.is_confusable(HANGUL_CHOSEONG_KHIEUKH, preferred_aliases=[], greedy=True)
        confusable_char_names = set(map(lambda x: x['n'], confusable_with[0]['homoglyphs']))
        expected = set(['HANGUL LETTER KHIEUKH', 'HANGUL JONGSEONG KHIEUKH'])
        self.assertEqual(confusable_char_names, expected)

Sep 01 '18 09:09 vhf

Hi @vhf, sorry it's taken me so long to respond.

I'm not a Unicode/Korean letter expert either, but I based my expectation on the output of this unicode.org "confusables" tool: https://unicode.org/cldr/utility/confusables.jsp?a=%E3%85%8B&r=None

Does that tool correctly match the spec? I can't say for sure, but the result seems plausible at least based on the visual comparison of the characters.
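
One reading that would produce the tool's three-character result: UTS #39 defines two strings as confusable when their skeletons are equal, so every character folded to the same prototype lands in one class. Below is a minimal sketch of that reading. It assumes a hard-coded `PROTOTYPES` table taken from the two data lines in this thread, and a single-character `skeleton` that skips the normalization steps of the real algorithm. All names are hypothetical, and this is not claimed to be how the unicode.org tool or confusable_homoglyphs is implemented.

```python
from collections import defaultdict

# Assumed table, hard-coded from the two data lines in this thread:
# each source character maps to the prototype it is folded into.
PROTOTYPES = {
    "\u314b": "\u110f",  # HANGUL LETTER KHIEUKH -> HANGUL CHOSEONG KHIEUKH
    "\u11bf": "\u110f",  # HANGUL JONGSEONG KHIEUKH -> HANGUL CHOSEONG KHIEUKH
}


def skeleton(char, prototypes=PROTOTYPES):
    """Simplified skeleton: one table lookup, no normalization."""
    return prototypes.get(char, char)


def confusable_classes(prototypes=PROTOTYPES):
    """Group every source together with its prototype."""
    groups = defaultdict(set)
    for source, target in prototypes.items():
        groups[target].add(source)
        groups[target].add(target)
    return groups


def homoglyphs_of(char, prototypes=PROTOTYPES):
    """All other characters that share this character's skeleton."""
    groups = confusable_classes(prototypes)
    members = groups.get(skeleton(char, prototypes), {char})
    return members - {char}


for char in ("\u314b", "\u110f", "\u11bf"):
    found = sorted("U+%04X" % ord(c) for c in homoglyphs_of(char))
    print("U+%04X" % ord(char), found)
```

With this grouping, each of the three KHIEUKH characters reports the other two, which is the Expected Output in Case 1.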

Jan 19 '19 00:01 ariutta
