confusable_homoglyphs
Confusables for ㅋ vs. ᄏ
I'm confused as to why I'm getting different results for ㅋ vs. ᄏ. The Unicode site gives the original plus 2 additional homoglyphs for ㅋ:
ㅋ ᄏ ᆿ
But the confusable_homoglyphs package yields just one additional homoglyph initially. I only get the other one when I look for homoglyphs of that previous result:
from confusable_homoglyphs import confusables
khieukh1s = confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh1s[0]['homoglyphs']))
# >> {'ᄏ'}
khieukh2s = confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh2s[0]['homoglyphs']))
# >> {'ㅋ', 'ᆿ'}
Is this expected behavior?
(Somewhat related to this issue.)
@ariutta Sorry for the late answer. I updated the Unicode data files and released the result as 3.2.0. Could you please check whether it now behaves as expected?
Hi @vhf, thanks for checking on this, and no worries about the delay!
I tried version 3.2.0, and I think Case 1 fails but Case 2 passes.
Case 1
Input: ㅋ
(U+314B : HANGUL LETTER KHIEUKH)
Expected Output: {'ᄏ', 'ᆿ'}
- U+110F : HANGUL CHOSEONG KHIEUKH {K}
- U+11BF : HANGUL JONGSEONG KHIEUKH {K}
Actual Output: {'ᄏ'}
- U+110F : HANGUL CHOSEONG KHIEUKH {K}
Code
from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))
Case 2
Input: ᄏ
(U+110F : HANGUL CHOSEONG KHIEUKH {K})
Expected Output: {'ᆿ','ㅋ'}
- U+11BF : HANGUL JONGSEONG KHIEUKH {K}
- U+314B : HANGUL LETTER KHIEUKH
Actual Output: {'ᆿ', 'ㅋ'}
- U+11BF : HANGUL JONGSEONG KHIEUKH {K}
- U+314B : HANGUL LETTER KHIEUKH
Code
from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))
Thanks! I'll take a closer look later. For now, here's what Unicode says:
314B ; 110F ; MA # ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH #
11BF ; 110F ; MA # ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH #
I can confirm your two cases: 1 fails, 2 passes. The data files here confirm that this is correct; what might be incorrect is my interpretation of the spec: http://www.unicode.org/reports/tr39/#Confusable_Detection
From:
314B ; 110F ; MA # ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH #
11BF ; 110F ; MA # ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH #
I infer that:
- HANGUL CHOSEONG KHIEUKH can be confused with:
  - HANGUL LETTER KHIEUKH
  - HANGUL JONGSEONG KHIEUKH
- HANGUL LETTER KHIEUKH can be confused with:
  - HANGUL CHOSEONG KHIEUKH
- HANGUL JONGSEONG KHIEUKH can be confused with:
  - HANGUL CHOSEONG KHIEUKH
@ariutta Can you see the issue here? What am I missing from the spec?
I guess something is incorrect here: https://github.com/vhf/confusable_homoglyphs/blob/master/confusable_homoglyphs/cli.py#L70 but the spec, like any spec, isn't that easy to understand. :)
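One possible reading of the spec, offered as a sketch rather than a fix: UTS #39 defines two strings as confusable when they reduce to the same skeleton, so every character that shares a prototype would belong to one group, not only the pairs listed directly in the data file. The snippet below applies that reading to the two data lines quoted above. `build_groups` and `homoglyphs_of` are hypothetical helpers written for this illustration; they are not part of confusable_homoglyphs, and the snippet does not model what cli.py actually does.

```python
from collections import defaultdict

# The two confusables.txt entries quoted above: source ; prototype ; type # comment
DATA = """\
314B ; 110F ; MA # ( ㅋ → ᄏ ) HANGUL LETTER KHIEUKH → HANGUL CHOSEONG KHIEUKH #
11BF ; 110F ; MA # ( ᆿ → ᄏ ) HANGUL JONGSEONG KHIEUKH → HANGUL CHOSEONG KHIEUKH #
"""


def decode(field):
    """Turn a space-separated list of hex code points into a string."""
    return ''.join(chr(int(cp, 16)) for cp in field.split())


def build_groups(data):
    """Group every source together with its prototype (hypothetical helper)."""
    groups = defaultdict(set)
    for line in data.splitlines():
        line = line.split('#', 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        source, prototype, _kind = [f.strip() for f in line.split(';')]
        proto = decode(prototype)
        groups[proto].add(proto)           # the prototype is in its own group
        groups[proto].add(decode(source))
    return groups


def homoglyphs_of(char, groups):
    """Return all other members of the group containing char (assumed semantics)."""
    for members in groups.values():
        if char in members:
            return members - {char}
    return set()


groups = build_groups(DATA)
for ch in ('\u314b', '\u110f', '\u11bf'):
    print(ch, sorted(homoglyphs_of(ch, groups)))
```

Under this reading each of the three KHIEUKH characters reports the other two, which is the result the Unicode site shows for ㅋ.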
Some code I played with
import unittest
from pprint import pprint

from confusable_homoglyphs import confusables


# The class name is illustrative; the test methods need a TestCase to live in.
class TestKhieukh(unittest.TestCase):
    def test_confusable_with_a(self):
        HANGUL_LETTER_KHIEUKH = u'ㅋ'
        # Print the raw result, then reduce it to the set of homoglyph characters.
        pprint(confusables.is_confusable(HANGUL_LETTER_KHIEUKH, preferred_aliases=[], greedy=True))
        set(map(lambda x: x['c'], confusables.is_confusable(HANGUL_LETTER_KHIEUKH, preferred_aliases=[], greedy=True)[0]['homoglyphs']))

    def test_confusable_with_b(self):
        HANGUL_JONGSEONG_KHIEUKH = u'ᆿ'
        pprint(confusables.is_confusable(HANGUL_JONGSEONG_KHIEUKH, preferred_aliases=[], greedy=True))
        set(map(lambda x: x['c'], confusables.is_confusable(HANGUL_JONGSEONG_KHIEUKH, preferred_aliases=[], greedy=True)[0]['homoglyphs']))

    def test_confusable_with_c(self):
        # This one passes and should still pass.
        HANGUL_CHOSEONG_KHIEUKH = u'ᄏ'
        confusable_with = confusables.is_confusable(HANGUL_CHOSEONG_KHIEUKH, preferred_aliases=[], greedy=True)
        confusable_char_names = set(map(lambda x: x['n'], confusable_with[0]['homoglyphs']))
        expected = set(['HANGUL LETTER KHIEUKH', 'HANGUL JONGSEONG KHIEUKH'])
        self.assertEqual(confusable_char_names, expected)
Hi @vhf, sorry it's taken me so long to respond.
I'm not a Unicode/Korean letter expert either, but I based my expectation on the output of this unicode.org "confusables" tool: https://unicode.org/cldr/utility/confusables.jsp?a=%E3%85%8B&r=None
Does that tool correctly match the spec? I can't say for sure, but the result seems plausible at least based on the visual comparison of the characters.