langcodes Closest Match for Punjabi (Pakistan) Not Resolving Match

Open joe-sciame-wm opened this issue 2 years ago • 4 comments

I'm attempting to match a language code 'pa' with another language code 'pa-PK'.

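import unittest

from langcodes import closest_match


class ClosestMatchTest(unittest.TestCase):
    # Desired tag is less specific than the supported tag.
    def test_language_less_than(self):
        spoken_language_1 = 'pa'
        spoken_language_2 = 'pa-PK'
        match = closest_match(spoken_language_1, [spoken_language_2])
        print(match)
        self.assertEqual(0, match[1])

    # Desired tag is more specific than the supported tag.
    def test_language_more_than(self):
        spoken_language_1 = 'pa-PK'
        spoken_language_2 = 'pa'
        match = closest_match(spoken_language_1, [spoken_language_2])
        print(match)
        self.assertEqual(0, match[1])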

This prints:

('und', 1000)
('und', 1000)

I would expect this to return a match, not ('und', 1000). When I debug the library, I see the following triples, for which the tuple_distance_cached function returns 54:

desired_triple = ('pa', 'Arab', 'PK')
supported_triple = ('pa', 'Guru', 'IN')

Nov 28 '22 00:11 joe-sciame-wm

I believe the issue here is that the maximize() function resolves pa and pa-PK to different maximized tags. I'm not a linguistics expert, so I don't know whether this is correct.

'pa': 'pa-Guru-IN', 'pa-PK': 'pa-Arab-PK',
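
To make the mismatch concrete, here is a minimal sketch that uses only the two mappings quoted above. The table and both helper functions are hypothetical names for illustration, not langcodes internals; the real likely-subtags data is far larger. It only shows that the two maximized tags agree on language but differ in script and territory, which is consistent with the triples seen in the debugger.

```python
# Toy table holding only the two mappings quoted in this issue.
# Assumption: this stands in for the much larger data langcodes uses.
LIKELY_SUBTAGS = {
    "pa": "pa-Guru-IN",
    "pa-PK": "pa-Arab-PK",
}


def maximized_triple(tag):
    """Return (language, script, territory) for a tag in the toy table."""
    language, script, territory = LIKELY_SUBTAGS[tag].split("-")
    return language, script, territory


def differing_subtags(tag_a, tag_b):
    """List which of the three subtags differ after maximization."""
    names = ("language", "script", "territory")
    pairs = zip(names, maximized_triple(tag_a), maximized_triple(tag_b))
    return [name for name, a, b in pairs if a != b]


print(maximized_triple("pa"))            # ('pa', 'Guru', 'IN')
print(maximized_triple("pa-PK"))         # ('pa', 'Arab', 'PK')
print(differing_subtags("pa", "pa-PK"))  # ['script', 'territory']
```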

Nov 28 '22 01:11 joe-sciame-wm

Similar issue here.

In [4]: langcodes.get("ko").language_name()
Out[4]: 'Korean'

In [5]: langcodes.get("kor_Hang").language_name()
Out[5]: 'Korean'

In [6]: langcodes.closest_match("ko", ["kor_Hang"])
Out[6]: ('und', 1000)

Mar 08 '23 06:03 BrightXiaoHan

@BrightXiaoHan @joe-sciame-wm Thank you for the input! There is likely something to improve here. If I had to guess, I think the reason for this commit was exactly the problem you are describing: https://github.com/georgkrause/langcodes/commit/59326f8bc6f5784bee558c775035e16dd1b0ce2b

A formal note: I took over the package and am working on updating it here: https://github.com/georgkrause/langcodes Sadly I cannot move issues, so I created a new one; maybe we can continue the discussion there.

Apr 08 '24 11:04 georgkrause

I think the script tag is unnecessary when matching spoken languages. Maybe add an ignore_script argument to the closest_match function?
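
Until something like ignore_script exists, one possible workaround is to compare only the primary language subtag and skip script and territory entirely. The sketch below is plain Python and does not call langcodes; primary_language and closest_spoken_match are hypothetical helper names, not part of the library. It assumes tags are already normalized to the same language code, so it does not map three-letter codes such as kor to ko.

```python
def primary_language(tag):
    """Return the lowercased primary language subtag of a BCP 47-style tag."""
    # Accept both 'pa-PK' and 'pa_PK' separators.
    return tag.replace("_", "-").split("-")[0].lower()


def closest_spoken_match(desired, supported):
    """Return the first supported tag whose primary language subtag equals
    that of `desired`, ignoring script and territory; 'und' if none does."""
    wanted = primary_language(desired)
    for candidate in supported:
        if primary_language(candidate) == wanted:
            return candidate
    return "und"


print(closest_spoken_match("pa", ["pa-PK"]))  # pa-PK
print(closest_spoken_match("pa-PK", ["pa"]))  # pa
```

Because this ignores script completely, it would also treat written variants such as pa-Guru and pa-Arab as identical, so it only suits cases where spoken language is what matters.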

Apr 11 '24 02:04 zhu
