langcodes
Closest Match for Punjabi (Pakistan) Not Resolving Match
I'm attempting to match a language code 'pa' with another language code 'pa-PK'.
from langcodes import closest_match

# Both functions are methods of a unittest.TestCase subclass.
def test_language_less_than(self):
    spoken_language_1 = 'pa'
    spoken_language_2 = 'pa-PK'
    match = closest_match(spoken_language_1, [spoken_language_2])
    print(match)
    # match is a (tag, distance) tuple; a distance of 0 means an exact match
    self.assertEqual(0, match[1])

def test_language_more_than(self):
    spoken_language_1 = 'pa-PK'
    spoken_language_2 = 'pa'
    match = closest_match(spoken_language_1, [spoken_language_2])
    print(match)
    self.assertEqual(0, match[1])
This prints:
('und', 1000)
('und', 1000)
I would expect this to return a match rather than the no-match result ('und', 1000). When I debug the library, I see the following triples, for which the tuple_distance_cached function returns a distance of 54.
desired_triple = ('pa', 'Arab', 'PK')
supported_triple = ('pa', 'Guru', 'IN')
I believe the issue here is that the maximize() language function is resolving pa and pa-PK to different maximized languages. I'm not a linguistic expert so I don't know if this is correct or not.
'pa': 'pa-Guru-IN',
'pa-PK': 'pa-Arab-PK',
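To make the mismatch concrete, here is a minimal standalone sketch that does not call langcodes at all. It hard-codes the two expansions quoted above (the MAXIMIZED table and both helper names are assumptions made for illustration) and reports which fields of the resulting triples differ.

```python
# Expansions copied from the report above; hard-coded here, not read from langcodes.
MAXIMIZED = {"pa": "pa-Guru-IN", "pa-PK": "pa-Arab-PK"}


def to_triple(tag):
    """Return the (language, script, territory) triple for a tag in MAXIMIZED."""
    language, script, territory = MAXIMIZED[tag].split("-")
    return (language, script, territory)


def differing_fields(tag_a, tag_b):
    """List the triple fields on which the two maximized tags disagree."""
    names = ("language", "script", "territory")
    pairs = zip(names, to_triple(tag_a), to_triple(tag_b))
    return [name for name, a, b in pairs if a != b]


print(to_triple("pa-PK"))               # ('pa', 'Arab', 'PK')
print(to_triple("pa"))                  # ('pa', 'Guru', 'IN')
print(differing_fields("pa-PK", "pa"))  # ['script', 'territory']
```

The language subtag is the same in both triples; only the script and territory differ, which would be consistent with the non-zero distance seen in the debugger.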
Similar issue here.
In [4]: langcodes.get("ko").language_name()
Out[4]: 'Korean'
In [5]: langcodes.get("kor_Hang").language_name()
Out[5]: 'Korean'
In [6]: langcodes.closest_match("ko", ["kor_Hang"])
Out[6]: ('und', 1000)
@BrightXiaoHan @joe-sciame-wm Thank you for the input! There is likely something to improve here. If I had to guess, I think the reason for this commit was exactly the problem you are describing: https://github.com/georgkrause/langcodes/commit/59326f8bc6f5784bee558c775035e16dd1b0ce2b
A formal note: I took over the package and am working on updating it here: https://github.com/georgkrause/langcodes Sadly I cannot move issues, so I created a new one, and maybe we can continue the discussion there.
I think the script tag is unnecessary when matching spoken languages.
Maybe add an ignore_script argument to the closest_match function?
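A rough sketch of what script-insensitive matching could look like, written as a standalone helper rather than as a change to langcodes. Everything here is hypothetical: split_tag, closest_match_ignoring_script, and the territory penalty of 5 are made up for illustration and are not the library's API or its distance values. It only borrows the ('und', 1000) no-match result seen above. It also does no alias normalization, so 'ko' and 'kor_Hang' would still not match.

```python
NO_MATCH = ("und", 1000)   # same no-match result as in the outputs above
TERRITORY_PENALTY = 5      # made-up value, not taken from langcodes


def split_tag(tag):
    """Naively split a tag into (language, script, territory); missing parts are None."""
    parts = tag.replace("_", "-").split("-")
    language, script, territory = parts[0].lower(), None, None
    for part in parts[1:]:
        if script is None and len(part) == 4 and part.isalpha():
            script = part.title()        # four letters: treat as a script subtag
        elif territory is None and len(part) == 2 and part.isalpha():
            territory = part.upper()     # two letters: treat as a territory subtag
    return language, script, territory


def closest_match_ignoring_script(desired, supported):
    """Pick the supported tag with the same language, ignoring the script entirely."""
    d_lang, _, d_terr = split_tag(desired)
    best = NO_MATCH
    for tag in supported:
        s_lang, _, s_terr = split_tag(tag)
        if s_lang != d_lang:
            continue
        # An unspecified territory on either side is treated as compatible.
        same_place = d_terr is None or s_terr is None or d_terr == s_terr
        distance = 0 if same_place else TERRITORY_PENALTY
        if distance < best[1]:
            best = (tag, distance)
    return best


print(closest_match_ignoring_script("pa", ["pa-PK"]))   # ('pa-PK', 0)
print(closest_match_ignoring_script("pa-PK", ["pa"]))   # ('pa', 0)
```

Whether a distance of 0 is the right answer for 'pa' against 'pa-PK' is a design question for the maintainers; the sketch only shows that dropping the script from the comparison makes both test cases from the original report pass.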