'royin' engine gives wrong romanization in a lot of cases

Open bact opened this issue 5 years ago • 0 comments

Try this test set:

from pythainlp.transliterate import romanize
test_cases = {
    None: "",
    "": "",
    "หมอก": "mok",
    "หาย": "hai",
    "แมว": "maeo",
    "เดือน": "duean",
    "ดำ": "dam",
    "ดู": "du",
    "บัว": "bua",
    "กก": "kok",
    "กร": "kon",
    "กรร": "kan",
    "กรรม": "kam",
    "กรม": "krom",  # failed
    "ฝ้าย": "fai",
    "นพพร": "nopphon",
    "ทีปกร": "thipakon",  # failed
    "ธรรพ์": "than",  # failed
    "ธรรม": "tham",  # failed
    "มหา": "maha",  # failed
    "หยาก": "yak",  # failed
    "อยาก": "yak",  # failed
    "ยมก": "yamok",  # failed
    "กลัว": "klua",  # failed
    "บ้านไร่": "banrai",  # failed
    "ชารินทร์": "charin",  # failed
}
for word, expect in test_cases.items():
    actual = romanize(word, engine="royin")
    print(f"{expect == actual} - word: {word} expect: {expect} actual: {actual}")

Nearly half of them fail (11 of 26):

True - word: None expect:  actual: 
True - word:  expect:  actual: 
True - word: หมอก expect: mok actual: mok
True - word: หาย expect: hai actual: hai
True - word: แมว expect: maeo actual: maeo
True - word: เดือน expect: duean actual: duean
True - word: ดำ expect: dam actual: dam
True - word: ดู expect: du actual: du
True - word: บัว expect: bua actual: bua
True - word: กก expect: kok actual: kok
True - word: กร expect: kon actual: kon
True - word: กรร expect: kan actual: kan
True - word: กรรม expect: kam actual: kam
False - word: กรม expect: krom actual: knm
True - word: ฝ้าย expect: fai actual: fai
True - word: นพพร expect: nopphon actual: nopphon
False - word: ทีปกร expect: thipakon actual: thipkon
False - word: ธรรพ์ expect: than actual: thonrop
False - word: ธรรม expect: tham actual: thnnm
False - word: มหา expect: maha actual: ma
False - word: หยาก expect: yak actual: hyak
False - word: อยาก expect: yak actual: ak
False - word: ยมก expect: yamok actual: ymk
False - word: กลัว expect: klua actual: knua
False - word: บ้านไร่ expect: banrai actual: bannai
False - word: ชารินทร์ expect: charin actual: charinthon

This test set will be added to test_transliterate.py.


Consistency Test

# These are two-syllable words, used to test whether
# transliteration/romanization is consistent, i.e.
# romanize(1+2) == romanize(1) + romanize(2)
_CONSISTENCY_TESTS = [
    # ("กระจก", "กระ", "จก"),  # failed
    # ("ระเบิด", "ระ", "เบิด"),  # failed
    # ("หยากไย่", "หยาก", "ไย่"),  # failed
    ("ตากใบ", "ตาก", "ใบ"),
    # ("จัดสรร", "จัด", "สรร"),  # failed
]

def test_romanize_royin_consistency(self):
    for word, part1, part2 in _CONSISTENCY_TESTS:
        self.assertEqual(
            romanize(word, engine="royin"),
            (
                romanize(part1, engine="royin")
                + romanize(part2, engine="royin")
            ),
        )
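
Instead of commenting out the failing entries, the same check could report every mismatch in one pass. The following is only a sketch: `find_inconsistent` is a hypothetical helper, not part of pythainlp, and the romanizer is passed in as a plain callable so that the snippet runs without the library installed.

```python
from typing import Callable, Iterable, List, Tuple

# (whole word, first syllable, second syllable)
Triple = Tuple[str, str, str]


def find_inconsistent(
    romanize_fn: Callable[[str], str],
    triples: Iterable[Triple],
) -> List[Tuple[str, str, str]]:
    """Return (word, whole, joined) for each triple where
    romanize_fn(word) != romanize_fn(part1) + romanize_fn(part2)."""
    mismatches = []
    for word, part1, part2 in triples:
        whole = romanize_fn(word)
        joined = romanize_fn(part1) + romanize_fn(part2)
        if whole != joined:
            mismatches.append((word, whole, joined))
    return mismatches


if __name__ == "__main__":
    # Toy stand-in for a real engine (an assumption for illustration only):
    # it upper-cases inputs longer than one character, so splitting a
    # two-character word changes the result.
    def toy(text: str) -> str:
        return text.upper() if len(text) > 1 else text

    print(find_inconsistent(toy, [("abcd", "ab", "cd"), ("ab", "a", "b")]))
```

With pythainlp available, the callable could be `functools.partial(romanize, engine="royin")` and the triples could be the full list above, including the entries currently commented out.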

In general, I think we need a more systematic evaluation of different algorithms, including soundex.
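
As a starting point for that kind of evaluation, something like the sketch below could score several engines against one expected-output table. Everything here is an assumption for illustration: `score_engines` is a hypothetical name, exact string match is just one possible metric, and the engines are plain callables so the snippet does not depend on pythainlp.

```python
from typing import Callable, Dict, Mapping


def score_engines(
    engines: Mapping[str, Callable[[str], str]],
    cases: Mapping[str, str],
) -> Dict[str, float]:
    """Return, per engine, the fraction of cases whose output
    exactly equals the expected string."""
    if not cases:
        raise ValueError("cases must not be empty")
    scores = {}
    for name, fn in engines.items():
        hits = sum(1 for word, expected in cases.items() if fn(word) == expected)
        scores[name] = hits / len(cases)
    return scores


if __name__ == "__main__":
    # Stub engines and made-up cases, used only to show the call shape.
    stub_engines = {"upper": str.upper, "identity": lambda text: text}
    stub_cases = {"a": "A", "b": "B", "c": "x"}
    for name, score in score_engines(stub_engines, stub_cases).items():
        print(f"{name}: {score:.2f}")
```

For the royin engine, the entry would be a wrapper such as `lambda w: romanize(w, engine="royin")`, with `test_cases` from above as the table. Soundex algorithms would need their own expected codes, since their outputs are not romanizations.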

bact, May 27 '20 11:05