pythainlp bug: Tone detector + syllable sound bug

Description

Hello, thanks for your work. First and foremost, I am not very skilled in Thai, but I think there might be two errors in the functions mentioned in the title (sound_syllable and tone_detector):

For ประ, sound_syllable returns live, but as far as I know it is dead.
For เอ, as in the loanword วิตามินเอ, tone_detector throws an index out of range error. According to http://www.thai-language.com/id/219142 it is mid tone, so I'd guess middle-class consonant, live ending.

diff --git a/tests/core/test_util.py b/tests/core/test_util.py
index 5d674221..59c647e2 100644
--- a/tests/core/test_util.py
+++ b/tests/core/test_util.py
@@ -680,9 +680,10 @@ class UtilTestCase(unittest.TestCase):
             ("เพราะ", "dead"),
             ("เกาะ", "dead"),
             ("แคะ", "dead"),
+            ("ประ", "dead"),
         ]
         for i, j in test:
-            self.assertEqual(sound_syllable(i), j)
+            self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
 
     def test_tone_detector(self):
         data = [
@@ -710,9 +711,10 @@ class UtilTestCase(unittest.TestCase):
             ("f", "ผู้"),
             ("h", "ครับ"),
             ("f", "ค่ะ"),
+            ("m", "เอ"),  # Pronunciation of the English letter A, as in วิตามินเอ (vitamin A)
         ]
         for i, j in data:
-            self.assertEqual(tone_detector(j), i)
+            self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
 
     def test_syllable_length(self):
         self.assertEqual(syllable_length("มาก"), "long")

python -m unittest tests/core/test_util.py
....................F............E.
======================================================================
ERROR: test_tone_detector (tests.core.test_util.UtilTestCase.test_tone_detector)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pythainlp/tests/core/test_util.py", line 717, in test_tone_detector
    self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
                     ~~~~~~~~~~~~~^^^
  File "/tmp/pythainlp/pythainlp/util/syllable.py", line 241, in tone_detector
    s = sound_syllable(syllable)
  File "/tmp/pythainlp/pythainlp/util/syllable.py", line 87, in sound_syllable
    spelling_consonant = consonants[-1]
                         ~~~~~~~~~~^^^^
IndexError: list index out of range

======================================================================
FAIL: test_sound_syllable (tests.core.test_util.UtilTestCase.test_sound_syllable)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pythainlp/tests/core/test_util.py", line 686, in test_sound_syllable
    self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'live' != 'dead'
- live
+ dead
 : ประ should be determined to be a 'dead' syllable.

----------------------------------------------------------------------
Ran 35 tests in 1.704s

FAILED (failures=1, errors=1)

Expected results

ประ is determined as dead syllable
เอ is determined as mid tone

Current results

ประ is determined as live syllable
เอ throws an error while determining the tone

Steps to reproduce

Apply the provided diff with git apply and run the unit tests: python -m unittest tests/core/test_util.py
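
For a quicker check without patching the test file, the two calls can also be probed directly. This is only a sketch: probe is a hypothetical helper (not part of PyThaiNLP) that records either the return value or the exception type for each input, so one failing syllable does not hide the others.

```python
def probe(func, inputs):
    """Call func on each input and record the result or the exception type."""
    report = {}
    for item in inputs:
        try:
            report[item] = func(item)
        except Exception as exc:  # keep going so every input gets reported
            report[item] = f"raised {type(exc).__name__}"
    return report


# Usage against PyThaiNLP (requires the package to be installed):
#   from pythainlp.util import sound_syllable, tone_detector
#   print(probe(sound_syllable, ["ประ", "เอ"]))
#   print(probe(tone_detector, ["ประ", "เอ"]))
```

With the behaviour described in this report, ประ would be expected to show up as live from sound_syllable and เอ as a raised IndexError.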

PyThaiNLP version

dev

Python version

3.13.1

Operating system and version

fedora

More info

No response

Possible solution

Unfortunately, I don't know.

Files

No response

Jan 05 '25 00:01 kaiwa

Hello @kaiwa, thank you for your interest in our work!

If this is a bug report, please provide screenshots, error messages, and minimum viable code to reproduce your issue; otherwise we cannot help you.

Jan 05 '25 00:01 github-actions[bot]

If you are interested in a bunch of test cases, I have scraped the list of 1176 (1030 unique) common words from http://www.thai-language.com/ref/starred and processed them into a JSON file, split into 1478 syllables with associated tones. To cut the Thai script I used your https://github.com/PyThaiNLP/Han-solo; the tones associated with each syllable are extracted from thai-language.com.

syllables.json

[
    ...,
    {
        "word": "ประมาณ",
        "translation": "approximately; about; roughly",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "มาณ",
                "transcription": "maan",
                "tone": "M"
            }
        ]
    },
    {
        "word": "ประโยชน์",
        "translation": "benefit; use; usefulness",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "โยชน์",
                "transcription": "yo:ht",
                "tone": "L"
            }
        ]
    },
    ...
]
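
A rough sketch of how this JSON could be turned into tone test cases. It assumes the structure shown above and that the detector returns lowercase one-letter codes (the existing tests use "m", "f" and "h"; whether "L" maps to "l" in the same way is an assumption). tone_cases and mismatches are hypothetical helper names, and the detector is passed in as a callable so the sketch does not depend on a particular PyThaiNLP version.

```python
import json


def tone_cases(entries):
    """Yield (expected_tone, syllable) pairs, skipping syllables without a tone."""
    for entry in entries:
        for syl in entry["syllables"]:
            if syl["tone"]:
                # Assumption: the detector uses lowercase codes ("l", "m", ...).
                yield syl["tone"].lower(), syl["syllable"]


def mismatches(entries, detector):
    """Return (syllable, expected, actual) for every disagreement or exception."""
    failed = []
    # dict.fromkeys drops repeated pairs while keeping the original order.
    for expected, syllable in dict.fromkeys(tone_cases(entries)):
        try:
            actual = detector(syllable)
        except Exception as exc:
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            failed.append((syllable, expected, actual))
    return failed


# Shortened sample in the same shape as syllables.json.
SAMPLE = json.loads("""
[{"word": "ประมาณ",
  "syllables": [{"syllable": "ประ", "tone": "L"},
                {"syllable": "มาณ", "tone": "M"}]}]
""")

# Stand-in detector that always answers "m", only to show the report format.
print(mismatches(SAMPLE, lambda syllable: "m"))
```

To run it against the real function, the stand-in lambda would be replaced by tone_detector from pythainlp.util and SAMPLE by the content of syllables.json.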

Jan 05 '25 12:01 kaiwa

Ah, sorry for mixing issues now, but just to let you know: in the JSON there are some words which seem to be cut incorrectly by Han-solo. Look for an empty "tone": "". It affects 6 words; maybe it is worth adding them to the training data.

    {
        "word": "กรุงเทพฯ",
        "translation": "Bangkok, a province in central Thailand, having the largest provincial population, probably around 8 million (including metropolitan areas in surrounding provinces)",
        "syllables": [
            {
                "syllable": "กรุง",
                "transcription": "groong",
                "tone": "M"
            },
            {
                "syllable": "เทพ",
                "transcription": "thaehp",
                "tone": "F"
            },
            {
                "syllable": "ฯ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ธาตุ",
        "translation": "one of the four ancient elements: earth, water, air, or fire",
        "syllables": [
            {
                "syllable": "ธา",
                "transcription": "thaat",
                "tone": "F"
            },
            {
                "syllable": "ตุ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ประพฤติ",
        "translation": "to behave; to conduct oneself; to act; to perform or do",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "พฤ",
                "transcription": "phreut",
                "tone": "H"
            },
            {
                "syllable": "ติ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ประพฤติ",
        "translation": "manner; conduct; deportment; behavior",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "พฤ",
                "transcription": "phreut",
                "tone": "H"
            },
            {
                "syllable": "ติ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "พราหมณ์",
        "translation": "Brahman; an ancient religion",
        "syllables": [
            {
                "syllable": "พรา",
                "transcription": "phraam",
                "tone": "M"
            },
            {
                "syllable": "หมณ์",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ราษฎร์",
        "translation": "citizens; population; the people; the populace; the masses",
        "syllables": [
            {
                "syllable": "รา",
                "transcription": "raat",
                "tone": "F"
            },
            {
                "syllable": "ษฎร์",
                "transcription": "",
                "tone": ""
            }
        ]
    }
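
The affected entries can be found with a few lines of Python. This is a sketch under the assumption that the file has the structure shown above; words_with_empty_tone is a hypothetical helper name, and the sample below is shortened to the fields it needs.

```python
def words_with_empty_tone(entries):
    """Return the unique words (in file order) with at least one empty tone."""
    found = []
    for entry in entries:
        has_empty = any(syl["tone"] == "" for syl in entry["syllables"])
        if has_empty and entry["word"] not in found:
            found.append(entry["word"])
    return found


# Shortened sample in the same shape as syllables.json.
SAMPLE_ENTRIES = [
    {"word": "ธาตุ", "syllables": [{"syllable": "ธา", "tone": "F"},
                                   {"syllable": "ตุ", "tone": ""}]},
    {"word": "ประมาณ", "syllables": [{"syllable": "ประ", "tone": "L"},
                                     {"syllable": "มาณ", "tone": "M"}]},
    {"word": "ธาตุ", "syllables": [{"syllable": "ธา", "tone": "F"},
                                   {"syllable": "ตุ", "tone": ""}]},
]

print(words_with_empty_tone(SAMPLE_ENTRIES))

# For the full file:
#   import json
#   with open("syllables.json", encoding="utf-8") as f:
#       print(words_with_empty_tone(json.load(f)))
```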

Jan 05 '25 13:01 kaiwa

Test case for the Han-solo cutter with the failing words from above:

# tests/test_cut.py
import unittest
from featurizer import Featurizer
import pycrfsuite

class TestCutFunction(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.to_feature = Featurizer()
        cls.tagger = pycrfsuite.Tagger()
        cls.tagger.open('han_solo.crfsuite')

    def test_cut_cases(self):
        test_cases = [
            {"text": "พราหมณ์", "expected": ["พราหมณ์"]},
            {"text": "ราษฎร์", "expected": ["ราษฎร์"]},
            {"text": "ธาตุ", "expected": ["ธาตุ"]},
            {"text": "ประพฤติ", "expected": ["ประ", "พฤติ"]},
            {"text": "กรุงเทพฯ", "expected": ["กรุง", "เทพฯ"]},
        ]

        for case in test_cases:
            with self.subTest(text=case["text"]):
                text = case["text"]
                x = self.to_feature.featurize(text)["X"]
                y_pred = self.tagger.tag(x)

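                # Tag "1" marks the first character of a new syllable.
                list_cut = []
                for j, k in zip(text, y_pred):
                    # Also start a syllable if the first tag is not "1",
                    # otherwise list_cut[-1] would raise IndexError.
                    if k == "1" or not list_cut:
                        list_cut.append(j)
                    else:
                        list_cut[-1] += j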

                self.assertEqual(list_cut, case["expected"])


if __name__ == "__main__":
    unittest.main()
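
The merging step in the test can be checked on its own, without the model file. This is a sketch that assumes, as in the test above, that the tag "1" marks the first character of a syllable; tags_to_syllables is a hypothetical helper name, not part of Han-solo.

```python
def tags_to_syllables(text, tags):
    """Merge characters into syllables; tag "1" starts a new syllable."""
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")
    syllables = []
    for char, tag in zip(text, tags):
        # The first character always opens a syllable, whatever its tag is.
        if tag == "1" or not syllables:
            syllables.append(char)
        else:
            syllables[-1] += char
    return syllables


# ประพฤติ has 7 code points; the tags below encode the expected cut ประ|พฤติ.
print(tags_to_syllables("ประพฤติ", ["1", "0", "0", "1", "0", "0", "0"]))
```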

I was able to tweak the model to pass the test cases by adding the following entries to han_solo_train.txt, but I have absolutely no idea what I am doing, so I am not creating a PR for that.

กรุง|เทพฯ
ธาตุ
ประ|พฤติ
พราหมณ์
ราษฎร์
ปลา|พราหมณ์
พราห|ม|ณี
แพศย์
วัน|พ|ฤ|หัสฯ
ฯลฯ
ต|ลาดฯ
เข้าเ|ฝ้าฯ
ค|ณะ|ป|ฏิ|รูปฯ
คอมฯ
โค|วิดฯ
จุ|ฬาฯ
เซ|เว่นฯ
นา|ยกฯ
จันทร์
ชัวร์
เบอร์
วัน|จันทร์
วัน|ศุกร์ 
วัน|เสาร์
ศุกร์
เสาร์
ญาติ
บัญ|ญัติ
ป|ฏิ|บัติ
ปรก|ติ
สม|บัติ
สม|มุติ
หลัก|ความ|ประ|พฤติ
กา|มา|รมณ์ 
เกิด|อา|รมณ์
ข่ม|อา|รมณ์
เจต|นา|รมณ์
เจ้า|อา|รมณ์
ม|หา|ภิ|เนษ|กรมณ์
อา|รมณ์
บ|ริ|บูรณ์
กระ|ษาปณ์
กฤษณ์
การณ์
ฐาน|เสียง|ใน|ส|ภา|ผู้|แทน|ราษฎร์
ทวย|ราษฎร์
ประ|ชา|ราษฎร์
ผู้|พิ|ทัก|ษ์สัน|ติ|ราษฎร์
รา|ษฎร์
โรง|เรียน|ราษฎร์
ส|มา|ชิก|ส|ภา|ผู้|แทน|ราษฎร์
การ|ประ|พฤติ
กำ|ไล|คุม|ประ|พฤติ
ความ|ประ|พฤติ
พราหมณ์|และผี
พุทธ|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์

Jan 05 '25 14:01 kaiwa

Yes, Han-solo is not perfect, and neither are other Thai syllable segmenters. I suggest running word segmentation before passing the text to the syllable segmenter. Today, word-level and subword-level units are the standard in Thai NLP. We use syllable segmentation infrequently, and grapheme-to-phoneme conversion no longer needs it. Syllable segmentation does not come up often in general Thai NLP.
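
A minimal sketch of that order of operations. Both segmenters are passed in as callables, since which functions to plug in is an assumption here (in PyThaiNLP they might be pythainlp.tokenize.word_tokenize and a syllable segmenter such as Han-solo); syllables_by_word is a hypothetical helper name.

```python
def syllables_by_word(text, word_cut, syllable_cut):
    """Cut text into words first, then cut each word into syllables."""
    result = []
    for word in word_cut(text):
        if not word.strip():
            continue  # skip whitespace-only tokens
        result.append((word, syllable_cut(word)))
    return result


# Stand-in segmenters, only to show the data flow.
def fake_words(text):
    return ["abcd", " ", "ef"]


def fake_syllables(word):
    return [word[i:i + 2] for i in range(0, len(word), 2)]


print(syllables_by_word("abcd ef", fake_words, fake_syllables))
```

Because the syllable segmenter only ever sees one word at a time, a wrong cut cannot cross a word boundary.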

Jan 05 '25 17:01 wannaphong

Many Thai words are formed by combining basic words (คำมูล), and Thai is an isolating language.

Example: น้ำหวาน (syrup) = น้ำ (water) + หวาน (sweet)

วันจันทร์ (Monday) = วัน (day) + จันทร์ (Monday or moon)

Our Thai dictionary contains all of these words, including the combined ones, so our word segmentation does not split text down to basic words only.

Jan 05 '25 17:01 wannaphong