bug: Tone detector + syllable sound bug
Description
Hello, thanks for your work. First and foremost, I am not very skilled in Thai, but I think there might be two errors in the functions mentioned above:
- For ประ, sound_syllable is returning "live", but afaik it is "dead".
- For เอ, as in the loanword วิตามินเอ, an out-of-range error is thrown in tone_detector. According to http://www.thai-language.com/id/219142 it would be mid tone, so I'd guess middle class consonant, live ending.
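The reasoning behind both expected values follows the usual Thai tone rules for syllables without a tone mark. The sketch below encodes only that rule table. `tone_without_mark` and its string arguments are hypothetical names for illustration, not PyThaiNLP's API, and the consonant class and live/dead values are supplied by hand rather than detected.

```python
# Tone rules for a Thai syllable that carries no tone mark.
# Keys: (initial consonant class, live/dead ending).
# Low-class dead syllables are handled separately because the
# vowel length decides their tone.
TONE_TABLE = {
    ("mid", "live"): "m",
    ("mid", "dead"): "l",
    ("high", "live"): "r",
    ("high", "dead"): "l",
    ("low", "live"): "m",
}


def tone_without_mark(consonant_class: str, sound: str, length: str = "long") -> str:
    """Return one of 'l', 'm', 'h', 'r', 'f' for an unmarked syllable."""
    if consonant_class == "low" and sound == "dead":
        return "h" if length == "short" else "f"
    return TONE_TABLE[(consonant_class, sound)]


# เอ: initial อ is mid class, long vowel, no final consonant -> live
print(tone_without_mark("mid", "live"))  # -> m
# ประ: initial ป is mid class, short vowel ะ, no final consonant -> dead
print(tone_without_mark("mid", "dead", "short"))  # -> l
```

Under these rules, classifying ประ as live would yield a mid tone instead of the low tone listed on thai-language.com, which is why the live/dead result matters for tone detection.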
diff --git a/tests/core/test_util.py b/tests/core/test_util.py
index 5d674221..59c647e2 100644
--- a/tests/core/test_util.py
+++ b/tests/core/test_util.py
@@ -680,9 +680,10 @@ class UtilTestCase(unittest.TestCase):
             ("เพราะ", "dead"),
             ("เกาะ", "dead"),
             ("แคะ", "dead"),
+            ("ประ", "dead"),
         ]
         for i, j in test:
-            self.assertEqual(sound_syllable(i), j)
+            self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
 
     def test_tone_detector(self):
         data = [
@@ -710,9 +711,10 @@ class UtilTestCase(unittest.TestCase):
             ("f", "ผู้"),
             ("h", "ครับ"),
             ("f", "ค่ะ"),
+            ("m", "เอ"),  # Pronunciation of the English letter A, as in วิตามินเอ (vitamin A)
         ]
         for i, j in data:
-            self.assertEqual(tone_detector(j), i)
+            self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
 
     def test_syllable_length(self):
         self.assertEqual(syllable_length("มาก"), "long")
python -m unittest tests/core/test_util.py
....................F............E.
======================================================================
ERROR: test_tone_detector (tests.core.test_util.UtilTestCase.test_tone_detector)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/pythainlp/tests/core/test_util.py", line 717, in test_tone_detector
self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
~~~~~~~~~~~~~^^^
File "/tmp/pythainlp/pythainlp/util/syllable.py", line 241, in tone_detector
s = sound_syllable(syllable)
File "/tmp/pythainlp/pythainlp/util/syllable.py", line 87, in sound_syllable
spelling_consonant = consonants[-1]
~~~~~~~~~~^^^^
IndexError: list index out of range
======================================================================
FAIL: test_sound_syllable (tests.core.test_util.UtilTestCase.test_sound_syllable)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/pythainlp/tests/core/test_util.py", line 686, in test_sound_syllable
self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'live' != 'dead'
- live
+ dead
: ประ should be determined to be a 'dead' syllable.
----------------------------------------------------------------------
Ran 35 tests in 1.704s
FAILED (failures=1, errors=1)
Expected results
- ประ is determined as dead syllable
- เอ is determined as mid tone
Current results
- ประ is determined as live syllable
- เอ throws an error while determining the tone
Steps to reproduce
Apply the provided diff with git apply, then run the unit tests: python -m unittest tests/core/test_util.py
PyThaiNLP version
dev
Python version
3.13.1
Operating system and version
Fedora
More info
No response
Possible solution
Unfortunately, I don't know.
Files
No response
Hello @kaiwa, thank you for your interest in our work!
If this is a bug report, please provide screenshots, error messages, and minimum viable code to reproduce your issue; otherwise we cannot help you.
If you are interested in a bunch of test cases: I have scraped the list of 1176 (1030 unique) common words from http://www.thai-language.com/ref/starred and processed them into a JSON file, split into 1478 syllables with associated tones. To cut the Thai script I used your https://github.com/PyThaiNLP/Han-solo; the tone associated with each syllable is extracted from thai-language.com.
[
...,
{
"word": "ประมาณ",
"translation": "approximately; about; roughly",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "มาณ",
"transcription": "maan",
"tone": "M"
}
]
},
{
"word": "ประโยชน์",
"translation": "benefit; use; usefulness",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "โยชน์",
"transcription": "yo:ht",
"tone": "L"
}
]
},
...
]
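As a rough sketch of how this JSON could feed the existing tests, the snippet below flattens word entries into unique (tone, syllable) pairs in the same order as the tuples in test_tone_detector. The helper name `to_test_cases` and the inline sample are made up for illustration, and lowercasing the tone letters is an assumption based on the "m", "f", "h" values in the test data.

```python
import json

# Inline sample in the same shape as the JSON excerpt above.
SAMPLE = """
[
  {"word": "ประมาณ",
   "translation": "approximately; about; roughly",
   "syllables": [
     {"syllable": "ประ", "transcription": "bpra", "tone": "L"},
     {"syllable": "มาณ", "transcription": "maan", "tone": "M"}
   ]}
]
"""


def to_test_cases(entries):
    """Flatten word entries into unique (tone, syllable) pairs."""
    cases = []
    seen = set()
    for entry in entries:
        for syl in entry["syllables"]:
            if not syl["tone"]:
                continue  # skip syllables without an extracted tone
            # Lowercasing is an assumption to match the test tuples.
            case = (syl["tone"].lower(), syl["syllable"])
            if case not in seen:
                seen.add(case)
                cases.append(case)
    return cases


cases = to_test_cases(json.loads(SAMPLE))
print(cases)  # -> [('l', 'ประ'), ('m', 'มาณ')]
```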
Sorry for mixing issues here, but just to let you know: in the JSON there are some words that seem to be cut incorrectly by Han-solo. Look for an empty tone ("tone": ""). It affects 6 words; it may be worth adding them to the training data.
{
"word": "กรุงเทพฯ",
"translation": "Bangkok, a province in central Thailand, having the largest provincial population, probably around 8 million (including metropolitan areas in surrounding provinces)",
"syllables": [
{
"syllable": "กรุง",
"transcription": "groong",
"tone": "M"
},
{
"syllable": "เทพ",
"transcription": "thaehp",
"tone": "F"
},
{
"syllable": "ฯ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ธาตุ",
"translation": "one of the four ancient elements: earth, water, air, or fire",
"syllables": [
{
"syllable": "ธา",
"transcription": "thaat",
"tone": "F"
},
{
"syllable": "ตุ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ประพฤติ",
"translation": "to behave; to conduct oneself; to act; to perform or do",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "พฤ",
"transcription": "phreut",
"tone": "H"
},
{
"syllable": "ติ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ประพฤติ",
"translation": "manner; conduct; deportment; behavior",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "พฤ",
"transcription": "phreut",
"tone": "H"
},
{
"syllable": "ติ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "พราหมณ์",
"translation": "Brahman; an ancient religion",
"syllables": [
{
"syllable": "พรา",
"transcription": "phraam",
"tone": "M"
},
{
"syllable": "หมณ์",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ราษฎร์",
"translation": "citizens; population; the people; the populace; the masses",
"syllables": [
{
"syllable": "รา",
"transcription": "raat",
"tone": "F"
},
{
"syllable": "ษฎร์",
"transcription": "",
"tone": ""
}
]
}
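A minimal sketch of how such entries can be located, assuming the JSON shape shown above. `words_with_missing_tone` is a hypothetical helper, and the two entries are copied by hand with the translation field omitted.

```python
def words_with_missing_tone(entries):
    """Return each word (once) that has a syllable with an empty tone."""
    affected = []
    for entry in entries:
        has_gap = any(syl["tone"] == "" for syl in entry["syllables"])
        if has_gap and entry["word"] not in affected:
            affected.append(entry["word"])
    return affected


entries = [
    {"word": "ธาตุ", "syllables": [
        {"syllable": "ธา", "transcription": "thaat", "tone": "F"},
        {"syllable": "ตุ", "transcription": "", "tone": ""},
    ]},
    {"word": "ประมาณ", "syllables": [
        {"syllable": "ประ", "transcription": "bpra", "tone": "L"},
        {"syllable": "มาณ", "transcription": "maan", "tone": "M"},
    ]},
]
print(words_with_missing_tone(entries))  # -> ['ธาตุ']
```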
Test case for the Han-solo cutter with the failed words from above:
# tests/test_cut.py
import unittest
from featurizer import Featurizer
import pycrfsuite


class TestCutFunction(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.to_feature = Featurizer()
        cls.tagger = pycrfsuite.Tagger()
        cls.tagger.open('han_solo.crfsuite')

    def test_cut_cases(self):
        test_cases = [
            {"text": "พราหมณ์", "expected": ["พราหมณ์"]},
            {"text": "ราษฎร์", "expected": ["ราษฎร์"]},
            {"text": "ธาตุ", "expected": ["ธาตุ"]},
            {"text": "ประพฤติ", "expected": ["ประ", "พฤติ"]},
            {"text": "กรุงเทพฯ", "expected": ["กรุง", "เทพฯ"]},
        ]
        for case in test_cases:
            with self.subTest(text=case["text"]):
                text = case["text"]
                x = self.to_feature.featurize(text)["X"]
                y_pred = self.tagger.tag(x)
                list_cut = []
                for j, k in zip(text, y_pred):
                    if k == "1":
                        list_cut.append(j)
                    else:
                        list_cut[-1] += j
                self.assertEqual(list_cut, case["expected"])


if __name__ == "__main__":
    unittest.main()
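The tag-to-segment loop in the test can also be factored into a small standalone helper, sketched here without the tagger. `tags_to_segments` is a hypothetical name, and the tag list is written by hand, not real model output. Unlike the inline loop, this version does not raise IndexError when the first tag is not "1".

```python
def tags_to_segments(text, tags):
    """Group characters into segments; a "1" tag starts a new segment."""
    segments = []
    for char, tag in zip(text, tags):
        if tag == "1" or not segments:
            segments.append(char)
        else:
            segments[-1] += char
    return segments


text = "ประพฤติ"
# Hand-written tags for illustration, one per code point.
tags = ["1", "0", "0", "1", "0", "0", "0"]
print(tags_to_segments(text, tags))  # -> ['ประ', 'พฤติ']
```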
I was able to tweak the model to pass the test cases by throwing a bunch of entries into han_solo_train.txt, but I have absolutely no idea what I am doing, so I am not creating a PR for that. These are the lines I added:
กรุง|เทพฯ
ธาตุ
ประ|พฤติ
พราหมณ์
ราษฎร์
ปลา|พราหมณ์
พราห|ม|ณี
แพศย์
วัน|พ|ฤ|หัสฯ
ฯลฯ
ต|ลาดฯ
เข้าเ|ฝ้าฯ
ค|ณะ|ป|ฏิ|รูปฯ
คอมฯ
โค|วิดฯ
จุ|ฬาฯ
เซ|เว่นฯ
นา|ยกฯ
จันทร์
ชัวร์
เบอร์
วัน|จันทร์
วัน|ศุกร์
วัน|เสาร์
ศุกร์
เสาร์
ญาติ
บัญ|ญัติ
ป|ฏิ|บัติ
ปรก|ติ
สม|บัติ
สม|มุติ
หลัก|ความ|ประ|พฤติ
กา|มา|รมณ์
เกิด|อา|รมณ์
ข่ม|อา|รมณ์
เจต|นา|รมณ์
เจ้า|อา|รมณ์
ม|หา|ภิ|เนษ|กรมณ์
อา|รมณ์
บ|ริ|บูรณ์
กระ|ษาปณ์
กฤษณ์
การณ์
ฐาน|เสียง|ใน|ส|ภา|ผู้|แทน|ราษฎร์
ทวย|ราษฎร์
ประ|ชา|ราษฎร์
ผู้|พิ|ทัก|ษ์สัน|ติ|ราษฎร์
รา|ษฎร์
โรง|เรียน|ราษฎร์
ส|มา|ชิก|ส|ภา|ผู้|แทน|ราษฎร์
การ|ประ|พฤติ
กำ|ไล|คุม|ประ|พฤติ
ความ|ประ|พฤติ
พราหมณ์|และผี
พุทธ|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์
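For reference, a pipe-delimited line like the ones above can be turned into per-character labels roughly as follows. This is only a sketch: `line_to_tags` is a hypothetical helper, and the "0" continuation label is an assumption (the test code only checks for "1" at a syllable start). How the actual Han-solo training script reads han_solo_train.txt may differ.

```python
def line_to_tags(line, sep="|"):
    """Convert 'กรุง|เทพฯ' into (text, tags) with '1' at each syllable start."""
    text = ""
    tags = []
    for syllable in line.strip().split(sep):
        if not syllable:
            continue  # ignore empty pieces such as a trailing separator
        text += syllable
        # First code point starts the syllable, the rest continue it.
        tags.extend(["1"] + ["0"] * (len(syllable) - 1))
    return text, tags


text, tags = line_to_tags("กรุง|เทพฯ")
print(text, "".join(tags))  # -> กรุงเทพฯ 10001000
```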
Yes, Han-solo is not perfect, and neither are other Thai syllable segmenters. I suggest running word segmentation before passing the text to the syllable segmenter. Today, word-level and subword-level tokenization are the standard in Thai NLP. Syllable segmentation is used infrequently, and grapheme-to-phoneme conversion no longer needs it. Use cases for syllable segmentation are rare in general Thai NLP.
Many Thai words are compounds formed by combining basic words (คำมูล), and Thai is an isolating language.
Example: น้ำหวาน (syrup) = น้ำ (water) + หวาน (sweet)
วันจันทร์ (Monday) = วัน (day) + จันทร์ (Monday or moon)
Our Thai dictionary collects all words, including compounds, so our word segmentation does not split text into basic words only.
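To illustrate the effect of having compounds in the dictionary, here is a toy greedy longest-match segmenter. It is not PyThaiNLP's algorithm, and the two tiny dictionaries are made up from the examples above.

```python
def longest_match(text, dictionary):
    """Greedy left-to-right segmentation preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])
            i += 1
    return words


basic_only = {"น้ำ", "หวาน", "วัน", "จันทร์"}
with_compounds = basic_only | {"น้ำหวาน", "วันจันทร์"}

print(longest_match("น้ำหวานวันจันทร์", basic_only))
# -> ['น้ำ', 'หวาน', 'วัน', 'จันทร์']
print(longest_match("น้ำหวานวันจันทร์", with_compounds))
# -> ['น้ำหวาน', 'วันจันทร์']
```

With compounds in the dictionary the output stays at the word level, so a syllable segmenter applied afterwards only has to split within each word.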