Wrong ordering from collate()
Description
pythainlp.util.collate() returns the wrong ordering,
because the current implementation ignores tone marks and symbols when sorting.
To reproduce, run this code:
from pythainlp.util import collate
collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"])
Expected results
Ordering according to Thai dictionary
['กวย', 'ก่วย', 'ก่วย', 'ก้วย', 'ก้วย', 'ก๊วย', 'ก๋วย']
Current results
['ก้วย', 'ก๋วย', 'ก่วย', 'ก้วย', 'ก่วย', 'ก๊วย', 'กวย']
Your environment
- PyThaiNLP version: 2.3.1
Files
pythainlp/util/collate.py
Proposed test case
import unittest

from pythainlp.util import collate


class TestUtilPackage(unittest.TestCase):
    # ### pythainlp.util.collate
    def test_collate(self):
        # The result must not depend on the order of the input
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "กวย", "ก่วย", "ก๊วย"]),
            collate(["ก๋วย", "ก่วย", "ก้วย", "ก๊วย", "กวย"]),
        )  # should guarantee same order
        # The result must follow Thai dictionary order, duplicates included
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]),
            ["กวย", "ก่วย", "ก่วย", "ก้วย", "ก้วย", "ก๊วย", "ก๋วย"],
        )
Added notes on this to collate()'s docstring https://github.com/PyThaiNLP/pythainlp/commit/bc8223a6017f2d1a8a26a60f7f472a4ceeaa9a29
One option is to implement libthai's thcoll: https://github.com/tlwg/libthai/tree/master/src/thcoll
See character weight table at https://github.com/tlwg/libthai/blob/master/src/thcoll/cweight.c
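To make the expected behaviour concrete, here is a minimal sketch of a two-level sort key. It is not libthai's thcoll algorithm and not the current PyThaiNLP code. The name thai_sort_key and the choice of U+0E47 to U+0E4C as the "secondary" marks are assumptions made for illustration. The idea is to compare consonants and vowels first (with a leading vowel moved behind its consonant) and to use tone marks only as a tie-breaker, instead of discarding them.

```python
import re

# Assumption for this sketch: U+0E47..U+0E4C (maitaikhu, the four tone
# marks, thanthakhat) are treated as secondary-level marks.
_SECONDARY_MARKS = {chr(cp) for cp in range(0x0E47, 0x0E4D)}

# A leading vowel (U+0E40..U+0E44) followed by a consonant (U+0E01..U+0E2E).
_LEADING_VOWEL = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word):
    """Hypothetical two-level key: (base letters, tone marks)."""
    # Move each leading vowel behind the consonant it precedes,
    # so the consonant decides the ordering first.
    swapped = _LEADING_VOWEL.sub(r"\2\1", word)
    # Primary level: the word without secondary marks.
    primary = "".join(ch for ch in swapped if ch not in _SECONDARY_MARKS)
    # Secondary level: the marks themselves, used only to break ties.
    # Simplification: the position of each mark in the word is ignored.
    secondary = "".join(ch for ch in swapped if ch in _SECONDARY_MARKS)
    return (primary, secondary)


words = ["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]
print(sorted(words, key=thai_sort_key))
```

With this key the seven words above come out in the order listed under "Expected results". A real implementation would still need per-character weights, as in the cweight.c table, to handle symbols and mark positions properly.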
Can I assign myself to this task? If yes, are there any rules I have to follow before opening a pull request, e.g. code style?
Thank you. Here is the checklist for pull requests:
- Write code in the PEP8 code style. We run a PEP8 checker on every pull request. You can use pycodestyle.
- Make sure the unit tests for your function pass.
- If you create a new function, please add documentation and unit tests.
If you have a question, you can contact me directly on Facebook: https://www.facebook.com/tontanwannaphong/