Wrong ordering from collate()

Open bact opened this issue 4 years ago • 4 comments

Description

pythainlp.util.collate() returns a wrong ordering, because the current implementation ignores tone marks and symbols when sorting.

Try this code:

from pythainlp.util import collate

collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"])

Expected results

Ordering that follows Thai dictionary order:

['กวย', 'ก่วย', 'ก่วย', 'ก้วย', 'ก้วย', 'ก๊วย', 'ก๋วย']

Current results

['ก้วย', 'ก๋วย', 'ก่วย', 'ก้วย', 'ก่วย', 'ก๊วย', 'กวย']

Your environment

  • PyThaiNLP version: 2.3.1

Files

pythainlp/util/collate.py

Proposed test case

import unittest

from pythainlp.util import collate


class TestUtilPackage(unittest.TestCase):

    # ### pythainlp.util.collate

    def test_collate(self):
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "กวย", "ก่วย", "ก๊วย"]),
            collate(["ก๋วย", "ก่วย", "ก้วย", "ก๊วย", "กวย"]),
        )  # should guarantee same order
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]),
            ["กวย", "ก่วย", "ก่วย", "ก้วย", "ก้วย", "ก๊วย", "ก๋วย"],
        )

bact avatar May 16 '21 09:05 bact

Added a note about this to the collate() docstring: https://github.com/PyThaiNLP/pythainlp/commit/bc8223a6017f2d1a8a26a60f7f472a4ceeaa9a29

bact avatar May 16 '21 10:05 bact

We could try implementing libthai's thcoll: https://github.com/tlwg/libthai/tree/master/src/thcoll

See the character weight table at https://github.com/tlwg/libthai/blob/master/src/thcoll/cweight.c
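
To make the idea concrete, here is a minimal sketch of a two-level sort key in plain Python. It is not a port of thcoll and does not use its weight table. It assumes that comparing the word without tone marks first, and the tone marks second, is enough for cases like the one reported here. The names thai_sort_key and collate_sketch are hypothetical, and the character ranges are assumptions taken from the Unicode Thai block.

```python
import re

# Assumption: U+0E47..U+0E4C covers mai taikhu, the four tone marks and
# thanthakhat, i.e. the marks that should only matter as a tie-breaker.
_TONE_MARKS = re.compile("[\u0e47-\u0e4c]")

# Assumption: a leading vowel (U+0E40..U+0E44) is ordered after the
# consonant (U+0E01..U+0E2E) that follows it, as in Thai dictionaries.
_LEADING_VOWEL = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word):
    """Return a (base, marks) tuple usable as a sort key (hypothetical helper)."""
    # Level 1: the word with tone marks removed and leading vowels swapped.
    base = _TONE_MARKS.sub("", word)
    base = _LEADING_VOWEL.sub(r"\2\1", base)
    # Level 2: the tone marks alone, in order of appearance.
    marks = "".join(_TONE_MARKS.findall(word))
    return (base, marks)


def collate_sketch(words, reverse=False):
    """Sort words with the two-level key above (illustrative only)."""
    return sorted(words, key=thai_sort_key, reverse=reverse)


if __name__ == "__main__":
    print(collate_sketch(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]))
```

For the input in this issue, the sketch returns the order listed under "Expected results". The second level works only because the tone marks happen to be in dictionary order by code point. A real implementation would likely need explicit weights, as thcoll uses.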

bact avatar May 16 '21 16:05 bact

Can I assign myself to this task? If yes, are there any rules I have to follow before opening a pull request, e.g. code style?

sahussawud avatar Dec 01 '21 13:12 sahussawud

Can I assign myself to this task? If yes, are there any rules I have to follow before opening a pull request, e.g. code style?

Thank you. Here are the rules for pull requests:

  • Follow the PEP 8 code style. A PEP 8 checker runs on every pull request. You can check your code with pycodestyle.
  • Make sure the unit tests for your function pass.
  • If you add a new function, please also add documentation and unit tests for it.

If you have a question, you can contact me directly on Facebook: https://www.facebook.com/tontanwannaphong/

wannaphong avatar Dec 01 '21 15:12 wannaphong