Arthit Suriyawongkul

Results 362 comments of Arthit Suriyawongkul

To generalize, the current normalize() cannot handle (some of) these cases: (1) tone mark repetition; (2) spaces between a consonant and a tone mark. Correct?

#389 should fix case (1), tone mark repetition. For (2), the space issue, it is possible to have this kind of text: "มีรูปภาพ ุ่มากในห้อง". A normalization will create: "มีรูปภาพุ่มากในห้อง"...

Tone mark repetition is now covered, but spaces between a consonant and a tone mark are not yet.
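For illustration only, case (1) can be sketched with a single regular expression. This is not the code from #389 or PyThaiNLP's actual normalize(); the function name `collapse_repeated_tone_marks` is hypothetical, and the sketch assumes "tone marks" means the four characters U+0E48 to U+0E4B. It deliberately leaves spaces alone, since removing them can merge unrelated words.

```python
import re

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..U+0E4B).
# Match one tone mark followed by one or more copies of the same mark.
_REPEATED_TONE_MARK = re.compile(r"([\u0E48-\u0E4B])\1+")


def collapse_repeated_tone_marks(text: str) -> str:
    """Replace a run of identical tone marks with a single mark.

    Hypothetical helper for illustration; spaces are not modified.
    """
    return _REPEATED_TONE_MARK.sub(r"\1", text)


if __name__ == "__main__":
    # "เก" + mai ek typed twice + "ง"
    doubled = "\u0E40\u0E01\u0E48\u0E48\u0E07"
    print(collapse_repeated_tone_marks(doubled))
```

Only consecutive copies of the same mark are collapsed; two different tone marks in a row are left for a separate rule to decide.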

Pickle is fine, just be careful: - if the data is downloaded from the network, check the signature before unpickling it - if the data is local, set a...
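A minimal sketch of the "check the signature before unpickling" advice for downloaded data. It assumes the signature is an HMAC-SHA256 hex digest computed with a key shared out of band; the comment does not say which scheme is meant, and the names `sign` and `load_verified` are hypothetical.

```python
import hashlib
import hmac
import pickle


def sign(payload: bytes, key: bytes) -> str:
    """Return the HMAC-SHA256 hex digest of the raw pickle bytes."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def load_verified(payload: bytes, signature: str, key: bytes):
    """Unpickle only if the signature matches; otherwise raise ValueError."""
    expected = sign(payload, key)
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)


if __name__ == "__main__":
    key = b"example-shared-key"  # assumption: distributed separately from the data
    payload = pickle.dumps({"words": ["a", "b"]})
    signature = sign(payload, key)
    print(load_verified(payload, signature, key))
```

The important point is the order: the bytes are verified first, and pickle.loads() is never called on data that fails the check.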

I tried what @cstorm125 proposed: use _ (underscore), use plurals, and append the suffix _th if the data is in Thai.

The 1st one is actually bad. The 2nd and 3rd can be post-processed to combine digits and symbols into a larger token (with some semantic meaning).
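One way such a post-processing step might look, as a rough sketch: adjacent tokens made up only of digits and a small set of number-related symbols are joined into one token. The symbol set and the function name `merge_numeric_tokens` are assumptions for illustration, not part of any tokenizer discussed in the thread.

```python
# Symbols that commonly appear inside numbers, times, dates and percentages.
# This set is an assumption; adjust it for the data at hand.
_NUMERIC_SYMBOLS = set(".,:/-%")


def _is_numeric_piece(token: str) -> bool:
    """True if the token is non-empty and made only of digits and symbols."""
    return bool(token) and all(
        ch.isdigit() or ch in _NUMERIC_SYMBOLS for ch in token
    )


def merge_numeric_tokens(tokens: list[str]) -> list[str]:
    """Join runs of adjacent digit/symbol tokens into a single token."""
    merged: list[str] = []
    for token in tokens:
        if merged and _is_numeric_piece(token) and _is_numeric_piece(merged[-1]):
            merged[-1] += token
        else:
            merged.append(token)
    return merged


if __name__ == "__main__":
    print(merge_numeric_tokens(["ราคา", "19", ".", "50", "บาท"]))
```

This is intentionally naive: a sentence-final "." followed by a number would also be merged, so a real implementation would need extra context checks.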

Added notes on this to collate()'s docstring: https://github.com/PyThaiNLP/pythainlp/commit/bc8223a6017f2d1a8a26a60f7f472a4ceeaa9a29

May try to implement libthai's thcoll: https://github.com/tlwg/libthai/tree/master/src/thcoll. See the character weight table at https://github.com/tlwg/libthai/blob/master/src/thcoll/cweight.c
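For context, here is a much simpler approximation of Thai dictionary ordering than a weight-table approach. It is not libthai's thcoll algorithm and not necessarily what collate() does: it only moves a leading vowel (U+0E40 to U+0E44) after the consonant that follows it and ignores tone marks at the first comparison level. The name `thai_sort_key` is hypothetical.

```python
import re

# A leading vowel (เ แ โ ใ ไ) is written before its consonant (ก..ฮ),
# but dictionary order compares the consonant first.
_LEADING_VOWEL = re.compile(r"([\u0E40-\u0E44])([\u0E01-\u0E2E])")
# Tone marks (U+0E48..U+0E4B) are ignored at the primary level.
_TONE_MARK = re.compile(r"[\u0E48-\u0E4B]")


def thai_sort_key(word: str) -> tuple[str, str]:
    """Build a two-level sort key: tone-free form first, full form as tie-break."""
    swapped = _LEADING_VOWEL.sub(r"\2\1", word)
    primary = _TONE_MARK.sub("", swapped)
    return (primary, swapped)


if __name__ == "__main__":
    words = ["ไก่", "เกม", "ขา", "กา"]
    print(sorted(words))                      # plain code point order
    print(sorted(words, key=thai_sort_key))   # approximate dictionary order
```

Plain code point sorting puts every word that starts with a leading vowel after all consonant-initial words; the key above groups them under their initial consonant instead. A full implementation would still need per-character weights for the remaining cases.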

Is there any standard convention on how to put particular information into the notes column? I have been putting language codes at the end of the notes for a while, like...

Or we can have "known_active_on" and "human_check_on" fields to record the last time a human maintainer saw the site online and meant to be the site...