Arthit Suriyawongkul
To generalize, the current normalize() cannot handle (some of) these cases: (1) tone mark repetition; (2) spaces between a consonant and a tone mark. Correct?
#389 should fix case (1), tone mark repetition. For (2), the space issue, it is possible to have this kind of text: "มีรูปภาพ ุ่มากในห้อง". A normalization will create: "มีรูปภาพุ่มากในห้อง"...
Tone mark repetition is now covered, but spaces between a consonant and a tone mark are not yet.
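To illustrate case (1) only, here is a minimal sketch of collapsing repeated tone marks. The helper name `collapse_repeated_tone_marks` is hypothetical and this is not PyThaiNLP's actual normalize() implementation; it assumes that "repetition" means a run of the same tone mark (U+0E48 to U+0E4B). Case (2) is deliberately left out, since removing spaces can change legitimate text.

```python
import re

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..U+0E4B).
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

# Matches one tone mark followed by one or more copies of the same mark.
_REPEATED_TONE = re.compile(f"([{TONE_MARKS}])\\1+")


def collapse_repeated_tone_marks(text: str) -> str:
    """Hypothetical helper: keep a single tone mark where the same mark repeats."""
    return _REPEATED_TONE.sub(r"\1", text)
```

A sequence of different tone marks is left untouched by this sketch; deciding which one to keep would need a separate rule.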
Pickle is fine, just be careful: - if the data is downloaded from the network, check the signature before unpickling it - if the data is from a local source, set a...
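As one possible reading of "check the signature", the sketch below verifies an HMAC-SHA256 tag before unpickling. This is an assumption: the comment does not say which signature scheme is meant, and the names `sign` and `safe_loads` are hypothetical. It also assumes the key is shared out of band and never shipped alongside the data.

```python
import hashlib
import hmac
import pickle


def sign(payload: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the pickled payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def safe_loads(payload: bytes, signature: str, key: bytes):
    """Unpickle only if the tag matches; otherwise refuse."""
    expected = sign(payload, key)
    # compare_digest avoids leaking the match position through timing.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)
```

The check has to happen before `pickle.loads` is called, because unpickling untrusted bytes can execute arbitrary code.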
I tried what @cstorm125 proposed: use _ (underscore), use plural forms, and append the suffix _th when the data is in Thai.
The 1st one is actually bad. The 2nd and 3rd can be post-processed to combine digits and symbols into a larger token (one that carries some semantic meaning).
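A rough sketch of that kind of post-processing, assuming the tokenizer returns a list of strings and that adjacent tokens made only of ASCII digits and a few joining symbols should be merged. The function name `merge_numeric_tokens` and the symbol set are hypothetical choices for illustration, not part of any tokenizer's API.

```python
import re

# Assumption: a token is "numeric-like" if it has only digits and these symbols.
_NUMERIC_PART = re.compile(r"[0-9.,:/\-]+")


def merge_numeric_tokens(tokens):
    """Hypothetical helper: join runs of adjacent numeric-like tokens."""
    merged = []
    prev_numeric = False
    for tok in tokens:
        is_numeric = bool(_NUMERIC_PART.fullmatch(tok))
        if is_numeric and prev_numeric:
            # Extend the previous numeric-like token instead of starting a new one.
            merged[-1] += tok
        else:
            merged.append(tok)
        prev_numeric = is_numeric
    return merged
```

For instance, `["ราคา", "1", ",", "000", ".", "50", "บาท"]` would come back with the five middle tokens joined into `"1,000.50"`. Thai digits and a stricter grammar for numbers are not handled here.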
Added notes on this to collate()'s docstring: https://github.com/PyThaiNLP/pythainlp/commit/bc8223a6017f2d1a8a26a60f7f472a4ceeaa9a29
We may try to implement libthai's thcoll (https://github.com/tlwg/libthai/tree/master/src/thcoll). See the character weight table at https://github.com/tlwg/libthai/blob/master/src/thcoll/cweight.c
Is there any standard convention on how to put particular information into the notes column? I have been putting language codes at the end of the notes for a while, like...
Or we can have "known_active_on" and "human_check_on" fields to record when a human maintainer last saw the site online and meant to be the site...