Arthit Suriyawongkul comments

Results 362 comments of Arthit Suriyawongkul

Text normalization not working in some cases

To generalize, the current normalize() cannot handle (some of) these cases: (1) tone mark repetition; (2) spaces between a consonant and a tone mark. Correct?

Text normalization not working in some cases

#389 should fix case (1), tone mark repetition. For (2), the space issue, text like this is possible: "มีรูปภาพ ุ่มากในห้อง". Normalization would produce: "มีรูปภาพุ่มากในห้อง"...

Text normalization not working in some cases

Tone mark repetition is now covered, but spaces between a consonant and a tone mark are not yet.
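For illustration, tone mark repetition can be handled with a single regular expression. This is only a minimal sketch, not the code from #389 or PyThaiNLP's normalize(); the function name `collapse_tone_marks` is hypothetical, and keeping the first mark of each run is an assumption.

```python
import re

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..U+0E4B).
# Matches one tone mark followed by one or more further tone marks.
_TONE_RUN = re.compile(r"([\u0E48-\u0E4B])[\u0E48-\u0E4B]+")


def collapse_tone_marks(text: str) -> str:
    """Collapse each run of tone marks to its first mark (assumed policy)."""
    return _TONE_RUN.sub(r"\1", text)


# ko kai + mai ek + mai ek + sara aa  ->  ko kai + mai ek + sara aa
cleaned = collapse_tone_marks("\u0e01\u0e48\u0e48\u0e32")
```

The sketch deliberately leaves spaces alone, since removing a space before a combining mark can attach the mark to the wrong word.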

Add dependency parsing to PyThaiNLP

Pickle is fine, just be careful: - if the data is downloaded from the network, check the signature before unpickling it - if the data is from local, set a...
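One way to check data before unpickling is to verify a keyed hash (HMAC) over the raw bytes. The sketch below assumes a shared secret key and uses hypothetical helper names (`sign`, `load_verified_pickle`); a real deployment might use public-key signatures instead.

```python
import hashlib
import hmac
import pickle


def sign(data: bytes, key: bytes) -> bytes:
    """Return an HMAC-SHA256 tag for the raw pickle bytes."""
    return hmac.new(key, data, hashlib.sha256).digest()


def load_verified_pickle(data: bytes, signature: bytes, key: bytes):
    """Unpickle only if the tag matches; otherwise refuse."""
    expected = sign(data, key)
    # compare_digest avoids leaking information through timing differences
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(data)


key = b"shared-secret"  # assumption: distributed out of band
payload = pickle.dumps({"model": "demo"})
tag = sign(payload, key)
obj = load_verified_pickle(payload, tag, key)
```

The important point is ordering: the check runs on the bytes before `pickle.loads` is ever called, because unpickling untrusted data can execute arbitrary code.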

Naming convention for consistency (how to name files)

I tried what @cstorm125 proposed: use _ (underscore), use plural forms, and append the suffix _th if the data is in Thai.

Mistake in word tokenization for text containing digit related time and finance

The 1st one is actually bad. The 2nd and 3rd can be post-processed to combine digits and symbols into a larger token (with some semantics).
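Such a post-processing step could look roughly like the following. It is a sketch under assumptions: the function name `merge_numeric_tokens` and the separator set (":", ".", ",") are hypothetical choices, not part of any tokenizer's API.

```python
import re

SEPARATORS = {":", ".", ","}  # assumed set of symbols to glue between digits
# A token that is digits, optionally already joined by separators.
_NUMERIC = re.compile(r"\d+(?:[:.,]\d+)*")


def merge_numeric_tokens(tokens):
    """Join digit, separator, digit sequences into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (
            merged
            and _NUMERIC.fullmatch(merged[-1])
            and tok in SEPARATORS
            and i + 1 < len(tokens)
            and tokens[i + 1].isdigit()
        ):
            # Extend the previous numeric token with the separator and digits.
            merged[-1] += tok + tokens[i + 1]
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged


time_tokens = merge_numeric_tokens(["12", ":", "30"])
money_tokens = merge_numeric_tokens(["1", ",", "000", ".", "50"])
```

A separator is merged only when digits sit on both sides, so a sentence-final period after a number stays a separate token.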

Wrong ordering from collate()

Added notes on this to collate()'s docstring: https://github.com/PyThaiNLP/pythainlp/commit/bc8223a6017f2d1a8a26a60f7f472a4ceeaa9a29

Wrong ordering from collate()

May try to implement libthai's thcoll: https://github.com/tlwg/libthai/tree/master/src/thcoll. See the character weight table at https://github.com/tlwg/libthai/blob/master/src/thcoll/cweight.c
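To show why plain code point order misplaces some Thai words, here is a much simpler technique than thcoll: a sort key that moves a leading vowel behind the consonant it precedes. This is not libthai's algorithm and does not use its weight table; the name `thai_sort_key` is hypothetical, and tone marks and other details are ignored.

```python
import re

# Leading vowels (U+0E40..U+0E44) are written before the consonant
# (U+0E01..U+0E2E) but are ordered after it in dictionaries.
_LEADING_VOWEL = re.compile(r"([\u0E40-\u0E44])([\u0E01-\u0E2E])")


def thai_sort_key(word: str) -> str:
    """Swap each leading vowel with the consonant that follows it."""
    return _LEADING_VOWEL.sub(r"\2\1", word)


words = ["ขา", "เกา"]
by_code_point = sorted(words)               # leading vowel sorts last
by_key = sorted(words, key=thai_sort_key)   # consonant decides first
```

With the key, "เกา" is ordered under its consonant ก and so comes before "ขา", which plain code point sorting gets the other way round.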

Suggestion to add extra column for website status

Is there any standard convention on how to put particular information into the notes column? I have been putting language codes at the end of the notes for a while, like...

Proposal: Frequency tier for each URL

Or we can have "known_active_on" and "human_check_on" fields to record when a human maintainer last saw the site online and meant to be the site...