Korakot Chaovavanich

Results 23 comments of Korakot Chaovavanich

A few ideas, mostly still in the planning phase:
- tokenization together with spellcheck
- autocorrect from such spellcheck
- misspelling dataset
- sentence (or EDU) segmentation dataset
- thai word...

For the YouTube subtitle dataset, here are the current resources and work in progress.
- A script that runs every hour, searching for new YouTube videos that might have a Thai subtitle. See [thai_sub.gs](https://script.google.com/d/1BCMtSZe7DFimStGg_pscaQ5bkqvUayS-eTgG0VJbaIxNm_T9a97sXhDU/edit?usp=sharing)...

There is some progress. A new constituency treebank came out, so it needs conversion. The Thai PUD needs an update; no progress yet. The TNC treebank still has only head info, but no...

BEST is probably the same as InterBest (they renamed it a few times). Here's my list of direct links to them: https://gist.github.com/korakot/abf6c18c71cefe7b9107689dd904751f For Orchid, you can get it here: https://www.nectec.or.th/corpuso/phocadownload/dl_text_thai-eng/orchid_corpus.zip...

There are many problems with the current state of Thai datasets. I can confirm that:
- Orchid has SEG and POS
- Best has SEG and NER
- Their SEGs...

Today I started checking the quality of the BEST segmentation. I found a few errors even in the first files. This week I will compare BEST with ORCHID, and probably...
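
One way a comparison between two segmentation standards could be approached is boundary agreement: take the same raw text segmented under both standards and measure how many word boundaries they share. The sketch below assumes such parallel segmentations are available, which the comment above does not state. The function names `boundaries` and `boundary_agreement` are hypothetical and are not part of any corpus tooling.

```python
def boundaries(tokens):
    """Return the character offsets at which a token ends.

    The end of the final token is excluded, since every segmentation
    of the same text trivially shares it.
    """
    offsets, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def boundary_agreement(seg_a, seg_b):
    """Compare two segmentations of the same text.

    seg_a is treated as the reference and seg_b as the candidate.
    Returns (precision, recall, f1) over internal word boundaries.
    """
    if "".join(seg_a) != "".join(seg_b):
        raise ValueError("both segmentations must cover the same text")
    a, b = boundaries(seg_a), boundaries(seg_b)
    shared = len(a & b)
    precision = shared / len(b) if b else 1.0
    recall = shared / len(a) if a else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy input only, not taken from BEST or ORCHID.
    reference = ["ab", "cd", "e"]
    candidate = ["ab", "c", "de"]
    print(boundary_agreement(reference, candidate))
```

A low score here does not by itself mean that either corpus is wrong. It can simply reflect different segmentation guidelines, so disagreements would still need manual inspection.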

We have a portion of it segmented too. You can get it here. https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization

I am happy to collaborate on these datasets. I guess you can evaluate the 3 segmentation datasets (Orchid, Best, Wisesight) on a downstream task (sentiment) and compare them. I will...

For Orchid, the original format is a bit hard to work with. Someone (K. Vee) has converted it to XML, so that it's easier to parse out words and sentences....

These are named entities. Instead of 2-level segmentation of words, then NEs, they decided it's easier (for them) to just use 1-level as NEs.