Adding support for Thai Language

Open korakot opened this issue 5 years ago • 53 comments

I listened to the recent PyTorch Dev Conference. Yuhao Zhang said that a new release with 83 languages is coming. Do the 30 new languages include Thai as well?

There is the Thai-PUD treebank (1,000 sentences), but there are some mistakes in its annotations. If you plan to use it for training, I want to help correct it first.

Also, I know of a new Thai treebank, containing 2.4k sentences and 140k words to be released in a few months. I have access to it, but it's not finished yet. Maybe I can help prepare some parts and use it to train the Thai model.

Please let me know if we could cooperate.

korakot avatar Oct 17 '19 11:10 korakot

@korakot I think the best way is to get in touch with the Universal Dependencies community first, to start building that treebank and correcting Thai-PUD. If you could give us a work-in-progress version of the larger treebank, we can try to train some models to see if the performance is acceptable. :)

qipeng avatar Oct 17 '19 13:10 qipeng

@korakot, did you ever make more progress on getting the Thai treebank? I was looking to add more models to stanza, and one of the things I found was that there is a Thai sentiment dataset available. It would only make sense to provide that in the context of a larger set of Thai models, though, so now is a good time to revisit building models for Thai.

AngledLuffa avatar May 29 '20 06:05 AngledLuffa

There is some progress. A new constituency treebank has come out, so it needs conversion. The Thai PUD needs updating; no progress on that yet. The TNC treebank still has only head information and no relation-type labels; I need to check on its progress.

Can you provide a timeline and the types of Thai corpora you need? I'm probably the person who knows the most about these free Thai corpora.

korakot avatar May 29 '20 06:05 korakot

In terms of seeing if the Thai sentiment dataset is working for us, getting data for a segmentation model soon would be great. I've read that there was a corpus called Orchid which included both segmentation and POS, but the links I can find for that are out of date. There's also a corpus for InterBEST, which should be available at thailang.nectec.or.th, but I get a 404 when I try to register an account there.

In terms of models beyond that, at a very minimum I think we would need a POS model as well. stanza doesn't currently have a constituency parser, so that dataset might not be helpful now, but we could always try to use it in corenlp or possibly one day include a constituency parser in stanza. Dependencies would be great, of course, since we generally have that model in other languages. For all of these, there is no rush - when the data arrives, we'll evaluate our Thai models and see if they can be added.

Thanks for the help!

AngledLuffa avatar May 29 '20 06:05 AngledLuffa

BEST is probably the same as InterBEST (they renamed it a few times). Here's my list of direct links to them.

https://gist.github.com/korakot/abf6c18c71cefe7b9107689dd904751f

For Orchid, you can get it here.

https://www.nectec.or.th/corpuso/phocadownload/dl_text_thai-eng/orchid_corpus.zip

All of these (and other) Thai segmentation datasets use different criteria. I plan to normalize them to the same criteria, but for now you could use them as is. I will update you on these and other segmentation datasets once I finish the normalization.

The situation with POS tagsets is the same: we have 2-3 tagsets, which I plan to normalize to the Chula version of PUD.

korakot avatar May 30 '20 16:05 korakot

Thank you, this is extremely helpful.

If I understand correctly, there is segmentation and POS available in Orchid, and segmentation and NER available in Best. However, you say the segmentation standards are different. Are they different enough that you would recommend not mixing them? In that case, we wouldn't be able to provide a unified segmentation-POS-NER chain unless we cheat somehow and, for example, train the NER on Orchid segmentation. I could also just train a model on both datasets at the same time and see what happens.

Also, I notice a serious lack of test data. Orchid doesn't seem to have any, and the Best data doesn't have gold segmentation or labeled NER that I can see. Is there a standardized way to handle that, or should we just split it up however we think best?

AngledLuffa avatar May 30 '20 23:05 AngledLuffa

There are many problems with the current state of Thai datasets.

I can confirm that

  • Orchid has SEG and POS
  • Best has SEG and NER
  • Their SEG criteria are different, but there is no research on how different they are.
  • Orchid is old (1997) and has no standardized test set
  • Best has an old guideline for SEG and NER, but it's in Thai. There is a new updated guideline, but it's not publicly released.
  • You can probably split it however you like.

The root cause is probably that we have few NLP researchers and activities. I can help solve these problems one at a time, whichever is a priority.

korakot avatar May 31 '20 03:05 korakot

Sounds interesting to me. I'll talk with my PI tomorrow and see what we can do. At a very minimum we can train two different segmenters and see what differences they have.

One limitation is that, as far as I know, no one in our research group speaks Thai.

AngledLuffa avatar May 31 '20 17:05 AngledLuffa

Today I started to check the quality of the BEST segmentation. I found a few errors even in the first files.

This week I will compare BEST with ORCHID, and probably TNC and Wisesight, to get a clearer idea of their differences. Then I will choose one standard (probably TNC) and convert the rest, so we can have a consistent dataset for training word segmentation.

korakot avatar May 31 '20 18:05 korakot

Did Wisesight have a seg standard? I saw the sentiment data, but it looks unsegmented.

AngledLuffa avatar May 31 '20 18:05 AngledLuffa

We have a portion of it segmented too. You can get it here.

https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization

korakot avatar Jun 01 '20 00:06 korakot

I spoke with my PI, and he says that if you're up for collaborating on these datasets, we're interested. Perhaps one place to start would be to test segmentation models built on the different seg datasets.

AngledLuffa avatar Jun 02 '20 07:06 AngledLuffa

I am happy to collaborate on these datasets.

I guess you can evaluate the 3 segmentation datasets (Orchid, Best, Wisesight) on a downstream task (sentiment) and compare them. I will help clean up each of them, and I may convert them to the same standard. I am thinking of the TNC treebank mentioned in my original post; if we use TNC, we will have less work later when we need to train dependency parsing.

Please tell me if you have any suggestions on how we should proceed.

korakot avatar Jun 04 '20 08:06 korakot

For Orchid, the original format is a bit hard to work with. Someone (K. Vee) has converted it to XML, so it's easier to parse out words and sentences. His web host has gone down, but I made a copy here.

https://github.com/korakot/thainlp/blob/master/xmlchid.xml
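
For reference, a minimal sketch of reading it with ElementTree might look like the code below. The element and attribute names (sentence, word, pos) are assumptions for illustration only; check the actual tags in xmlchid.xml.

import xml.etree.ElementTree as ET

# Sketch only: the element/attribute names below are assumptions for
# illustration and may not match the real structure of xmlchid.xml.
tree = ET.parse('xmlchid.xml')
root = tree.getroot()
for sentence in root.iter('sentence'):
    words = [(w.text, w.get('pos')) for w in sentence.iter('word')]
    print(words)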

korakot avatar Jun 05 '20 15:06 korakot

I'm writing a script which turns Orchid into data files suitable for feeding into our tokenize and pos modules. Would you mind explaining briefly what the significance is of a space in Thai? It's labeled "PUNC" in Orchid, which is unique in my experience. Should I be treating this as a token, or should I just use it as a guaranteed break in the tokenization?

AngledLuffa avatar Jun 26 '20 01:06 AngledLuffa

Also, are these really suitable to be one word each? The translations are several words long. The second one even includes a space, which adds to my confusion about what spaces represent in Thai.

"ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ" "กระทรวงวิทยาศาสตร์ เทคโนโลยีและการพลังงาน"

AngledLuffa avatar Jun 26 '20 01:06 AngledLuffa

These are named entities. Instead of a 2-level segmentation (words first, then NEs), they decided it was easier (for them) to just use one level: the NEs.

korakot avatar Jun 26 '20 01:06 korakot

A space in Thai is more like punctuation. Normally, words are not surrounded by spaces; we only surround numbers or names. We break a sentence or phrase with a space. Mostly, it signifies a pause in speaking, so we don't have a strict rule about when to add a space, or even two spaces.

korakot avatar Jun 26 '20 01:06 korakot

In a dataset such as the Orchid dataset, how should I denote sentences that are separated? An extra space? The tokenization module simultaneously handles word segmentation and splitting sentences, so it would be nice to have a good way of learning that.

AngledLuffa avatar Jun 26 '20 23:06 AngledLuffa

In Orchid, fortunately, we have sentence boundary tagging. If you use the XML version above, you will see the sentence tags. Still, some sentences might be too long. Thai doesn't use a period to end sentences, so it's conceptually ambiguous what a sentence is. A whole paragraph could be sentences strung together with "then, then, then". For now, I suggest you use the current sentence tags. In the long run, I will try to standardize them into shorter EDUs (elementary discourse units).

korakot avatar Jun 27 '20 00:06 korakot

I may not have explained the question correctly. What I meant is that there is a statistical model for splitting sentences. For its training data, should I do this:

แต่จะมีความแตกต่าง ของการตอบสนองทางไดนามิกส์ (dynamics response) ทั้งการเร่ง และลดความเร็วของมอเตอร์ซึ่งยังผลทำให้ค่าเวลาควบคุมเบี่ยงเบน

(no extra space between the first two sentences in the corpus)

or should I do this:

แต่จะมีความแตกต่าง ของการตอบสนองทางไดนามิกส์ (dynamics response) ทั้งการเร่ง และลดความเร็วของมอเตอร์ ซึ่งยังผลทำให้ค่าเวลาควบคุมเบี่ยงเบน

(intentionally adding some whitespace)

Perhaps the first one is more correct, as it gives the model a chance to learn how to separate sentences when there is a long block of text with no whitespace.

AngledLuffa avatar Jun 27 '20 00:06 AngledLuffa

Sorry that I misinterpreted your question.

The second one (with space) is more natural. You are more likely to find it in real data.

The first one could be useful too. A difficult case is when a sentence ends at the end of a line and the next sentence begins on a new line. For a computer, it's hard to guess whether a space would have been inserted if everything were on the same line.

So, I agree that a no-space method can be more useful. Or you can test and compare their performance.

korakot avatar Jun 27 '20 00:06 korakot

I was going to use newlines to separate paragraphs & documents. Hopefully that doesn't throw things off too much. It can always be tested in different configurations, as you say

AngledLuffa avatar Jun 27 '20 00:06 AngledLuffa

https://github.com/stanfordnlp/stanza/pull/368

This adds a script for converting Orchid to a format usable by our tokenizer. I'm currently using a random train/dev/test split. Is there an official or widely used split which I could have used instead?
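
For reference, the random split is along these lines: shuffle at the document level and carve off dev and test fractions. This is only a sketch with made-up fractions and seed, not necessarily what the script in the PR does.

import random

# Sketch of a random document-level split; the fractions and seed are
# illustrative and not necessarily what process_orchid.py uses.
def split_documents(documents, dev_frac=0.1, test_frac=0.1, seed=1234):
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_dev = int(len(docs) * dev_frac)
    n_test = int(len(docs) * test_frac)
    return docs[n_dev + n_test:], docs[:n_dev], docs[n_dev:n_dev + n_test]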

Running it gets 87.8% accuracy on the random test set, which doesn't seem great, to be honest. Perhaps a larger dataset would work better. Maybe some other splitting scheme would work better as well; keeping proper nouns as one token seems questionable to me.

This trains the tokenize model:

# convert the Orchid XML into files for the tokenizer
python3 scripts/tokenize/process_orchid.py extern_data/thai/xmlchid.xml data/tokenize
# train the tokenizer
./scripts/run_tokenize.sh UD_Thai-orchid

Do you want to take a look? I can also get to work on some other Thai tokenization dataset if you want a couple options to compare.

AngledLuffa avatar Jun 30 '20 20:06 AngledLuffa

This works a bit better:

./scripts/run_tokenize.sh UD_Thai-orchid --dropout 0.1 --unit_dropout 0.1

92% F1 on the tokens and 71% on the sentence splits.

AngledLuffa avatar Jul 01 '20 20:07 AngledLuffa

For splits, there are no official or widely used ones.

Do you want to take a look?

Yes, I may help convert those NEs to words. I hope there aren't too many to tag manually. Still, I need to know the proper format to keep both the tokenization and the NER. Probably something like <ner>word1|word2|word3</ner> ?
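
If we go with something like that, a quick sketch of parsing it could be as follows. The <ner> tag name, the "|" separator, and the B-NE/I-NE labels are just the suggestion above, not an agreed format.

import re

# Sketch for the proposed <ner>word1|word2|word3</ner> markup; the tag name,
# separator, and B-/I- labels are illustrative, not a fixed format.
def parse_ner_span(text):
    match = re.fullmatch(r'<ner>(.*)</ner>', text)
    words = match.group(1).split('|')
    tags = ['B-NE'] + ['I-NE'] * (len(words) - 1)
    return list(zip(words, tags))

print(parse_ner_span('<ner>word1|word2|word3</ner>'))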

92% F1 on the tokens and 71% on the sentence splits.

90+% should be OK for a start. We can add more data to improve it later.

korakot avatar Jul 02 '20 16:07 korakot

To play with the Orchid tokenizer as it currently is, you can:

git checkout -b more_thai_stuff

... the branch name being chosen because I was planning on doing Best next

then

stanza.download("th", resources_branch="thai")
nlp = stanza.Pipeline('th')

The default tokenizer loaded (and the only model currently available) is the one trained on Orchid.
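
After that, something along these lines should run the tokenizer and print the tokens per sentence (just a usage sketch; the example text is one of the NE strings from earlier in this thread):

import stanza

# Assumes the Thai models were downloaded as above (resources_branch="thai")
nlp = stanza.Pipeline('th', processors='tokenize')
doc = nlp('ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ')
for i, sentence in enumerate(doc.sentences):
    print('sentence', i, [token.text for token in sentence.tokens])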

AngledLuffa avatar Jul 02 '20 20:07 AngledLuffa

BEST is quite a lot larger, so I gave that a try. I think there is something suboptimal about how I am choosing sentence splits. What I did was:

  • | is a word boundary
  • newline is a sentence boundary
  • blank lines or new files are document boundaries

There are very few document boundaries, so for one thing, this leads to very large unbroken chunks of text. Currently our tool gets a little slow when running the dev set in this paradigm. More importantly, it's pretty inaccurate in terms of finding the sentence boundaries. Is this scheme I described reasonable, or should I be doing something differently?
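
For concreteness, the scheme above amounts to something like this sketch (not the actual conversion script in the branch; the filename is hypothetical):

# Sketch of the boundary scheme described above; the real conversion script
# in stanza/utils/datasets may differ in its details.
def parse_best_line(line):
    # a '|'-separated line of words becomes one sentence (a list of tokens)
    return [word for word in line.strip().split('|') if word]

documents = []      # each document is a list of sentences
current_doc = []
with open('best_file.txt', encoding='utf-8') as fin:   # hypothetical filename
    for line in fin:
        if not line.strip():            # blank line: document boundary
            if current_doc:
                documents.append(current_doc)
                current_doc = []
        else:                           # each non-blank line: one sentence
            current_doc.append(parse_best_line(line))
if current_doc:
    documents.append(current_doc)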

The more_thai_stuff branch has the conversion script, in case you want to look at the results of running it.

Both scripts are now in stanza/utils/datasets

AngledLuffa avatar Jul 04 '20 00:07 AngledLuffa

For example, it gets these numbers after a few hours of training:

th_best: token F1 = 94.74, sentence F1 = 39.25, mwt F1 = 94.74

so it's definitely learning tokens pretty well, but I think there's something wrong with what I did for sentences.

AngledLuffa avatar Jul 05 '20 01:07 AngledLuffa

@AngledLuffa could you also share some examples of what the plaintext file looks like so that native speakers can help us diagnose potential issues?

qipeng avatar Jul 06 '20 05:07 qipeng