Lalita Lowphansirikul
Lalita Lowphansirikul
Todo - [ ] Sample sentences from ws-large corpus and find words with repetitive characters
According to the model finetuning pipeline, the input tokens are tokenized with other tokenizer (e.g. PyThaiNLp's newmm for `thainer` dataset) and then retokenizer with SentencePiece tokenizer. However, the input tokens...
Todos: - [x] Define the rules (e.g. tokenization) for BLEU score evaluation for both TH → EN and EN → TH direction. - [x] Write a new script to evaluate...