thai2transformers
Pretraining transformer-based Thai language models
When loaded with `transformers.AutoTokenizer.from_pretrained`, the tokenizer's `model_max_length` is set to `1000000000000000019884624838656`. This results in `IndexError: index out of range in self` when used with flair in the code below.

```python
from...
```
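
The reported value equals `int(1e30)`, which looks like a placeholder for "no maximum length configured" rather than a real limit. A minimal sketch of one possible workaround follows, assuming the fix is to replace that placeholder with a finite length before the tokenizer is handed to another library. The helper `clamp_max_length`, the constant `UNSET_SENTINEL`, and the fallback of 512 are hypothetical names and values chosen for illustration, and a stand-in object is used so the sketch runs without downloading a model.

```python
from types import SimpleNamespace

# Assumption: values at or above this are treated as "no real limit configured".
# int(1e30) equals the value reported in the issue above.
UNSET_SENTINEL = int(1e30)


def clamp_max_length(tokenizer, fallback=512):
    """Set a finite model_max_length when the current value looks like a placeholder.

    `fallback` is a hypothetical default; use the position limit of the actual model.
    """
    current = getattr(tokenizer, "model_max_length", None)
    if current is None or current >= UNSET_SENTINEL:
        tokenizer.model_max_length = fallback
    return tokenizer.model_max_length


# Stand-in object so the sketch runs without any model files.
fake_tokenizer = SimpleNamespace(model_max_length=1000000000000000019884624838656)
print(clamp_max_length(fake_tokenizer))  # 512
```

With a real tokenizer, a similar effect can usually be had by passing `model_max_length=...` to `from_pretrained`; which value is correct depends on the model and is not stated in the issue.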
## Thai-language specific metrics

### Sequence classification

`sklearn` implementation

- [ ] accuracy
- F1
- precision
- recall
- prevalence

### Token classification

`seqeval` at entity level

- [...
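
For reference, the sequence classification metrics listed above can be sketched without any dependencies. This is not the planned `sklearn` implementation, only a plain-Python illustration for a single positive class. The function name `binary_classification_metrics` is hypothetical, and "prevalence" is assumed here to mean the share of true labels that belong to the positive class.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 and prevalence for one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumption: prevalence = fraction of true labels that are positive.
        "prevalence": (tp + fn) / len(pairs),
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(binary_classification_metrics(y_true, y_pred))
```

Multi-class datasets would additionally need an averaging choice (micro, macro, or weighted), which the list above does not specify.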
`transformers` is currently the de facto way to train NLP models (maybe speech and image soon?). For the Thai language, the default settings cause some difficulties; for example, tokenization...
- [x] zh_cn to th notebook
- [ ] zh_cn to th script
Hello, I tried wangchanberta on wongnai_reviews with the code below and ran into a strange error. How should I fix it?

```
from transformers import (
    CamembertTokenizer,
    AutoModelForSequenceClassification,
    pipeline
)
from thai2transformers.preprocess import process_transformers

# Load pre-trained tokenizer
tokenizer =...
```
Both datasets have over 100k questions. Translating them will produce training sets:

- [ ] 1.1 machine translation
- [ ] 2.0 machine translation
- [ ] 1.1 human translation
-...
Benchmark WangchanBERTa results (all models; see https://arxiv.org/abs/2101.09635) against [AI4thai APIs](https://aiforthai.in.th/service_bn.php):

- [x] en-th machine translation
- [ ] zh-th machine translation (pending model from AI Builders)
- [ ] word...
Todo

- [ ] Sample sentences from ws-large corpus and find words with repetitive characters
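
One way this todo could be approached is sketched below. It assumes that sentences are already segmented into lists of words and that "repetitive" means the same character occurring three or more times in a row. Both are assumptions, since the todo does not define them. The names `sample_sentences`, `words_with_repeated_chars`, and `REPEATED_CHAR` are hypothetical.

```python
import random
import re

# Assumption: "repetitive" = the same character three or more times in a row.
REPEATED_CHAR = re.compile(r"(.)\1{2,}")


def sample_sentences(sentences, k, seed=0):
    """Draw up to k sentences reproducibly; each sentence is a list of words."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


def words_with_repeated_chars(words):
    """Return the words that contain a run of three or more identical characters."""
    return [word for word in words if REPEATED_CHAR.search(word)]


words = ["มากกกก", "ดี", "อร่อยยยย", "hello", "sooo"]
print(words_with_repeated_chars(words))  # ['มากกกก', 'อร่อยยยย', 'sooo']
```

The threshold of three is a guess: a run of two identical characters can occur in ordinary words, so a lower threshold would likely produce many false positives.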
Source data to pretrain a new WangchanBERTa on the legal domain