thai2transformers

Pretraining transformer-based Thai language models

15 thai2transformers issues

When the tokenizer is loaded with `transformers.AutoTokenizer.from_pretrained`, its `model_max_length` is set to `1000000000000000019884624838656`. This results in `IndexError: index out of range in self` when the tokenizer is used with flair in the code below. ```python from...
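The reported value equals `int(1e30)`, the placeholder `transformers` uses when a checkpoint does not declare a maximum length. With that placeholder nothing is truncated, so long inputs can plausibly index past the model's position embeddings. A minimal sketch of one possible workaround follows; the helper name `clamp_max_length` and the limit of 512 are assumptions for illustration, not values from the issue, and the right limit depends on the model's configuration.

```python
from types import SimpleNamespace

# Placeholder used by transformers when no maximum length is declared.
VERY_LARGE_INTEGER = int(1e30)


def clamp_max_length(tokenizer, limit=512):
    """Cap tokenizer.model_max_length at `limit` (hypothetical helper).

    `limit` is an assumed value; in practice it should come from the
    model's own configuration.
    """
    if tokenizer.model_max_length > limit:
        tokenizer.model_max_length = limit
    return tokenizer


# Stand-in object so the sketch runs without downloading a model;
# a real tokenizer exposes the same `model_max_length` attribute.
fake_tokenizer = SimpleNamespace(model_max_length=VERY_LARGE_INTEGER)
clamp_max_length(fake_tokenizer, limit=512)
print(fake_tokenizer.model_max_length)
```

With a real tokenizer, passing `model_max_length=...` as a keyword argument to `from_pretrained` should have a similar effect, though whether that resolves the flair error is not confirmed by the excerpt.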

bug

## Thai-language specific metrics
### Sequence classification
`sklearn` implementation
- [ ] accuracy
- F1
- precision
- recall
- prevalence
### Token classification
`seqeval` at entity level
- [...
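For reference, the sequence-classification metrics named above can be written out from their definitions. The sketch below uses plain Python rather than `sklearn` so that it has no dependencies; the function name `binary_metrics`, the binary setting, and the toy labels are assumptions for illustration, and prevalence is taken here to mean the share of positive examples among the true labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1 and prevalence for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    n = len(pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of examples whose true label is the positive class.
        "prevalence": sum(t == positive for t in y_true) / n,
    }


# Toy labels, made up for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In `sklearn`, `accuracy_score` and `precision_recall_fscore_support` cover the first four values; prevalence has to be computed from the label counts.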

`transformers` is currently the de facto way to train NLP models (maybe speech and image soon?). For the Thai language, we have some difficulties using the default settings; for example, tokenization...

enhancement

- [x] zh_cn to th notebook - [ ] zh_cn to th script

Hello. I tried using wangchanberta on wongnai_reviews with the code below and ran into a strange error. How should I fix it? ``` from transformers import ( CamembertTokenizer, AutoModelForSequenceClassification, pipeline ) from thai2transformers.preprocess import process_transformers # Load pre-trained tokenizer tokenizer =...

Both datasets have over 100k questions. The translations will be used as training sets: - [ ] 1.1 machine translation - [ ] 2.0 machine translation - [ ] 1.1 human translation -...

Benchmark WangchanBERTa results (all models; see https://arxiv.org/abs/2101.09635) against [AI4thai APIs](https://aiforthai.in.th/service_bn.php): - [x] en-th machine translation - [ ] zh-th machine translation (pending model from AI Builders) - [ ] word...

documentation

Todo - [ ] Sample sentences from ws-large corpus and find words with repetitive characters
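The task above could be approached roughly as follows. This is a sketch under stated assumptions: the sentences are taken to be already word-tokenized, "repetitive" is taken to mean the same character three or more times in a row, and the helper names and sample sentences are invented for illustration rather than drawn from the ws-large corpus.

```python
import random
import re

# Same character three or more times in a row (the threshold is an assumption).
REPEAT = re.compile(r"(.)\1{2,}")


def words_with_repeats(words):
    """Return the words that contain a run of repeated characters."""
    return [w for w in words if REPEAT.search(w)]


def sample_sentences(sentences, k, seed=0):
    """Draw up to k sentences reproducibly using a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


# Made-up, pre-tokenized sentences standing in for the corpus.
sentences = [
    ["อร่อย", "มากกกกก"],
    ["ดี", "จริง"],
    ["สุดยอดดดด", "เลย"],
]
picked = sample_sentences(sentences, k=3)
found = sorted({w for sent in picked for w in words_with_repeats(sent)})
print(found)
```

A threshold of three avoids flagging ordinary spellings that contain a doubled letter, but it is a judgment call and may need tuning against real data.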

Source data to pretrain a new WangchanBERTa on the legal domain