thai2transformers issues

Missing model_max_length in roberta config

When the tokenizer is loaded with `transformers.AutoTokenizer.from_pretrained`, its `model_max_length` is set to `1000000000000000019884624838656`. This results in `IndexError: index out of range in self` when the model is used with flair in the code below. ```python from...

ThewBear

bug
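The value reported in this issue equals `int(1e30)`, which is the placeholder `transformers` assigns when a tokenizer config does not define `model_max_length`. A minimal sketch of a workaround follows, assuming the caller knows the correct limit for the checkpoint. The helper name `ensure_model_max_length` and the limit of 416 are illustrative assumptions, not part of thai2transformers; check the model's own config before picking a value. A stand-in object is used so the snippet runs without downloading a model.

```python
from types import SimpleNamespace

# Placeholder that transformers uses when no maximum length is configured.
UNSET_MAX_LENGTH = int(1e30)


def ensure_model_max_length(tokenizer, limit):
    """Set tokenizer.model_max_length to `limit` if it is missing or unset.

    `tokenizer` is any object exposing a `model_max_length` attribute.
    `limit` must come from the caller (hypothetical value in this sketch).
    """
    current = getattr(tokenizer, "model_max_length", None)
    if current is None or current >= UNSET_MAX_LENGTH:
        tokenizer.model_max_length = limit
    return tokenizer


# Stand-in for a tokenizer loaded without a configured maximum length.
fake_tokenizer = SimpleNamespace(model_max_length=1000000000000000019884624838656)
ensure_model_max_length(fake_tokenizer, limit=416)  # 416 is an assumed limit

# A tokenizer that already has a sensible limit is left untouched.
configured_tokenizer = SimpleNamespace(model_max_length=512)
ensure_model_max_length(configured_tokenizer, limit=416)
```

With a real tokenizer, the same effect can usually be had by passing `model_max_length=...` to `AutoTokenizer.from_pretrained`, so that truncation keeps inputs within the model's position embeddings.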

Feature: add new NER scheme

1

lalital

Refactor as package

## Thai-language specific metrics ### Sequence classification `sklearn` implementation - [ ] accuracy - F1 - precision - recall - prevalence ### Token classification `seqeval` at entity level - [...

cstorm125
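The sequence classification metrics listed in this issue can be sketched without any dependencies, which makes the definitions explicit. This is a hedged illustration for the binary case only, not the proposed `sklearn` implementation; the function name `binary_classification_report` is hypothetical, and "prevalence" is assumed here to mean the share of positive examples among the true labels.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1 and prevalence for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    n = len(pairs)

    accuracy = sum(1 for t, p in pairs if t == p) / n
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Assumed definition: fraction of true labels that are positive.
    prevalence = sum(1 for t in y_true if t == positive) / n

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "prevalence": prevalence,
    }


report = binary_classification_report(
    y_true=[1, 0, 1, 1, 0, 0, 1, 0],
    y_pred=[1, 0, 0, 1, 0, 1, 1, 0],
)
```

In a package, the same numbers would more likely come from `sklearn.metrics` (`accuracy_score`, `precision_recall_fscore_support`), which also handles multi-class averaging.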

Refactor thai2transformers as utility package for transformers

`transformers` is currently the de facto way to train NLP models (maybe speech and image soon?). For Thai language, we have some difficulties using the default settings; for example, tokenization...

cstorm125

enhancement

Add seq2seq notebooks and scripts

- [x] zh_cn to th notebook - [ ] zh_cn to th script

cstorm125

Error on wongnai_reviews

3

Hello, I tried using wangchanberta on wongnai_reviews with the code below and ran into a strange error. I'm not sure how to fix it. ``` from transformers import ( CamembertTokenizer, AutoModelForSequenceClassification, pipeline ) from thai2transformers.preprocess import process_transformers # Load pre-trained tokenizer tokenizer =...

peune

Translate and align SQuAD 1.1 and SQuAD 2.0

Both datasets have over 100k questions. Translations will make training sets: - [ ] 1.1 machine translation - [ ] 2.0 machine translation - [ ] 1.1 human translation -...

cstorm125

Benchmark against AI4Thai

Benchmark WangchanBERTa results (all models; see https://arxiv.org/abs/2101.09635) against [AI4Thai APIs](https://aiforthai.in.th/service_bn.php): - [x] en-th machine translation - [ ] zh-th machine translation (pending model from AI Builders) - [ ] word...

cstorm125

documentation

Explore percentage of repetitive characters in wisesight corpus

Todo - [ ] Sample sentences from ws-large corpus and find words with repetitive characters

lalital
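The todo item above could be approached roughly as follows. This is a sketch under stated assumptions: a token counts as "repetitive" when one character occurs at least three times in a row, and tokens are taken by splitting on whitespace. Thai text is normally not space-delimited, so a real run over the ws-large corpus would need a Thai word tokenizer; the sample sentences and the names `find_repetitive_tokens` and `repetitive_percentage` are made up for illustration.

```python
import re


def find_repetitive_tokens(sentences, min_run=3):
    """Return (all_tokens, repetitive_tokens) for whitespace-split sentences.

    A token is repetitive if any single character repeats `min_run` or more
    times consecutively (assumed definition).
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    repetitive = [tok for tok in tokens if pattern.search(tok)]
    return tokens, repetitive


def repetitive_percentage(sentences, min_run=3):
    """Percentage of tokens that contain a repeated-character run."""
    tokens, repetitive = find_repetitive_tokens(sentences, min_run)
    return 100.0 * len(repetitive) / len(tokens) if tokens else 0.0


# Hypothetical pre-tokenized samples, not taken from the corpus.
samples = ["อร่อยยยย มาก", "ดี ครับ", "sooooo good"]
all_tokens, repetitive_tokens = find_repetitive_tokens(samples)
percentage = repetitive_percentage(samples)
```

Raising `min_run` makes the check stricter; with the default of 3, a word such as "good" is not flagged because its repeated letter occurs only twice.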

Source data for LegalWangchanBERTa

Source data to pretrain a new WangchanBERTa on legal domain

cstorm125
