
Missing model_max_length in roberta config

ThewBear opened this issue · 0 comments

When the tokenizer is loaded with transformers.AutoTokenizer.from_pretrained, model_max_length is set to 1000000000000000019884624838656 (the default sentinel transformers uses when no value is specified).

This results in IndexError: index out of range in self when the model is used with flair, as in the code below.

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

sentence = Sentence('...')  # any text long enough to exceed the model's positional embeddings
wangchanberta = TransformerDocumentEmbeddings('airesearch/wangchanberta-base-att-spm-uncased')
wangchanberta.embed(sentence)

After searching, I found this issue https://github.com/huggingface/transformers/issues/14315#issuecomment-964363283, which states that model_max_length is missing from the tokenizer configuration file.
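
For reference, a minimal check (only transformers needed, no flair) shows the tokenizer picking up the sentinel value:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('airesearch/wangchanberta-base-att-spm-uncased')
# prints 1000000000000000019884624838656 when model_max_length is absent from the config
print(tokenizer.model_max_length)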

My current workaround is to manually override the missing value after loading:

wangchanberta.tokenizer.model_max_length = 510
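
Put together, the workaround looks like the sketch below. The value 510 is an assumption on my part (512 positions minus the two special tokens added by the tokenizer); adjust it if the model's max_position_embeddings differs.

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

wangchanberta = TransformerDocumentEmbeddings('airesearch/wangchanberta-base-att-spm-uncased')
# cap input length so token indices stay within the model's positional embeddings
wangchanberta.tokenizer.model_max_length = 510

sentence = Sentence('your input text here')  # placeholder; use any document
wangchanberta.embed(sentence)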

ThewBear · Jul 02 '22