
Roberta Tokenizer

Open dehghanm opened this issue 2 years ago • 0 comments

Hi

I want to use the RoBERTa tokenizer. The following example shows how I load it and tokenize a Persian sentence.

    from transformers import AutoTokenizer

    model_name = "HooshvareLab/roberta-fa-zwnj-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    string = "این یک سند است"
    tokenized_string = tokenizer.tokenize(string)
    print(tokenized_string)

The result of the above code is as follows:

    ['اÛĮÙĨ', 'ĠÛĮÚ©', 'ĠسÙĨد', 'Ġاست']

However, I expected:

    ["این", "یک", "سند", "است"]

How can I solve this issue?
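For context, the output above looks like the printable form that GPT-2 style byte-level BPE tokenizers use for raw UTF-8 bytes (the `Ġ` prefix stands for a leading space), so the tokens may be correct and only hard to read. The sketch below is a minimal illustration under that assumption. It rebuilds the standard byte-to-character table in plain Python and maps a token back to readable text. The helper names `to_byte_level` and `from_byte_level` are hypothetical and not part of `transformers`.

```python
def bytes_to_unicode():
    """Build the byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Bytes that are already printable are mapped to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is assigned a code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Hypothetical helper: show text the way a byte-level BPE token displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token):
    """Hypothetical helper: turn a byte-level token back into readable text.

    Raises UnicodeDecodeError if the token ends in the middle of a
    multi-byte character, which can happen for sub-word pieces.
    """
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


if __name__ == "__main__":
    # The second token from the output above, decoded back to text
    # (the result starts with the space that "Ġ" represents).
    print(repr(from_byte_level("ĠÛĮÚ©")))
```

With the real tokenizer, calling `tokenizer.convert_tokens_to_string([token])` on each token, or `tokenizer.decode(...)` on the token ids, should perform the same byte-level decoding without a hand-written table.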

dehghanm, Sep 03 '22 19:09