Aditya Tiwari
I also faced the same issue when training with the ByteLevelBPETokenizer suggested in https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=IMnymRDLe0hi

Tokenizer training:

```
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    iterator=LIST_OF_STRINGS,  # an in-memory list of training strings
    vocab_size=52000,
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ],
)
```
...
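For context, a minimal sketch of how a tokenizer trained this way is typically saved and reloaded for model training, following the linked notebook's workflow; the directory name and `max_len` value here are assumptions for illustration, not taken from the comment above:

```
from transformers import RobertaTokenizerFast

# Assumed output directory; save_model writes vocab.json and merges.txt there.
tokenizer.save_model("./my_tokenizer")

# Reload the trained vocabulary/merges as a fast tokenizer for the model.
reloaded = RobertaTokenizerFast.from_pretrained("./my_tokenizer", max_len=512)
print(reloaded.tokenize("Hello world"))
```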