ModernBERT
`_pad_token` attribute?
Thanks for making this open-source! The following function checks for the `_pad_token` attribute:
```python
def _tokenize(self, text_sample):
    if self.tokenizer._pad_token is None:
        # Some tokenizers (e.g. GPT2 tokenizer) have no padding token which causes bugs
        raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")
    return self.tokenizer(text_sample["text"], truncation=True, padding="max_length", max_length=self.max_seq_len)
```
But shouldn't it simply check for `pad_token_id`? My tokenizer has `pad_token_id` and `pad_token`, but no `_pad_token`.
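For illustration, here is a minimal sketch of what I mean, just swapping the private attribute for the public one and assuming the rest of `_tokenize` stays the same:

```python
def _tokenize(self, text_sample):
    # Check the public pad_token_id instead of the private _pad_token,
    # which not every tokenizer exposes.
    if self.tokenizer.pad_token_id is None:
        # Some tokenizers (e.g. the GPT2 tokenizer) have no padding token
        raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")
    return self.tokenizer(
        text_sample["text"],
        truncation=True,
        padding="max_length",
        max_length=self.max_seq_len,
    )
```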
@ceferisbarov I faced the same issue and solved it by replacing `_pad_token` with `pad_token` on lines 206 and 486 of the file `src/text_data.py`.
@cservan Same here! Thanks for confirming that it is actually a bug.