ModernBERT
`_pad_token` attribute?
Thanks for making this open-source! The following function checks for the `_pad_token` attribute:
```python
def _tokenize(self, text_sample):
    if self.tokenizer._pad_token is None:
        # Some tokenizers (e.g. GPT2 tokenizer) have no padding token which causes bugs
        raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")
    return self.tokenizer(text_sample["text"], truncation=True, padding="max_length", max_length=self.max_seq_len)
```
But shouldn't it simply check for `pad_token_id`? My tokenizer has `pad_token_id` and `pad_token`, but no `_pad_token`.
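For illustration, here is a minimal sketch of what I mean, just swapping the private attribute for the public one and assuming the rest of `_tokenize` stays the same:

```python
def _tokenize(self, text_sample):
    # Check the public pad_token_id instead of the private _pad_token,
    # which not every tokenizer exposes.
    if self.tokenizer.pad_token_id is None:
        # Some tokenizers (e.g. the GPT2 tokenizer) have no padding token
        raise RuntimeError("If tokenizing on-the-fly, tokenizer must have a pad_token_id")
    return self.tokenizer(
        text_sample["text"],
        truncation=True,
        padding="max_length",
        max_length=self.max_seq_len,
    )
```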
@ceferisbarov I faced the same issue and solved it by replacing `_pad_token` with `pad_token` on lines 206 and 486 of the file `src/text_data.py`.
@cservan Same here! Thanks for confirming that it is actually a bug.