minbpe

decode() method in GPT4Tokenizer does not handle special tokens

Open Vakarva opened this issue 10 months ago • 0 comments

It appears that the decode() method in the GPT4Tokenizer class does not handle special tokens. I submitted a pull request (#63) with some updated code, but also wanted to post the issue here. Here is the original code for reference:

def decode(self, ids):
  # we have to un-permute the bytes before we decode
  text_bytes = b"".join(self.vocab[idx] for idx in ids)
  text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
  text = text_bytes.decode("utf-8", errors="replace")
  return text
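
For illustration, here is a minimal sketch of one way a decode() could treat special tokens separately from ordinary ones. It is not the code from the pull request and not the minbpe class itself: ToyShuffledTokenizer, its rotate-by-one byte permutation, encode_ordinary() as written here, and the id 256 for the special token are all assumptions made up for this example. The idea shown is that special-token ids are looked up in their own table and emitted as plain UTF-8, while only ordinary vocab bytes go through the inverse byte shuffle.

```python
class ToyShuffledTokenizer:
    """Toy stand-in for a tokenizer with a byte permutation (hypothetical, not minbpe)."""

    def __init__(self):
        # Hypothetical byte permutation: rotate every byte value by one.
        self.byte_shuffle = {i: (i + 1) % 256 for i in range(256)}
        self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}
        # Base vocab only: token id i maps to the single (shuffled) byte i.
        self.vocab = {i: bytes([i]) for i in range(256)}
        # Hypothetical special-token table, kept separate from the vocab.
        self.inverse_special_tokens = {256: "<|endoftext|>"}

    def encode_ordinary(self, text):
        # Shuffle the UTF-8 bytes; with no merges, each shuffled byte is a token id.
        return [self.byte_shuffle[b] for b in text.encode("utf-8")]

    def decode(self, ids):
        parts = []
        for idx in ids:
            if idx in self.inverse_special_tokens:
                # Special tokens are plain strings: no un-permuting needed.
                parts.append(self.inverse_special_tokens[idx].encode("utf-8"))
            elif idx in self.vocab:
                # Ordinary tokens: un-permute their bytes before decoding.
                parts.append(bytes(self.inverse_byte_shuffle[b] for b in self.vocab[idx]))
            else:
                raise ValueError(f"invalid token id: {idx}")
        # Join first, then decode, so multi-byte characters split across tokens survive.
        return b"".join(parts).decode("utf-8", errors="replace")


tok = ToyShuffledTokenizer()
ids = tok.encode_ordinary("hi") + [256]
print(tok.decode(ids))  # hi<|endoftext|>
```

Checking the special-token table first keeps the un-permute step away from bytes that were never shuffled. Whether the real fix should raise on unknown ids or skip them is a design choice this sketch does not settle.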

Vakarva Apr 07 '24 21:04