
SP en-/decoding

Open d-kleine opened this issue 4 months ago • 0 comments

Bug description

Issue reported here: https://livebook.manning.com/forum?product=raschka&comment=578419

This is the affected code: https://github.com/rasbt/LLMs-from-scratch/blob/78bbcb364305f7f59e0954cbe3d5e24fd36ef249/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb#L1215-L1233

This code uses the older SentencePiece en-/decoding variants for pieces (tokens) and token IDs. The newer (and more intuitive) `encode`/`decode` methods would look like this:

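```python
import sentencepiece as spm


class LlamaTokenizer:
    def __init__(self, tokenizer_file):
        # Load the SentencePiece model from the given file path
        sp = spm.SentencePieceProcessor()
        sp.load(tokenizer_file)
        self.tokenizer = sp

    def encode(self, text):
        # Text -> list of integer token IDs
        return self.tokenizer.encode(text, out_type=int)

    def decode(self, ids):
        # List of integer token IDs -> text
        return self.tokenizer.decode(ids)


# tokenizer_file is assumed to be defined earlier in the notebook
tokenizer = LlamaTokenizer(tokenizer_file)
```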
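One way to check that such a switch does not change behavior might be a small round-trip test. The following is only a sketch: `StandInTokenizer` and `round_trip_ok` are hypothetical names that are neither part of sentencepiece nor of the notebook, and the stand-in simply maps text to UTF-8 byte values so the example runs without a model file.

```python
class StandInTokenizer:
    """Hypothetical stand-in exposing the same encode/decode interface
    as the LlamaTokenizer wrapper; it is NOT a SentencePiece model."""

    def encode(self, text):
        # Use the UTF-8 byte values of the text as "token IDs"
        return list(text.encode("utf-8"))

    def decode(self, ids):
        # Turn the byte values back into a string
        return bytes(ids).decode("utf-8")


def round_trip_ok(tokenizer, text):
    """Return True if encode() yields integer IDs and decode() restores the text."""
    ids = tokenizer.encode(text)
    all_ints = all(isinstance(i, int) for i in ids)
    return all_ints and tokenizer.decode(ids) == text


# Works with any object that offers encode(text) and decode(ids)
print(round_trip_ok(StandInTokenizer(), "Every effort moves you"))
```

With the real model, the same helper could be called as `round_trip_ok(LlamaTokenizer(tokenizer_file), "...")`. Since SentencePiece may normalize the input text, exact equality is not guaranteed for every string, so a failing check would need a closer look rather than being treated as a bug right away.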

What do you think, would it make sense to update the code here?

What operating system are you using?

None

Where do you run your code?

None

Environment




d-kleine, Jun 15 '25 11:06