LLMs-from-scratch SP en-/decoding

Open d-kleine opened this issue 4 months ago • 0 comments

Bug description

Issue reported here: https://livebook.manning.com/forum?product=raschka&comment=578419

This is the affected code: https://github.com/rasbt/LLMs-from-scratch/blob/78bbcb364305f7f59e0954cbe3d5e24fd36ef249/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb#L1215-L1233

The linked code uses older SentencePiece (SP) en-/decoding variants for pieces (tokens) and token IDs. A newer (and more intuitive) version would be:

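import sentencepiece as spm


class LlamaTokenizer:
    def __init__(self, tokenizer_file):
        # Load the SentencePiece model from the given file path
        sp = spm.SentencePieceProcessor()
        sp.load(tokenizer_file)
        self.tokenizer = sp

    def encode(self, text):
        # Text -> list of integer token IDs
        return self.tokenizer.encode(text, out_type=int)

    def decode(self, ids):
        # List of integer token IDs -> text
        return self.tokenizer.decode(ids)


# tokenizer_file is assumed to be the tokenizer model path defined earlier in the notebook
tokenizer = LlamaTokenizer(tokenizer_file)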

What do you think, would it make sense to update the code here?
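
If the code gets updated, a quick round-trip check might help confirm that the new `encode`/`decode` pair behaves as expected. Below is a minimal sketch, assuming only that the tokenizer object exposes `encode(text)` and `decode(ids)` like the class above. `WhitespaceStub` and `failed_round_trips` are hypothetical names made up for this example; the stub is NOT SentencePiece and is only there so the snippet runs without a model file.

```python
class WhitespaceStub:
    """Hypothetical stand-in tokenizer (not SentencePiece).

    Assigns one integer ID per whitespace-separated word, in order of
    first appearance, so the check below can run without a model file.
    """

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = []

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self.word_to_id:
                self.word_to_id[word] = len(self.id_to_word)
                self.id_to_word.append(word)
            ids.append(self.word_to_id[word])
        return ids

    def decode(self, ids):
        return " ".join(self.id_to_word[i] for i in ids)


def failed_round_trips(tokenizer, texts):
    """Return the texts for which decode(encode(text)) differs from text."""
    return [t for t in texts if tokenizer.decode(tokenizer.encode(t)) != t]


stub = WhitespaceStub()
samples = ["Every effort moves you", "hello world hello"]
failures = failed_round_trips(stub, samples)
print(failures)  # an empty list means every sample round-tripped
```

With the real class, `LlamaTokenizer(tokenizer_file)` would be passed in place of the stub. Exact equality may be too strict there, since SentencePiece can normalize text during encoding, so any reported failures would need a closer look rather than being treated as bugs right away.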

What operating system are you using?

None

Where do you run your code?

None

Environment

Jun 15 '25 11:06 d-kleine
