                        SP en-/decoding
Bug description
Issue reported here: https://livebook.manning.com/forum?product=raschka&comment=578419
This is the affected code: https://github.com/rasbt/LLMs-from-scratch/blob/78bbcb364305f7f59e0954cbe3d5e24fd36ef249/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb#L1215-L1233
The affected code uses the older SentencePiece encoding/decoding calls for pieces (tokens) and token IDs. A newer (and more intuitive) variant would be:
import sentencepiece as spm

class LlamaTokenizer:
    def __init__(self, tokenizer_file):
        # Load the pretrained SentencePiece model (e.g., Llama's tokenizer.model)
        sp = spm.SentencePieceProcessor()
        sp.load(tokenizer_file)
        self.tokenizer = sp

    def encode(self, text):
        # Text -> list of token IDs
        return self.tokenizer.encode(text, out_type=int)

    def decode(self, ids):
        # List of token IDs -> text
        return self.tokenizer.decode(ids)

tokenizer = LlamaTokenizer(tokenizer_file)
What do you think, would it make sense to update the code here?
What operating system are you using?
None
Where do you run your code?
None
Environment