Training Run - New Tokenizer

Open dustinwloring1988 opened this issue 9 months ago • 1 comments

Hello, I am trying to reproduce this training run, but with the Llama 3 tokenizer (tiktoken) and a few changes. I would be OK with training a tiktoken tokenizer from scratch if needed, but I could not find the code to do so. My plan is to add Fill-In-the-Middle (FIM) tokens and then train on two kinds of pretraining datasets: one for next-token prediction and one for FIM. I figured a small model of this size would be great for testing.
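To make the FIM part concrete, here is a minimal sketch of the data transformation in plain Python. It is only an illustration under stated assumptions: the sentinel strings `<|fim_prefix|>`, `<|fim_suffix|>`, and `<|fim_middle|>` are hypothetical names (they are not part of the stock Llama 3 vocabulary and would have to be added to the tokenizer as special tokens), the prefix-suffix-middle ordering is one common choice, and `to_fim` and `fim_rate` are made-up names, not anything from this repository.

```python
import random

# Hypothetical sentinel strings; they would need to be registered as
# special tokens in whichever tokenizer is used.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def to_fim(text, rng, fim_rate=0.5):
    """Return `text` unchanged (ordinary next-token sample) or rearranged
    into prefix-suffix-middle order (FIM sample), chosen at random."""
    if len(text) < 2 or rng.random() >= fim_rate:
        return text
    # Pick two distinct cut points and split the document into three spans.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    # The model sees prefix and suffix first, then learns to emit the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    for doc in ["def add(a, b):\n    return a + b\n", "hello world"]:
        print(repr(to_fim(doc, rng)))
```

Applying the transformation per document with a `fim_rate` between 0 and 1 would mix both objectives in a single dataset; keeping two separate datasets, as described above, corresponds to running it with `fim_rate=1.0` on one and `fim_rate=0.0` on the other.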

If anyone has more information on either approach, I would appreciate it. Also, if there is better training documentation for this project, I would be interested in a link.

dustinwloring1988 • May 20 '24 12:05