llama2.c
Extract dataset functionality for easy extensibility
Summary of changes:
- Added `dataset.py` with a `Dataset` base class. It encapsulates downloading and iterating over examples in files. The class has three methods: `download()`, `list_files()`, and `examples_of()`.
- Moved the download functionality from `tinystories.py` to `Dataset`.
- These functions now receive a `Dataset` as an argument: `train_vocab()`, `pretokenize()`, `process_shard()`.
- Pre-tokenized files are now written to `tokenized_{vocab_size}` directories. Files tokenized by the Llama 2 tokenizer are written to the `tokenized_llama2` directory.
- Wrapped the TinyStories dataset in a `TinyStories` class in `dataset.py`.
- Added a new `SQLCreateContext` dataset, based on sql-create-context, as an example of extensibility.
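
For reviewers who want a quick picture of the shape of this change, here is a minimal sketch. It is not the code in this PR: only the names `Dataset`, `download()`, `list_files()`, `examples_of()`, `tokenized_{vocab_size}`, and `tokenized_llama2` come from the summary above. The method signatures, the `ToyJsonDataset` subclass, the helpers `tokenized_dir_name()` and `collect_examples()`, and the rule that `vocab_size == 0` selects the Llama 2 tokenizer are all assumptions made for illustration.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Iterator, List


class Dataset(ABC):
    """Hypothetical base class; method names follow the PR summary, signatures are assumed."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir

    @abstractmethod
    def download(self) -> None:
        """Fetch the raw files into self.data_dir."""

    @abstractmethod
    def list_files(self) -> List[str]:
        """Return the paths of the raw data files."""

    @abstractmethod
    def examples_of(self, path: str) -> Iterator[str]:
        """Yield the text examples stored in one file."""


class ToyJsonDataset(Dataset):
    """Illustrative subclass: writes one small JSON file locally instead of downloading."""

    def download(self) -> None:
        os.makedirs(self.data_dir, exist_ok=True)
        path = os.path.join(self.data_dir, "shard00.json")
        if not os.path.exists(path):
            with open(path, "w") as f:
                json.dump([{"text": "hello world "}, {"text": "once upon a time"}], f)

    def list_files(self) -> List[str]:
        return sorted(
            os.path.join(self.data_dir, name)
            for name in os.listdir(self.data_dir)
            if name.endswith(".json")
        )

    def examples_of(self, path: str) -> Iterator[str]:
        with open(path) as f:
            for record in json.load(f):
                yield record["text"].strip()


def tokenized_dir_name(vocab_size: int) -> str:
    # Assumption: vocab_size == 0 means "use the Llama 2 tokenizer".
    return "tokenized_llama2" if vocab_size == 0 else f"tokenized_{vocab_size}"


def collect_examples(dataset: Dataset) -> List[str]:
    """Stand-in for a function such as pretokenize() that now receives a Dataset."""
    dataset.download()
    return [text for path in dataset.list_files() for text in dataset.examples_of(path)]


if __name__ == "__main__":
    toy = ToyJsonDataset(tempfile.mkdtemp())
    print(collect_examples(toy))
    print(tokenized_dir_name(4096))
```

The point of the design is that functions like `collect_examples()` depend only on the three base-class methods, so adding a new dataset means writing one subclass rather than editing the tokenization code.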
To keep the git diff as small as possible, I have not renamed the main entry point file, `tinystories.py`. In the future, it should be renamed to something like `prepare.py`.
`train.py` is not affected by this PR.