llama2.c

Extract dataset functionality for easy extensibility

Open alxkolm opened this issue 2 years ago • 0 comments

Summary of changes:

  1. Added dataset.py with a Dataset base class that encapsulates downloading the data and iterating over the examples in its files. The class has three methods: download(), list_files(), and examples_of().
  2. Moved the download functionality from tinystories.py into Dataset.
  3. train_vocab(), pretokenize(), and process_shard() now receive a Dataset as an argument.
  4. Pre-tokenized files are now written to tokenized_{vocab_size} directories. Files tokenized with the Llama 2 tokenizer are written to the tokenized_llama2 directory.
  5. Wrapped the TinyStories dataset in a TinyStories class in dataset.py.
  6. Added a new SQLCreateContext dataset, based on sql-create-context, as an example of extensibility.
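
For readers who want a feel for the shape of this abstraction, here is a minimal sketch of what such a base class could look like. It is not the code from this PR: only the names Dataset, download(), list_files(), and examples_of() and the two directory names come from the summary above. The constructor argument, the JsonLinesDataset subclass, the iter_examples() and tokenized_dir_name() helpers, and the convention that a vocab_size of 0 selects the Llama 2 tokenizer are all assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator, List


class Dataset(ABC):
    """Sketch of a dataset base class; signatures are assumed, not taken from the PR."""

    def __init__(self, data_dir: str) -> None:
        self.data_dir = Path(data_dir)

    @abstractmethod
    def download(self) -> None:
        """Fetch the raw files into data_dir."""

    @abstractmethod
    def list_files(self) -> List[Path]:
        """Return the raw data files (shards) in a stable order."""

    @abstractmethod
    def examples_of(self, path: Path) -> Iterator[str]:
        """Yield the text examples stored in one file."""


class JsonLinesDataset(Dataset):
    """Hypothetical subclass: one JSON object per line, with the text under "text"."""

    def download(self) -> None:
        # A real subclass would fetch and unpack files here; this one only
        # makes sure the directory exists.
        self.data_dir.mkdir(parents=True, exist_ok=True)

    def list_files(self) -> List[Path]:
        return sorted(self.data_dir.glob("*.jsonl"))

    def examples_of(self, path: Path) -> Iterator[str]:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    yield json.loads(line)["text"]


def tokenized_dir_name(vocab_size: int) -> str:
    """Pick the output directory name (assumes vocab_size == 0 means Llama 2)."""
    return "tokenized_llama2" if vocab_size == 0 else f"tokenized_{vocab_size}"


def iter_examples(dataset: Dataset) -> Iterator[str]:
    """Walk every example of a dataset using only the three base-class methods."""
    dataset.download()
    for path in dataset.list_files():
        yield from dataset.examples_of(path)
```

With this shape, functions such as train_vocab() or pretokenize() only need the three methods, so supporting a new corpus would come down to writing one more subclass.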

To keep the git diff as simple as possible, I have not renamed the main entry point file, tinystories.py. In the future, it should be renamed to something like prepare.py.

train.py is not affected by this PR.

alxkolm · Aug 26 '23 13:08