
Deleting Conda/Python as a dependency entirely to dramatically decrease "latency to step"

karpathy opened this issue 1 year ago • 4 comments

Following up on this tweet; copy-pasting it here and creating an Issue as a TODO.

""" The thing that makes this a bit complicated right now is the start latency. What bloats up the setup time right now is the dataset and its tokenization, which is all done in Python right now. Installing huggingface datasets, downloading FineWeb 10B and tokenizing it is currently ~1 hr. I think I have to look into precomputing all of this and just saving the final .bin files (20GB) of tokens somewhere (S3 or so?). You could imagine fetching data shards asynchronously while the training started. This would completely eliminate any Python dependency.
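The "fetch data shards asynchronously while training starts" idea could be sketched as a background thread filling a bounded queue while the training loop consumes it. This is a sketch, not llm.c code; the shard names and the `fetch_fn` callback are placeholders (a real version would download the .bin files from S3).

```python
# Sketch of asynchronous shard prefetching: a background thread fetches the
# next shards while training consumes the current one. Shard names and
# fetch_fn are illustrative placeholders, not llm.c's actual layout.
import threading
import queue

class ShardPrefetcher:
    def __init__(self, shard_names, fetch_fn, depth=2):
        # fetch_fn(name) -> local path of the downloaded shard
        self.names = shard_names
        self.fetch = fetch_fn
        self.q = queue.Queue(maxsize=depth)  # bounded: caps disk/RAM ahead of training
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        for name in self.names:
            self.q.put(self.fetch(name))  # blocks once `depth` shards are ready
        self.q.put(None)  # sentinel: no more shards

    def __iter__(self):
        while (path := self.q.get()) is not None:
            yield path

if __name__ == "__main__":
    # Usage with a stub fetch; a real fetch would download and unzip a shard.
    shards = [f"fineweb_{i:04d}.bin" for i in range(3)]
    for path in ShardPrefetcher(shards, fetch_fn=lambda n: "/tmp/" + n):
        pass  # train_on(path)
```

The bounded queue is the key design choice: prefetch depth 2 means training can start as soon as the first shard lands, while downloads never run more than two shards ahead.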

The next slightly annoying thing is cuDNN, which is a 2GB download and installation, just to get the flash attention kernel. And it compiles for 1.5 minutes. But NVIDIA reached out and mentioned they are trying to bring this down a lot.

In principle, the code should compile and run roughly instantaneously. """

TLDR I think I'll pre-tokenize FineWeb100B with GPT-2 tokenizer, zip up the .bin shards, and put them up somewhere (e.g. S3?). And then we could just download, unzip, and directly train without any Python involvement at all.
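The "pre-tokenize and save .bin shards" step amounts to writing flat files of uint16 token ids (GPT-2's vocab of 50257 fits in 16 bits, which is why a 100M-token shard is ~191MB with a header). A minimal sketch, assuming a 256-int32 header layout for illustration — check llm.c's dataloader for the exact format it expects:

```python
# Sketch of writing/reading a pre-tokenized .bin shard of uint16 GPT-2 token
# ids. The header layout (magic, version, token count, zero padding) is an
# assumption for illustration, not necessarily llm.c's exact format.
import struct

def write_bin_shard(path, tokens, magic=20240520, version=1):
    assert all(0 <= t < 2**16 for t in tokens)  # GPT-2 ids fit in uint16
    header = [magic, version, len(tokens)] + [0] * 253  # 256 * int32 header
    with open(path, "wb") as f:
        f.write(struct.pack("<256i", *header))
        f.write(struct.pack(f"<{len(tokens)}H", *tokens))  # little-endian uint16

def read_bin_shard(path):
    with open(path, "rb") as f:
        header = struct.unpack("<256i", f.read(256 * 4))
        n = header[2]
        return list(struct.unpack(f"<{n}H", f.read(n * 2)))
```

With 2 bytes per token, 100M tokens is ~191MB per shard, matching the numbers below; the C training loop can then mmap or fread these files with no Python involved.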

TODO think through a bit.

karpathy avatar May 28 '24 19:05 karpathy

FineWeb100B is 1010 files total; these are raw .bin shards of 100M tokens each

  • Each is of size 191MB
  • Zipped, each is 134MB

134MB * 1010 files = 135340MB ~= 135GB

karpathy avatar May 28 '24 19:05 karpathy

Have you played with the `streaming` parameter? `load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10", split='train', streaming=False, num_proc=28)` I was going to use it, but I have already downloaded 500GB of files.

banyan-god avatar May 28 '24 23:05 banyan-god

(I used streaming originally but then started getting some errors in the tokenization workers when a request randomly fails, so I took it out)
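One way to keep streaming usable despite random request failures is to wrap the iterator factory in a retry loop that re-opens the stream and fast-forwards past already-consumed items. This is a generic sketch of that mitigation, not what llm.c or `datasets` does; `make_iter` stands in for something like a fresh `load_dataset(..., streaming=True)` iterator.

```python
# Sketch: resume a flaky streaming iterator after transient errors.
# make_iter() must return a fresh iterator over the dataset from the start.
import time

def robust_stream(make_iter, max_retries=5, backoff=0.0):
    consumed = 0  # items already yielded to the caller
    for attempt in range(max_retries):
        try:
            for i, item in enumerate(make_iter()):
                if i < consumed:
                    continue  # fast-forward past items we already yielded
                consumed += 1
                yield item
            return  # stream finished cleanly
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after max_retries attempts
            time.sleep(backoff)  # brief pause before re-opening the stream
```

The fast-forward makes retries exactly-once from the consumer's point of view, at the cost of re-downloading the skipped prefix; for large datasets a checkpointable/skippable stream would be cheaper.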

karpathy avatar May 28 '24 23:05 karpathy

I do something like this. It is not very efficient since I am encoding on the fly, but I am planning to implement a thread that tokenizes and buffers so tokens are readily available: https://github.com/banyan-god/llama2.c/blob/master/finewebllama2.py
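The "thread that tokenizes and buffers" plan could look like a producer/consumer pair: one thread encodes documents into a bounded queue while the training loop pulls ready token lists. A minimal sketch, with a placeholder `encode` standing in for a real GPT-2 tokenizer:

```python
# Sketch of background tokenization with a bounded buffer: a producer thread
# encodes documents while the consumer trains. encode() is a toy stand-in for
# a real tokenizer (e.g. GPT-2 BPE).
import threading
import queue

def encode(text):
    return [ord(c) for c in text]  # placeholder: one "token" per character

def buffered_tokens(docs, bufsize=8):
    q = queue.Queue(maxsize=bufsize)  # backpressure: producer stays ~bufsize ahead
    SENTINEL = object()

    def producer():
        for doc in docs:
            q.put(encode(doc))  # blocks when the buffer is full
        q.put(SENTINEL)  # signal end of input

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not SENTINEL:
        yield item
```

Because tokenization runs in a thread while the GPU does the training step, the encode cost overlaps with compute instead of stalling the loop.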

banyan-god avatar May 28 '24 23:05 banyan-god