[WIP] Simplified preparation of pretraining datasets
The idea is that data modules which expose `prepare_data` can be invoked ahead of time to prepare their data. For in-memory datasets (e.g. finetuning) this is a no-op and not required, but for pretraining datasets (terabytes of data) it is very useful, since the preparation can be scaled out to a large cluster with a single command:
litgpt prepare --data TinyLlama --tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf
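For illustration, here is a minimal sketch of what such a `prepare` entry point could do. The function name `prepare` and the exact `connect`/`prepare_data` wiring are assumptions for the sketch; the actual CLI plumbing may differ.

```python
# Hypothetical sketch of a `litgpt prepare` entry point (not the actual CLI code).
from pathlib import Path

from litgpt.data import TinyLlama  # any data module exposing prepare_data()
from litgpt.tokenizer import Tokenizer


def prepare(data: TinyLlama, tokenizer_dir: Path) -> None:
    # Attach the tokenizer so the data module can tokenize raw text during preparation.
    data.connect(tokenizer=Tokenizer(tokenizer_dir))
    # For in-memory (finetuning) datasets this is a no-op; for pretraining
    # datasets it tokenizes and chunks the raw corpus up front.
    data.prepare_data()


if __name__ == "__main__":
    prepare(TinyLlama(), Path("checkpoints/meta-llama/Llama-2-7b-hf"))
```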
This is currently blocked by the fact that two `optimize` calls cannot be run together. In the meantime, the tutorials could suggest `python -m litgpt.data.prepare_*` for people who prepare data externally.