[WIP] Simplified preparation of pretraining datasets
The idea is that data modules which expose `prepare_data` can be invoked ahead of time to prepare their data. For in-memory datasets (e.g. finetuning) this is a no-op and not required, but for pretraining datasets (terabytes of data) it is very useful, since the preparation can be scaled out to a large cluster with a single command:
litgpt prepare --data TinyLlama --tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf
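For illustration, here is a minimal sketch of what such a `prepare` entry point could do. The function name `prepare` and the exact `connect`/`prepare_data` wiring are assumptions for the sketch; the actual CLI plumbing may differ.

```python
# Hypothetical sketch of a `litgpt prepare` entry point (not the actual CLI code).
from pathlib import Path

from litgpt.data import TinyLlama  # any data module exposing prepare_data()
from litgpt.tokenizer import Tokenizer


def prepare(data: TinyLlama, tokenizer_dir: Path) -> None:
    # Attach the tokenizer so the data module can tokenize raw text during preparation.
    data.connect(tokenizer=Tokenizer(tokenizer_dir))
    # For in-memory (finetuning) datasets this is a no-op; for pretraining
    # datasets it tokenizes and chunks the raw corpus up front.
    data.prepare_data()


if __name__ == "__main__":
    prepare(TinyLlama(), Path("checkpoints/meta-llama/Llama-2-7b-hf"))
```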
This is currently blocked by the fact that two `optimize` calls cannot be run together. In the meantime, the tutorials could suggest `python -m litgpt.data.prepare_*` for people who prepare data externally.