
Taking a few days to complete SlimPajama "Train" data

Open Ahmedhasssan opened this issue 1 year ago • 1 comments

Hi, I just want to know how long it takes to finish the "train" data preparation using this script.

python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0

I have been running this code for the last 4 days using one A100 GPU.

Thanks

Best regards, Ahmed

Ahmedhasssan avatar Jan 23 '24 20:01 Ahmedhasssan

Hi, I think the speed depends on how many CPU cores you have. When we use 128 cores, it takes about a day.

ChaosCodes avatar Jan 24 '24 02:01 ChaosCodes
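
As a rough way to apply the figure above to another machine, here is a minimal sketch that scales the reported "128 cores, about a day" to the local core count. It assumes the preparation step is CPU-bound and scales linearly with cores, which the thread does not confirm; `estimate_days`, `REFERENCE_CORES`, and `REFERENCE_DAYS` are hypothetical names used only for illustration.

```python
import os

# Reference point reported in the thread: about one day on 128 CPU cores.
REFERENCE_CORES = 128
REFERENCE_DAYS = 1.0


def estimate_days(cores: int) -> float:
    """Estimate wall-clock days, ASSUMING perfectly linear scaling with cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return REFERENCE_DAYS * REFERENCE_CORES / cores


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to a single core.
    cores = os.cpu_count() or 1
    print(f"{cores} cores: roughly {estimate_days(cores):.1f} days (linear-scaling guess)")
```

Under that assumption, a run that has already taken 4 days would be consistent with roughly 32 cores doing the work. Disk throughput and other overheads could make the real time longer.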

Hey, which exact versions of torch, lightning, and torchvision are you using? I did a fresh pip install -r requirements.txt in a new conda env, but I still get a ton of torch CUDA-related errors.

StephennFernandes avatar Mar 02 '24 17:03 StephennFernandes
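
When comparing environments for a question like the one above, it may help to post the installed versions. The sketch below reads them from package metadata without importing the packages, so it still works if an import would fail on a CUDA error. The distribution names in the list are assumptions (Lightning, for instance, may be installed as `lightning` or `pytorch-lightning`), and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed distribution names; adjust to match requirements.txt.
    names = ["torch", "torchvision", "lightning", "pytorch-lightning"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```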