litgpt
Long hang on Llama2-70b startup
I've been experimenting with Llama 2 on both 7b and 70b, and there is a significant hang at startup before training begins.
On 7b it's ~3-5 minutes, and on 70b it's about 30 minutes. Nothing is logged during this time, so I'm unsure what is actually happening at that point. What causes this hang? And, though it seems unlikely, is there any way to reduce it?
Which command are you running and have you made any changes to the repo? What's your hardware setup and PyTorch version?
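If it helps to gather those details in one go, here is a minimal sketch that prints the Python, OS, PyTorch, and GPU information. The helper name `collect_env` is hypothetical and not part of litgpt. It assumes PyTorch may or may not be importable in the current environment and leaves those fields empty if it is not.

```python
import platform
import sys


def collect_env() -> dict:
    """Collect Python, OS, PyTorch, and GPU details for a bug report.

    Hypothetical helper for illustration; not part of litgpt.
    """
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "torch": None,   # filled in only if PyTorch is importable
        "cuda": None,    # CUDA version PyTorch was built with, if any
        "gpus": [],      # names of the visible CUDA devices
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed here; report what we have.
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships a fuller report via `python -m torch.utils.collect_env`, which may be simpler to paste here along with the exact command you ran.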