
Long hang on Llama2-70b startup

Open • Chris113113 opened this issue 1 year ago • 1 comment

I've been experimenting with Llama 2 at both 7B and 70B, and at startup there is a significant hang before training begins.

On 7B it's ~3-5 minutes, and on 70B it's about 30 minutes. Nothing is logged during this period, so I'm unsure what is actually happening. What causes this hang? And (unlikely) is there any way to reduce it?

Chris113113 • Jan 04 '24 05:01

Which command are you running and have you made any changes to the repo? What's your hardware setup and PyTorch version?

carmocca • Jan 19 '24 16:01