Joe Mayer
@cyk1337 The PR that was created for this bug will allow you to add `use_node_local_storage=True` to the checkpointing section of the ds_config. By adding this flag, DeepSpeed will save checkpoints...
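A minimal sketch of what the ds_config might look like with this flag enabled, assuming the option lives under a `checkpoint` section as a JSON boolean (other keys shown are placeholders, not part of the PR):

```json
{
  "train_batch_size": 8,
  "checkpoint": {
    "use_node_local_storage": true
  }
}
```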
@jeffra Is this PR still relevant? If so, I can revive these changes with the current master branch.
@tjruwase Are these changes still wanted? If so, I can revive them with the current develop branch.
@Aillian Based on the screenshot, it looks to be stuck downloading the Llama model. Did the download ever complete?
It gets past the stuck point on a single GPU?
@kai-0430 Can you provide the output of `nvidia-smi topo -m`?
This seems to be a systems issue. If you run without DeepSpeed, does the hang also occur?
@Heathcliff-Zhao I am struggling to repro your code.
@Heathcliff-Zhao There are no tokenizer files in it.
@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo? `--model_name_or_path ./opensource` is incorrect because that directory does not exist, and specifying `--model_name_or_path THUDM/chatglm3-6b` to download the...