Joe Mayer

65 comments by Joe Mayer

@cyk1337 The PR that was created for this bug will allow you to add `use_node_local_storage=True` to the checkpointing section of the ds_config. By adding this flag DeepSpeed will save checkpoints...
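For illustration, a minimal sketch of where that flag would sit, assuming the "checkpointing section" is the `checkpoint` block of the DeepSpeed config. The batch-size entry is a placeholder and does not come from the thread.

```python
import json

# Hypothetical minimal DeepSpeed config. Only the "checkpoint" block
# relates to the comment above; the other key is a placeholder.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value (assumption)
    "checkpoint": {
        # Flag named in the comment; assumed to live in the "checkpoint" block.
        "use_node_local_storage": True,
    },
}

# Serialize to JSON, e.g. to paste into a ds_config.json file.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```

In a JSON config file the Python `True` is written as lowercase `true`, which `json.dumps` handles automatically.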

@jeffra Is this PR still relevant? If so I can revive these changes with the current master branch.

@tjruwase Are these changes we still want? If so I can revive them with the current develop branch.

@Aillian Based on the screenshot, it looks to be stuck downloading the Llama model. Did the download ever complete?

Does it get past the stuck point on a single GPU?

@kai-0430 Can you provide the output of `nvidia-smi topo -m`?

This seems to be a systems issue. If you run without DeepSpeed, does the hang also occur?

@Heathcliff-Zhao I am struggling to repro your code. ![image](https://github.com/microsoft/DeepSpeed/assets/114769929/cefd8a5d-c46f-4a0a-94bf-a5bf79bac924)

@Heathcliff-Zhao There are no tokenizer files in it.

@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo? `--model_name_or_path ./opensource` is incorrect because that directory does not exist, and specifying `--model_name_or_path THUDM/chatglm3-6b` to download the...