Joe Mayer
@cyk1337 The PR that was created for this bug will allow you to add `use_node_local_storage=True` to the checkpointing section of the ds_config. By adding this flag, DeepSpeed will save checkpoints...
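A minimal sketch of what the ds_config might look like with this flag enabled, assuming the option lives under a `checkpoint` section as a JSON boolean (other keys shown are placeholders, not part of the PR):

```json
{
  "train_batch_size": 8,
  "checkpoint": {
    "use_node_local_storage": true
  }
}
```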
@jeffra Is this PR still relevant? If so, I can revive these changes with the current master branch.
@tjruwase Are these changes still wanted? If so, I can revive them with the current develop branch.
@Aillian Based on the screenshot, it looks to be stuck downloading the Llama model. Did the download ever complete?
It gets past the stuck point on a single GPU?
@kai-0430 Can you provide the output of `nvidia-smi topo -m`?
This seems to be a systems issue. If you run without DeepSpeed, does the hang also occur?
@Heathcliff-Zhao I am struggling to repro your code.
@Heathcliff-Zhao There are no tokenizer files in it.
@Heathcliff-Zhao What is the command to repro with a fresh checkout of your repo? `--model_name_or_path ./opensource` is incorrect because that directory does not exist, and specifying `--model_name_or_path THUDM/chatglm3-6b` to download the...