Olatunji Ruwase
This is because your model has not been pre-processed by a TP framework like Megatron; ZeRO-Inference does not perform TP slicing on any model.
> Thanks! But how can I make it work? Do you have example command?

Below are commands for single-GPU inference with KV-cache offload: https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference#token-generation-with-zero-inference
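For orientation, here is a minimal sketch of the ZeRO-Inference setup that the linked example builds, assuming a Hugging Face causal LM on a single GPU. The model name is a placeholder, and the KV-cache offload itself is enabled through flags in that repo's script; this sketch only shows the ZeRO stage-3 parameter-offload config. Launch with `deepspeed --num_gpus 1 <script>.py`.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; substitute your model

# ZeRO stage 3 with parameters offloaded to CPU: this is what lets a
# large model run token generation on a single GPU.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # required config field, unused for inference
}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tokenizer("DeepSpeed ZeRO-Inference", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```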
Glad to hear that kv-cache-offload performance might be good for your scenario. Yes, you are correct that there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have...
@caoyu-noob, you can use the `zero_to_fp32.py` script to convert the zero3 checkpoints into a regular pytorch checkpoint. You can find documentation of this script and other checkpoint conversion options [here](https://www.deepspeed.ai/tutorials/zero/#extracting-weights).
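If you prefer to do the conversion programmatically, the same tutorial documents helper functions in `deepspeed.utils.zero_to_fp32`; a minimal sketch (the checkpoint path is a placeholder):

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "./output/checkpoint-100"  # placeholder: your ZeRO-3 checkpoint folder

# Consolidates the partitioned ZeRO-3 shards into a regular fp32 state dict on CPU
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, "pytorch_model.bin")
```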
@Pattaro, this is expected with parameter partitioning in ZeRO stage 3. Parameters are fetched on demand right before use, so there is no cause for alarm. Are you seeing any training issues otherwise?
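In case it helps, this is why partitioned parameters look "empty" when inspected outside a forward/backward pass, and how to gather one for inspection. A sketch assuming `model` was initialized under ZeRO stage 3:

```python
import deepspeed

p = next(model.parameters())
print(p.shape)     # torch.Size([0]): the data is partitioned across ranks
print(p.ds_shape)  # the true shape, tracked by ZeRO stage 3

# Temporarily reassemble the full parameter on every rank to inspect it
with deepspeed.zero.GatheredParameters(p):
    print(p.shape)  # full shape while inside the context
```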
@Rainbowman0, to help with further investigation, can you try running the communication micro-benchmarks here? https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
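While you set that up, a quick hand-rolled timing check can also confirm whether raw `all_reduce` bandwidth looks sane. This is not the linked benchmark suite, just a minimal sketch to run with `deepspeed <script>.py` on the same nodes:

```python
import os
import time
import torch
import torch.distributed as dist
import deepspeed

deepspeed.init_distributed()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

# 128 MB fp16 buffer
x = torch.ones(64 * 1024 * 1024, dtype=torch.float16, device="cuda")

for _ in range(5):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg = (time.time() - start) / iters

size_gb = x.numel() * x.element_size() / 1e9
if dist.get_rank() == 0:
    print(f"all_reduce {size_gb:.2f} GB: {avg * 1e3:.2f} ms avg, {size_gb / avg:.1f} GB/s")
```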
@mosheisland, apologies for the delay in merging this PR. Can you please help resolve the conflicts? Thanks!
@QingtaoLi1, are you able to provide full repro steps?
@SwayamInSync, @Smu-Tan, @QingtaoLi1, @Kamichanw, @AceMcAwesome77, @desire2020 please try #6976
> One quick fix, worked in my case setting `overlap_comm` to false

@SwayamInSync, can you please share your repro to help us debug why `overlap_comm` is triggering this issue? Thanks!
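For anyone landing here in the meantime, the quoted workaround is just a ZeRO config change; expressed as a config fragment (keep whatever stage you already run):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,             # your existing stage
        "overlap_comm": False,  # the workaround: disable comm/compute overlap
    },
}
```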