Olatunji Ruwase
This is because your model has not been pre-processed by a TP framework like Megatron; ZeRO-Inference does not perform TP slicing on any model.
> Thanks! But how can I make it work? Do you have example command?

Below are commands for single-GPU inference with KV-cache offload: https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference#token-generation-with-zero-inference
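For orientation, here is a minimal sketch of the ZeRO-Inference setup that the linked example builds, assuming a Hugging Face causal LM on a single GPU. The model name is a placeholder, and the KV-cache offload itself is enabled through flags in that repo's script; this sketch only shows the ZeRO stage-3 parameter-offload config. Launch with `deepspeed --num_gpus 1 <script>.py`.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; substitute your model

# ZeRO stage 3 with parameters offloaded to CPU: this is what lets a
# large model run token generation on a single GPU.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # required config field, unused for inference
}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tokenizer("DeepSpeed ZeRO-Inference", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```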
Glad to hear that kv-cache-offload performance might be good for your scenario. Yes, you are correct that there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have...
@caoyu-noob, you can use the `zero_to_fp32.py` script to convert the zero3 checkpoints into a regular pytorch checkpoint. You can find documentation of this script and other checkpoint conversion options [here](https://www.deepspeed.ai/tutorials/zero/#extracting-weights).
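If you prefer to do the conversion programmatically, the same tutorial documents helper functions in `deepspeed.utils.zero_to_fp32`; a minimal sketch (the checkpoint path is a placeholder):

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "./output/checkpoint-100"  # placeholder: your ZeRO-3 checkpoint folder

# Consolidates the partitioned ZeRO-3 shards into a regular fp32 state dict on CPU
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, "pytorch_model.bin")
```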
@Pattaro, this is expected with parameter partitioning in ZeRO stage 3. Parameters are fetched on demand right before use, so there is no cause for alarm. Are you seeing any training issues otherwise?
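In case it helps, this is why partitioned parameters look "empty" when inspected outside a forward/backward pass, and how to gather one for inspection. A sketch assuming `model` was initialized under ZeRO stage 3:

```python
import deepspeed

p = next(model.parameters())
print(p.shape)     # torch.Size([0]): the data is partitioned across ranks
print(p.ds_shape)  # the true shape, tracked by ZeRO stage 3

# Temporarily reassemble the full parameter on every rank to inspect it
with deepspeed.zero.GatheredParameters(p):
    print(p.shape)  # full shape while inside the context
```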
@Rainbowman0, to help with further investigation, can you try running the communication micro-benchmarks here? https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
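While you set that up, a quick hand-rolled timing check can also confirm whether raw `all_reduce` bandwidth looks sane. This is not the linked benchmark suite, just a minimal sketch to run with `deepspeed <script>.py` on the same nodes:

```python
import os
import time
import torch
import torch.distributed as dist
import deepspeed

deepspeed.init_distributed()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

# 128 MB fp16 buffer
x = torch.ones(64 * 1024 * 1024, dtype=torch.float16, device="cuda")

for _ in range(5):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg = (time.time() - start) / iters

size_gb = x.numel() * x.element_size() / 1e9
if dist.get_rank() == 0:
    print(f"all_reduce {size_gb:.2f} GB: {avg * 1e3:.2f} ms avg, {size_gb / avg:.1f} GB/s")
```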
@mosheisland, apologies for the delay in merging this PR. Can you please help resolve the conflicts? Thanks!
@QingtaoLi1, are you able to provide full repro steps?
@SwayamInSync, @Smu-Tan, @QingtaoLi1, @Kamichanw, @AceMcAwesome77, @desire2020 please try #6976
> One quick fix, worked in my case setting `overlap_comm` to false

@SwayamInSync, can you please share your repro to help us debug why `overlap_comm` is triggering this issue? Thanks!
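For anyone landing here in the meantime, the quoted workaround is just a ZeRO config change; expressed as a config fragment (keep whatever stage you already run):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,             # your existing stage
        "overlap_comm": False,  # the workaround: disable comm/compute overlap
    },
}
```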