Olatunji Ruwase
> Still no luck, @tjruwase . Is there a known issue with CUDA OOM in DeepSpeed tests? I only added a few layers to SimpleMoEModel that is not that widely...
The `offload_fp32_gradients` that are swapped out are the [fp32 versions](https://github.com/microsoft/DeepSpeed/blob/c88af2143248e4655d401f9231317f3c76018057/deepspeed/runtime/zero/stage3.py#L1247) of the gradients needed for `optimizer.step()` computation. On the other hand, `self.__param_id_to_grad_partition` holds 16-bit gradients.
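For intuition only (this is a generic mixed-precision sketch, not DeepSpeed internals, and all names are placeholders), this is why a 32-bit copy of each gradient has to exist at `optimizer.step()` time even though the model and its gradients are kept in 16-bit:

```python
# Minimal sketch, not DeepSpeed code: the optimizer update is done on fp32
# master copies to avoid precision loss, so fp32 gradients are materialized.
import torch

param_fp16 = torch.randn(4, dtype=torch.float16)   # 16-bit model weight
master_fp32 = param_fp16.float()                    # fp32 master copy of the weight
grad_fp16 = torch.randn(4, dtype=torch.float16)     # 16-bit gradient from backward()

grad_fp32 = grad_fp16.float()                       # fp32 gradient used by the update
lr = 1e-3
master_fp32 -= lr * grad_fp32                       # fp32 weight update
param_fp16.copy_(master_fp32.half())                # write back to the 16-bit weight
```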
The swap mode is enabled when the optimizer state is offloaded to NVMe because both GPU and CPU memory are too small. In that case, there is little benefit to keeping...
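For context, a minimal sketch of a configuration that puts ZeRO-3 into that mode; the NVMe path and batch size are placeholders:

```python
# Sketch of a DeepSpeed config that offloads optimizer state to NVMe under ZeRO-3.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",          # spill optimizer state to NVMe
            "nvme_path": "/local_nvme",  # placeholder path to fast local storage
            "pin_memory": True,
        },
    },
}

# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```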
@platoonpluto, thanks for the clarification. Yes, your observation is correct: we could avoid the swap overhead since the 16-bit gradients are always available in `self.__param_id_to_grad_partition`. One reason for my...
@Yejing-Lai, please help resolve the conflict.
> Hi @tjruwase is this PR under review state or merge state? We are working on Intel Extension for PyTorch release and want to know whether this PR will be...
@jpatel-bdai, all zero stages are expected to match ddp on single gpu runs. So, it appears that you are hitting bugs in zero. Are you able to share detailed steps...
Ideally, we expect zero stages to match ddp in multi-gpu runs, since zero is designed to be a memory-efficient ddp algorithm. In terms of debugging, a first step would be...
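For illustration only (not a prescribed recipe), a sketch of that kind of comparison on a single GPU: stage 0 is DeepSpeed's plain data-parallel mode, so the higher stages can be checked against it. The toy model, data, and hyperparameters are placeholders, and this assumes a recent DeepSpeed release that supports ZeRO with fp32 parameters.

```python
# Sketch: run the same toy model under ZeRO stage 0 (plain data parallel) and
# under higher ZeRO stages, then compare the loss curves.
# Launch with: deepspeed --num_gpus 1 check_zero.py
import torch
import deepspeed

def run(stage):
    torch.manual_seed(0)
    model = torch.nn.Linear(16, 1)
    ds_config = {
        "train_batch_size": 32,
        "optimizer": {"type": "SGD", "params": {"lr": 0.1}},
        "zero_optimization": {"stage": stage},
    }
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config)
    torch.manual_seed(1)
    data = torch.randn(32, 16, device=engine.device)
    target = torch.randn(32, 1, device=engine.device)
    for _ in range(10):
        loss = torch.nn.functional.mse_loss(engine(data), target)
        engine.backward(loss)
        engine.step()
    return loss.item()

# Stage 0 is the data-parallel baseline; stages 1-3 should track it closely.
for stage in (0, 1, 2, 3):
    print(f"stage {stage}: final loss = {run(stage):.6f}")
```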
@Orion-Zheng, are you still having this issue?
> Some more background: I'm working on the RWKV project, a fork, where they save the weights with a copy of `zero_to_fp32.py`.

@freckletonj, apologies for the delayed response here. Is...
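For reference, a sketch of the usual ways to consolidate a ZeRO checkpoint with `zero_to_fp32.py`; the paths below are placeholders:

```python
# Sketch: consolidating a ZeRO checkpoint into a single fp32 state dict.
# "checkpoints/" and "pytorch_model.bin" are placeholder paths.

# Option 1: the standalone script DeepSpeed saves alongside the checkpoint.
#   python zero_to_fp32.py checkpoints/ pytorch_model.bin

# Option 2: the same logic from Python.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints/")  # fp32 weights on CPU
# model.load_state_dict(state_dict)  # load into a CPU copy of the model
```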