Ma, Guokai
I looked at the [place](https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/partitioned_param_coordinator.py#L446-L454) where `__n_available_params` is set to zero. The loop releases params and decreases this variable accordingly. In other words, if the loop didn't release...
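For context, a minimal sketch of the bookkeeping described above (the class, method, and field names are simplified stand-ins, not the actual partitioned_param_coordinator implementation):

```python
# Simplified stand-in for the accounting in the coordinator: releasing a
# param decreases the available-param counter, so if the release loop never
# fires, the counter is only cleared where it is explicitly set to zero.

class CoordinatorSketch:
    def __init__(self):
        self.n_available_params = 0   # stand-in for __n_available_params
        self.inflight = []            # params currently kept on device

    def record_fetch(self, param_id: int, numel: int):
        self.inflight.append((param_id, numel))
        self.n_available_params += numel

    def release_loop(self, max_available: int):
        # Release params until the counter drops under the budget.
        while self.inflight and self.n_available_params > max_available:
            _, numel = self.inflight.pop(0)
            self.n_available_params -= numel
```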
@loadams I have talked to @wenbinc-Bin offline; he agreed to roll back his PR first to address this issue. Then he will submit a more comprehensive fix for the OOM issue. I'll...
When porting existing CUDA-specific code to this API, a good practice is to keep the code backward compatible, e.g. ``` import torch if hasattr(torch, 'cuda'): import torch.cuda as... ```
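A fuller sketch of the backward-compatible pattern the comment points at (the `device_module` alias and the `torch.xpu` fallback are illustrative assumptions, not the exact code from the original comment):

```python
import torch

# Prefer the CUDA backend when present, otherwise fall back to another
# accelerator backend (torch.xpu here, as an example); downstream code only
# touches `device_module`, so it stays backend-agnostic.
if hasattr(torch, "cuda") and torch.cuda.is_available():
    device_module = torch.cuda
    device_name = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device_module = torch.xpu
    device_name = "xpu"
else:
    device_module = None
    device_name = "cpu"

# Example usage: query device count without hard-coding a backend.
if device_module is not None:
    print(f"{device_name} devices:", device_module.device_count())
```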
> @delock , good point. May I know if you are proposing a practice or a feature that we need to implement on the torch side? It's a practice in...
Hi @gyou2021, I like the goal of avoiding repetition of the same logic from L296 to L315, but I am also concerned that models enabled by these lines will not be...
@loadams let me check with gyou on this PR's status.
@loadams my questions are all resolved and I have no further questions for @gyou2021, thanks!
Hi @loadams, is this PR under review? Thanks!
Hi @jclyu123-beep, thanks for asking. AutoTP analyzes the model architecture to figure out how a model shards between cards. In this process AutoTP uses module names (e.g. proj_q, proj_k) as a...
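A rough sketch of the kind of name-based classification described above (the keyword lists and the column/row labels are illustrative assumptions, not AutoTP's actual rules):

```python
import torch.nn as nn

# Illustrative keyword lists; real AutoTP derives its policy from the
# model's module graph, not from a fixed table like this.
COLUMN_PARALLEL_HINTS = ("proj_q", "proj_k", "proj_v", "up_proj", "gate_proj")
ROW_PARALLEL_HINTS = ("o_proj", "down_proj")

def classify_linear(name: str, module: nn.Module):
    """Decide how a Linear layer would be sharded based on its name."""
    if not isinstance(module, nn.Linear):
        return None
    if any(hint in name for hint in COLUMN_PARALLEL_HINTS):
        return "column"  # split output features across cards
    if any(hint in name for hint in ROW_PARALLEL_HINTS):
        return "row"     # split input features across cards
    return None          # keep replicated

# Example: walk a model and print the sharding decision per layer.
def show_plan(model: nn.Module):
    for name, module in model.named_modules():
        plan = classify_linear(name, module)
        if plan is not None:
            print(f"{name}: {plan}-parallel")
```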
The problem statement is: when the optimizer is marked as "muon", both the Adam and Muon optimizers will actually be used -- Muon for >2D weights in hidden layers, Adam for the...
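A minimal sketch of splitting parameters between a Muon group and an Adam group (the `ndim >= 2` cutoff and the embedding/lm_head exclusions follow the common Muon convention and are assumptions here, not the exact rule from this PR):

```python
import torch

def split_param_groups(model: torch.nn.Module):
    """Split parameters into a Muon group and an Adam group.

    Assumes the common Muon convention: matrix-shaped hidden-layer weights
    go to Muon, while embeddings, the LM head, biases, and norms go to Adam.
    """
    muon_params, adam_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim >= 2 and "embed" not in name and "lm_head" not in name:
            muon_params.append(p)
        else:
            adam_params.append(p)  # biases, norms, embeddings, lm_head
    return muon_params, adam_params
```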