Ma, Guokai
I looked at the [place](https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/partitioned_param_coordinator.py#L446-L454) where `__n_available_params` is set to zero. The loop releases params and decreases this variable accordingly. In other words, if the loop didn't release...
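For context, a minimal sketch of the bookkeeping described above (the class, method, and field names are simplified stand-ins, not the actual partitioned_param_coordinator implementation):

```python
# Simplified stand-in for the accounting in the coordinator: releasing a
# param decreases the available-param counter, so if the release loop never
# fires, the counter is only cleared where it is explicitly set to zero.

class CoordinatorSketch:
    def __init__(self):
        self.n_available_params = 0   # stand-in for __n_available_params
        self.inflight = []            # params currently kept on device

    def record_fetch(self, param_id: int, numel: int):
        self.inflight.append((param_id, numel))
        self.n_available_params += numel

    def release_loop(self, max_available: int):
        # Release params until the counter drops under the budget.
        while self.inflight and self.n_available_params > max_available:
            _, numel = self.inflight.pop(0)
            self.n_available_params -= numel
```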
@loadams I have talked to @wenbinc-Bin offline; he agreed to roll back his PR first to address this issue. Then he will submit a more comprehensive fix for the OOM issue. I'll...
When porting existing CUDA-specific code to this API, a good practice is to keep the code backward compatible, e.g. ``` import torch if hasattr(torch, 'cuda'): import torch.cuda as... ```
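A fuller sketch of the backward-compatible pattern the comment points at (the `device_module` alias and the `torch.xpu` fallback are illustrative assumptions, not the exact code from the original comment):

```python
import torch

# Prefer the CUDA backend when present, otherwise fall back to another
# accelerator backend (torch.xpu here, as an example); downstream code only
# touches `device_module`, so it stays backend-agnostic.
if hasattr(torch, "cuda") and torch.cuda.is_available():
    device_module = torch.cuda
    device_name = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device_module = torch.xpu
    device_name = "xpu"
else:
    device_module = None
    device_name = "cpu"

# Example usage: query device count without hard-coding a backend.
if device_module is not None:
    print(f"{device_name} devices:", device_module.device_count())
```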
> @delock , good point. May I know if you are proposing a practice or a feature that we need to implement on the torch side? It's a practice in...
Hi @gyou2021, I like the goal of avoiding repetition of the same logic from L296 to L315, but I am also concerned that models enabled by these lines will not be...
@loadams let me check with gyou on this PR's status.
@loadams my questions are all resolved and I have no further questions for @gyou2021, thanks!
Hi @loadams, is this PR under review? Thanks!
Hi @jclyu123-beep, thanks for asking. AutoTP analyzes the model architecture to figure out how a model shards between cards. In this process AutoTP uses module names (e.g. proj_q, proj_k) as a...
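A rough sketch of the kind of name-based classification described above (the keyword lists and the column/row labels are illustrative assumptions, not AutoTP's actual rules):

```python
import torch.nn as nn

# Illustrative keyword lists; real AutoTP derives its policy from the
# model's module graph, not from a fixed table like this.
COLUMN_PARALLEL_HINTS = ("proj_q", "proj_k", "proj_v", "up_proj", "gate_proj")
ROW_PARALLEL_HINTS = ("o_proj", "down_proj")

def classify_linear(name: str, module: nn.Module):
    """Decide how a Linear layer would be sharded based on its name."""
    if not isinstance(module, nn.Linear):
        return None
    if any(hint in name for hint in COLUMN_PARALLEL_HINTS):
        return "column"  # split output features across cards
    if any(hint in name for hint in ROW_PARALLEL_HINTS):
        return "row"     # split input features across cards
    return None          # keep replicated

# Example: walk a model and print the sharding decision per layer.
def show_plan(model: nn.Module):
    for name, module in model.named_modules():
        plan = classify_linear(name, module)
        if plan is not None:
            print(f"{name}: {plan}-parallel")
```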
The problem statement is: when the optimizer is marked as "muon", both the Adam and Muon optimizers will actually be used -- Muon for >2D weights in hidden layers, Adam for the...
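A minimal sketch of splitting parameters between a Muon group and an Adam group (the `ndim >= 2` cutoff and the embedding/lm_head exclusions follow the common Muon convention and are assumptions here, not the exact rule from this PR):

```python
import torch

def split_param_groups(model: torch.nn.Module):
    """Split parameters into a Muon group and an Adam group.

    Assumes the common Muon convention: matrix-shaped hidden-layer weights
    go to Muon, while embeddings, the LM head, biases, and norms go to Adam.
    """
    muon_params, adam_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim >= 2 and "embed" not in name and "lm_head" not in name:
            muon_params.append(p)
        else:
            adam_params.append(p)  # biases, norms, embeddings, lm_head
    return muon_params, adam_params
```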