Haichen Huang issues

Results: 11 issues by Haichen Huang

[BUG]: Hybrid kernel went wrong when there were both fp16 and fp32 gradients

### 🐛 Describe the bug

When a model has both fp16 and fp32 gradients, hybrid Adam may be unable to update parameters correctly. Since we put all parameters into a list...

bug

[colotensor] add megatron example via colotensor

[moe] refactor routers and fix moe bugs with activation checkpoint

- refactor moe routers
- fix moe bugs with activation checkpoint

[feature] new zero implementation

# _A New ZeRO Implementation_

## Background

In the current version, our ZeRO has a performance issue. The reason is that our asymmetric distribution of chunks makes one process hinder...

[zero] migrate zero1&2

# What's New

ZeRO1 and ZeRO2 optimizers are added. Here are some things to do next:

* correct `clip_grad_norm` with model and pipeline parallelism
* test training efficiency

Run Build and Test

[zero] fix memory leak for zero2

Run Build and Test

[zero] test gradient accumulation

Run Build and Test

[zero] fix unit-tests

Run Build and Test

[hotfix] fix implement error in diffusers

### What's New

Fix `NotImplementedError: Some torch function is incompatible because of its complcated inputs.` when training diffusers.

* add an ignore step for no-grad tensors
* change the...

[zero] add inference mode and its unit test