[TRACKER] Customer support related PR tracker for Intel devices
This issue acts as a tracker for Intel customer-support-related PRs. The purpose is to understand what each PR does and how important it is relative to the other customer-support-related PRs. It also helps us stay aware of which PRs have been merged and how the open ones are progressing.
Under review
- [ ] sequence parallel for uneven heads (see the head-split sketch after this list): https://github.com/microsoft/DeepSpeed/pull/6392 (Open)
- [ ] Enabled Qwen2-MoE Tensor Parallelism (TP) inference: https://github.com/microsoft/DeepSpeed/pull/6551 (Open)
- [ ] Enabled configurable auto Tensor Parallelism (TP) for the inference of diverse models: https://github.com/microsoft/DeepSpeed/pull/6553 (Open)
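The uneven-heads PR addresses sequence parallelism when the attention head count is not divisible by the sequence-parallel world size. As a rough illustration only, a minimal sketch of one possible split policy (the helper name and policy are assumptions, not the PR's actual implementation):

```python
# Hypothetical helper illustrating an uneven head split across
# sequence-parallel ranks; not DeepSpeed's actual code.
def split_heads_unevenly(num_heads: int, sp_world_size: int) -> list:
    """Number of attention heads assigned to each sequence-parallel rank."""
    base, remainder = divmod(num_heads, sp_world_size)
    # The first `remainder` ranks each take one extra head.
    return [base + (1 if rank < remainder else 0) for rank in range(sp_world_size)]

print(split_heads_unevenly(14, 4))  # [4, 4, 3, 3]
```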
Already merged
MoE
- [x] support bf16_optimizer moe expert parallel training and moe EP grad_scale/grad_norm fix: https://github.com/microsoft/DeepSpeed/pull/5259
- [x] Fix a convergence issue in TP topology caused by incorrect grad_norm: https://github.com/microsoft/DeepSpeed/pull/5411
- [x] add MoE top-k (k > 2) gate support (see the gating sketch after this list): https://github.com/microsoft/DeepSpeed/pull/5881
- [x] reduce CPU host overhead when using MoE: https://github.com/microsoft/DeepSpeed/pull/5578
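For the top-k gate item above, the general technique (shown here with plain PyTorch, independent of DeepSpeed's internals) is to keep the k highest-scoring experts per token and renormalize their weights; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# Illustrative top-k (k > 2) gating, not DeepSpeed's implementation:
# keep the k highest-scoring experts per token and renormalize their weights.
def topk_gate(logits: torch.Tensor, k: int = 4):
    """logits: [num_tokens, num_experts] raw gate scores."""
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)
    # Renormalize so the selected experts' weights sum to 1 for each token.
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_probs, topk_idx

gate_weights, expert_ids = topk_gate(torch.randn(8, 16), k=4)
```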
Ulysses
- [x] fix sequence parallel (Ulysses) grad scale for ZeRO-0: https://github.com/microsoft/DeepSpeed/pull/5555
- [x] sequence parallel with communication overlap: https://github.com/microsoft/DeepSpeed/pull/5691
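PR #5691 overlaps the Ulysses all-to-all with independent computation. A minimal sketch of the general pattern, assuming an already-initialized torch.distributed process group (the function name and shapes are illustrative, not DeepSpeed's API):

```python
import torch
import torch.distributed as dist

# Rough sketch of the Ulysses-style exchange: an all-to-all that trades the
# sequence dimension for the head dimension, launched asynchronously so that
# independent work can overlap with the communication.
def ulysses_all_to_all_overlapped(q_local: torch.Tensor, other_work):
    out = torch.empty_like(q_local)
    # async_op=True returns a handle instead of blocking.
    handle = dist.all_to_all_single(out, q_local, async_op=True)
    other_work()   # compute something that does not depend on `out`
    handle.wait()  # block only when the exchanged tensor is actually needed
    return out
```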
AutoTP
- [x] AutoTP for fused QKV weight (see the sharding sketch after this list): https://github.com/microsoft/DeepSpeed/pull/3844
- [x] AutoTP for Qwen: https://github.com/microsoft/DeepSpeed/pull/4902
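The fused-QKV item refers to the fact that a fused [Q; K; V] weight cannot simply be chunked across tensor-parallel ranks; each component has to be sliced per rank and re-fused. A minimal sketch of that slicing under those assumptions (names are illustrative, not DeepSpeed's implementation):

```python
import torch

# Slice each of Q, K, V for this rank separately, then re-fuse them,
# instead of chunking the stacked [3*hidden, hidden] matrix directly.
def shard_fused_qkv(fused_w: torch.Tensor, tp_rank: int, tp_size: int) -> torch.Tensor:
    q, k, v = fused_w.chunk(3, dim=0)            # un-fuse into Q, K, V
    shards = [t.chunk(tp_size, dim=0)[tp_rank]   # per-rank slice of each part
              for t in (q, k, v)]
    return torch.cat(shards, dim=0)              # re-fuse the per-rank QKV

hidden = 64
w = torch.randn(3 * hidden, hidden)
w_rank0 = shard_fused_qkv(w, tp_rank=0, tp_size=4)  # shape: [3 * hidden // 4, hidden]
```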
Accelerator Graph
- [x] Capture short kernel sequences to graph: https://github.com/microsoft/DeepSpeed/pull/4318
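PR #4318 reduces kernel-launch overhead by capturing short kernel sequences into a device graph and replaying them. The sketch below shows only the underlying PyTorch CUDA graph mechanism; DeepSpeed routes this through its accelerator abstraction, which is not reproduced here:

```python
import torch

if torch.cuda.is_available():
    static_in = torch.randn(1024, 1024, device="cuda")
    layer = torch.nn.Linear(1024, 1024).cuda()

    # Warm up on a side stream before capture (required by CUDA graphs).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            layer(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = layer(static_in)  # kernels are recorded, not executed

    static_in.copy_(torch.randn(1024, 1024, device="cuda"))
    g.replay()  # re-runs the captured kernel sequence on the new input
```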
ZeRO
- [x] params partition for skip_init: https://github.com/microsoft/DeepSpeed/pull/4722
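PR #4722 makes ZeRO-3 parameter partitioning handle modules constructed with torch.nn.utils.skip_init, which allocates parameters without running weight initialization. For context, only the PyTorch side is shown below; the ZeRO-3 partitioning hooks themselves are not:

```python
import torch.nn as nn
from torch.nn.utils import skip_init

# skip_init constructs the module and allocates its parameters without running
# the usual weight initialization (the values are left uninitialized).
layer = skip_init(nn.Linear, 4096, 4096)
print(layer.weight.shape)  # torch.Size([4096, 4096])
```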
Others
- [x] skip bcast when pipeline parallelism is enabled but pp_group_size=1: https://github.com/microsoft/DeepSpeed/pull/3915
- [x] remove duplicate check for pp and zero stage: https://github.com/microsoft/DeepSpeed/pull/4033
- [x] update ut/doc for glm/codegen: https://github.com/microsoft/DeepSpeed/pull/4057
- [x] do allgather only in shared optimizer states groups: https://github.com/microsoft/DeepSpeed/pull/4167
- [x] use non_reentrant_checkpoint to fix "requires_grad of input must be true" for activation-checkpointed layers in pipeline training: https://github.com/microsoft/DeepSpeed/pull/4224
- [x] clear redundant parameters in zero3 bwd hook: https://github.com/microsoft/DeepSpeed/pull/4520
- [x] set the default to use set_to_none for clearing gradients in BF16 optimizer (see the sketch after this list): https://github.com/microsoft/DeepSpeed/pull/5434
- [x] Use deepspeed.comm instead of torch.distributed: https://github.com/microsoft/DeepSpeed/pull/5225
- [x] Use torch.nan_to_num to replace the NumPy wrapper: https://github.com/microsoft/DeepSpeed/pull/5877
- [x] [bugfix] promote state in bf16_optimizer: https://github.com/microsoft/DeepSpeed/pull/5767
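Two of the items above map to small, well-known PyTorch patterns. As a rough illustration of what the set_to_none default (#5434) and the torch.nan_to_num switch (#5877) do at the PyTorch level (not DeepSpeed's actual code paths):

```python
import torch

# set_to_none: clearing gradients by setting them to None skips a memset and
# lets the next backward allocate fresh grads, versus zeroing them in place.
model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
model(torch.randn(2, 8)).sum().backward()

opt.zero_grad(set_to_none=True)  # grads become None (cheaper than zeroing)
print(model.weight.grad)         # None

# torch.nan_to_num replaces NaN/inf directly on the tensor,
# with no round trip through NumPy.
x = torch.tensor([float("nan"), float("inf"), 1.0])
print(torch.nan_to_num(x, nan=0.0, posinf=1e4))
```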