Ma, Guokai
> [@Antlera](https://github.com/Antlera) Thanks for this very detailed analysis! It gives a good suggestion on what the default value should be. Maybe making `ds_core_num` bigger when there is an abundant number of cores would...
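A minimal sketch of the heuristic being discussed (`ds_core_num` is the knob from this thread; `pick_ds_core_num` is just an illustrative helper, not an existing DeepSpeed API):

```python
import os

def pick_ds_core_num(num_local_workers: int, reserve: int = 2) -> int:
    """Pick a per-worker core count from the cores this process may actually use."""
    # Linux-only: honors taskset/cgroup/Slurm affinity, unlike os.cpu_count().
    available = len(os.sched_getaffinity(0))
    # Leave a couple of cores for the data loader / OS, then split the rest evenly.
    usable = max(available - reserve, 1)
    return max(usable // num_local_workers, 1)

print(pick_ds_core_num(num_local_workers=2))
```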
> [@delock](https://github.com/delock) [@sfc-gh-truwase](https://github.com/sfc-gh-truwase) Some thoughts on the auto-tuning feature. Personally, I’d lean toward a simple script that runs a dummy model to stress the CPU side. Since the main goal...
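Something like the following would probably be enough as that dummy-model stress script; a rough sketch assuming a plain PyTorch CPU matmul is an acceptable proxy for the real workload (the helper name and candidate list are illustrative):

```python
import time
import torch

def best_thread_count(candidates=(1, 2, 4, 8, 16), size=2048, iters=5):
    """Time a dummy CPU workload at several thread counts and return the fastest."""
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    timings = {}
    for n in candidates:
        torch.set_num_threads(n)     # intra-op thread count for subsequent CPU ops
        torch.matmul(a, b)           # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        timings[n] = (time.perf_counter() - start) / iters
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    best, timings = best_thread_count()
    print(f"best thread count: {best}", timings)
```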
> [@delock](https://github.com/delock) Did a very quick test in the slurm setting. It looks like the current soft fallback still has issues under Slurm. For example, I requested 32 CPU cores,...
I tried to emulate this situation with the following command:

```
taskset -c 0,4-11,21-26,30-46 deepspeed --num_gpus=2 finetune_llama.py --model_name Qwen/Qwen2.5-3B --output_dir output --lr 2e-5 --batch_size 8 --deepspeed_config zf_config.json --num_train_epochs 1
```

...
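To double-check what each rank actually sees under this emulation (and under Slurm itself), a small probe like this can be dropped into the training script; `LOCAL_RANK` is set by the DeepSpeed launcher, the rest is plain Python:

```python
import os
import socket

rank = int(os.environ.get("LOCAL_RANK", "0"))
allowed = sorted(os.sched_getaffinity(0))   # cores this process is allowed to run on
print(f"[{socket.gethostname()} rank {rank}] "
      f"os.cpu_count()={os.cpu_count()} allowed={len(allowed)} cores: {allowed}")
```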
Thanks for the detail @Antlera. Let me read the Slurm docs to see if I can find any clue. If not, then let's add `taskset` as a practical hint.
Does "DeepSpeed backend integration as the training engine for verl" means to be the default training engine for verl?
> https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/#supported-models, it looks like the qwen2.5 is not in the supported model list > > [@delock](https://github.com/delock), FYI Just verified that AutoTP supports Qwen2.5-7B; the list should be updated. Will...
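For the record, the verification followed the AutoTP tutorial pattern, roughly the sketch below (model name aside, the `init_inference` kwargs mirror the tutorial and may need adjusting to the installed DeepSpeed version), launched through the `deepspeed` launcher with 2 GPUs:

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"
world_size = int(os.environ.get("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# AutoTP path: kernel injection disabled, layers are sharded across ranks automatically.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(engine.module.device)
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))
```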
> I trained model qwen2.5-7B in GPU with llamafactory deepspeed zero2 + autotp. And there is no obvious memory reduction. > > When individually using zero2, the average memory of...
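For context, the combination in that report corresponds to a DeepSpeed config roughly like the sketch below; the `tensor_parallel` / `autotp_size` key follows the AutoTP training write-up and is an assumption to verify against the installed version:

```python
# Sketch of the configuration being compared: ZeRO-2 plus AutoTP training.
ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    # Assumed key from the AutoTP training feature; check your DeepSpeed version.
    "tensor_parallel": {"autotp_size": 2},
}
```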
Hi @phalani-paladugu, thanks for the suggestion. Agree that a multi-node fallback should be added to support multi-node inference. For single-node SHM, I notice that there is a [RISC-V implementation](https://github.com/deepspeedai/DeepSpeed/blob/b7cd78f096016ae67a11ef6292eba28e0452b4e7/csrc/cpu/comm/riscv64/shm.h), is...
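On the multi-node side, the fallback I have in mind is roughly: detect whether every rank lives on the same host and, if not, skip the SHM path and use the regular backend allreduce. A sketch (the `shm_allreduce` hook is purely illustrative, not the existing csrc/cpu/comm API):

```python
import socket
import torch
import torch.distributed as dist

def allreduce_with_fallback(tensor: torch.Tensor, shm_allreduce=None) -> torch.Tensor:
    """Use the single-node SHM fast path only when all ranks share one host."""
    hostnames = [None] * dist.get_world_size()
    dist.all_gather_object(hostnames, socket.gethostname())

    if len(set(hostnames)) == 1 and shm_allreduce is not None:
        shm_allreduce(tensor)      # hypothetical single-node SHM path
    else:
        dist.all_reduce(tensor)    # portable multi-node fallback (ccl/gloo/nccl)
    return tensor
```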
The implementation of `cpu_arch()` normally returns `-march=native`, so `x86-64-v3` looks abnormal to me. Hi @Ali-Sayed-Salehi, some debugging of this function should reveal the exact line that returns `-march=x86-64-v3`,...
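A quick probe that usually narrows this down (assuming the op builder classes are importable like this in your install; adjust the import if the path differs):

```python
# Print what the builder decides on this machine; the expectation is "-march=native",
# so anything else points at a fallback branch being taken inside cpu_arch().
from deepspeed.ops.op_builder import CPUAdamBuilder

builder = CPUAdamBuilder()
print("cpu_arch():", builder.cpu_arch())
print("cxx_args():", builder.cxx_args())
```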