Ma, Guokai

Results: 180 comments by Ma, Guokai

I built a benchmark that tests CPUAdam performance separately. The tensor sizes simulate the Qwen2.5-3B model running on multiple cards.
```
import torch
import deepspeed
from deepspeed.ops.op_builder import CPUAdamBuilder...
```
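A standalone timing harness for such a benchmark can be sketched in plain Python; the `cpu_adam_step` stub below stands in for a real `DeepSpeedCPUAdam.step()` call so the sketch stays self-contained (a real benchmark would build the optimizer via `CPUAdamBuilder` and size the tensors like one rank's shard of Qwen2.5-3B):

```python
import time

def cpu_adam_step():
    # Stand-in for a real DeepSpeedCPUAdam.step() call; replace this with
    # the actual optimizer step when deepspeed is installed.
    sum(range(100_000))  # dummy work so the harness has something to time

def bench(step_fn, warmup=3, iters=10):
    """Average wall-clock time per optimizer step, in milliseconds."""
    for _ in range(warmup):
        step_fn()          # warm up caches / JIT before measuring
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters * 1000.0

print(f"avg step time: {bench(cpu_adam_step):.3f} ms")
```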

@Antlera there is a way to know which two cores are virtual cores of the same physical core: `lscpu --extended`. The result lists the core ID of each CPU....
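The `lscpu --extended` output can be parsed programmatically to group sibling hyperthreads by physical core; a sketch (the sample output string is illustrative, real output has machine-specific columns and values):

```python
from collections import defaultdict

# Illustrative sample of `lscpu --extended` output; the CPU and CORE
# columns are what we need to group hyperthread siblings.
LSCPU_OUT = """\
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0       yes
1   0    0      1    1:1:1:1       yes
2   0    0      0    0:0:0:0       yes
3   0    0      1    1:1:1:1       yes
"""

def siblings_by_core(text):
    """Map physical core ID -> list of logical CPU IDs that share it."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    cpu_col, core_col = header.index("CPU"), header.index("CORE")
    groups = defaultdict(list)
    for line in lines[1:]:
        fields = line.split()
        groups[int(fields[core_col])].append(int(fields[cpu_col]))
    return dict(groups)

print(siblings_by_core(LSCPU_OUT))  # {0: [0, 2], 1: [1, 3]}
```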

If I understand correctly, the zenflow optimizer runs in a separate process, so this is the process that needs core binding. However, this also makes it run in parallel with pytorch,...
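Binding a separate optimizer process to its own cores can be done from inside that process; a minimal sketch using the Linux-only `os.sched_setaffinity` API (the helper name and the guard against unavailable cores are my own, not DeepSpeed's):

```python
import os

def pin_to_cores(requested):
    """Pin the calling process to the requested logical CPUs.

    Returns the set actually applied (intersected with the CPUs this
    process is allowed to use); a no-op on platforms without the API.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()  # e.g. macOS/Windows: no affinity API in os module
    allowed = os.sched_getaffinity(0)
    target = set(requested) & allowed
    if target:
        os.sched_setaffinity(0, target)
    return target

# An optimizer worker process would call this first, before doing any
# parameter updates, so it never contends with the pytorch training cores.
applied = pin_to_cores({0})
print("pinned to:", applied)
```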

@Antlera I gave the CPUAdam benchmark an update. We defined the problem statement as "Given 3B parameters in CPU memory, how do we update them as fast as possible?" Given this problem...

Hi @Antlera. Some of my thoughts on CPU affinity for DeepSpeed+ZenFlow:
1. Each ZenFlow optimizer worker needs to run on a separate set of physical CPU cores.
2. OMP_NUM_THREADS needs...
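One way to realize these points is to carve disjoint physical-core sets per worker and size OMP_NUM_THREADS to match the remainder; a sketch (function name, parameters, and split policy are illustrative, not DeepSpeed's actual API):

```python
def partition_cores(physical_cores, num_zf_workers, zf_cores_per_worker):
    """Split physical cores into disjoint sets: one per ZenFlow worker,
    with the remainder left for the DeepSpeed/pytorch ranks."""
    needed = num_zf_workers * zf_cores_per_worker
    assert needed < len(physical_cores), "not enough cores for ZenFlow workers"
    zf_sets = [
        physical_cores[i * zf_cores_per_worker:(i + 1) * zf_cores_per_worker]
        for i in range(num_zf_workers)
    ]
    ds_set = physical_cores[needed:]
    # OMP_NUM_THREADS for the training process should match its core count,
    # so OpenMP threads never spill onto the optimizer workers' cores.
    omp_threads = len(ds_set)
    return zf_sets, ds_set, omp_threads

zf, ds, omp = partition_cores(list(range(16)), num_zf_workers=2, zf_cores_per_worker=4)
print(zf)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(ds)   # [8, 9, 10, 11, 12, 13, 14, 15]
print(omp)  # 8
```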

I created a branch that implements core separation between ZenFlow workers and DeepSpeed workers: https://github.com/deepspeedai/DeepSpeed/tree/gma/zenflow_affinity I used Qwen2.5-3B to test the performance. Finetune 50 steps, then compute the average step time...

@Antlera from the logging with #7506 I observed the following:
1. In steps with an update, bwd_microstep: 1695.09 is longer. Is there an explanation for the longer bwd_microstep?
2. optimizer_transmit_time is 470ms in...

Hi @Antlera, here is the command and config file I used:
```
deepspeed --bind_cores_to_rank --num_gpus=2 finetune_llama.py --model_name Qwen/Qwen2.5-3B --output_dir output --lr 2e-5 --batch_size 8 --deepspeed_config zf_config.json --num_train_epochs 1
```
```...

@Antlera Thanks for this very detailed analysis! It gives good suggestions on what the default values should be. Maybe making `ds_core_num` bigger when there is an abundant number of cores would be...
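That default-sizing idea can be sketched as a simple heuristic: reserve a fixed minimum per ZenFlow worker, and hand everything else to DeepSpeed when cores are abundant (function name, parameters, and policy are hypothetical, not the actual `ds_core_num` logic):

```python
def default_ds_core_num(total_physical_cores, zf_workers, min_zf_cores=2):
    """Heuristic default for ds_core_num: reserve min_zf_cores physical
    cores per ZenFlow worker, give the remainder to DeepSpeed ranks."""
    reserved = zf_workers * min_zf_cores
    # Always leave DeepSpeed at least one core, even on small machines.
    return max(1, total_physical_cores - reserved)

print(default_ds_core_num(32, zf_workers=2))  # 28
print(default_ds_core_num(8, zf_workers=2))   # 4
```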

Thanks! I see optimizer_receive_params_time is very small, so in my case the optimizer time should be almost fully overlapped. Thanks for the explanation!
> Hi [@delock](https://github.com/delock). For your logs and...
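The overlap claim can be checked with a back-of-the-envelope model: when the CPU optimizer runs concurrently with backward compute, the effective step time is the max of the two rather than their sum. A sketch using the timings quoted earlier in this thread (the model itself is a simplification, ignoring transfer and synchronization costs):

```python
def step_time(compute_ms, optimizer_ms, overlapped):
    # Fully overlapped: the slower of the two dominates.
    # Not overlapped: the times simply add up.
    return max(compute_ms, optimizer_ms) if overlapped else compute_ms + optimizer_ms

print(step_time(1695, 470, overlapped=True))   # 1695
print(step_time(1695, 470, overlapped=False))  # 2165
```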