Tingfeng Lan
## Benchmark on CPU binding methods

Hi @delock. Thanks for confirming the command. Please see my benchmark for CPU core binding and overhead breakdown. The overhead is shown as the...
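For anyone who wants to reproduce a similar comparison, here is a minimal sketch of this kind of binding-overhead measurement (illustrative only, not the exact benchmark script; `bind_to_cores` and the matmul workload are stand-ins for the CPU-side optimizer work):

```python
import time

import psutil
import torch


def bind_to_cores(cores):
    # Restrict the current process (and the threads it spawns) to the given cores.
    psutil.Process().cpu_affinity(list(cores))


def cpu_step(n=4096):
    # Stand-in CPU workload; a real run would time the actual CPU optimizer step instead.
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    return a @ b


if __name__ == "__main__":
    all_cores = psutil.Process().cpu_affinity()
    for label, cores in [("unbound", all_cores), ("bound-8", all_cores[:8])]:
        bind_to_cores(cores)
        start = time.perf_counter()
        cpu_step()
        print(f"{label}: {time.perf_counter() - start:.3f}s on {len(cores)} cores")
```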
Another quick comment on a potential bug. The default `--bind_to_rank` implementation using `numactl` can be problematic for Slurm users, since they only have access rights to a subset of cores...
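A small sketch of the failure mode (illustrative only, not the actual binding code; the `LOCAL_RANK`/`LOCAL_SIZE` env var names are assumptions): splitting by the node's physical core count can select core IDs that the Slurm cgroup never granted, and binding to them then fails.

```python
import os

import psutil

# Env var names are illustrative; the local rank / local world size may come from
# different variables depending on the launcher.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
local_size = int(os.environ.get("LOCAL_SIZE", "1"))

physical_cores = psutil.cpu_count(logical=False) or psutil.cpu_count() or 1  # whole node
allowed = set(psutil.Process().cpu_affinity())  # only the cores Slurm actually granted

per_rank = max(1, physical_cores // local_size)
wanted = list(range(local_rank * per_rank, (local_rank + 1) * per_rank))

outside = [c for c in wanted if c not in allowed]
if outside:
    print(f"rank {local_rank}: binding would request cores outside the cgroup: {outside[:8]} ...")
```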
Hi @delock. Regarding your logs and questions: the transmit throughput looks a bit slow here, only around 12 GB/s. On my side I usually see ~200 ms for this stage...
> > [@Antlera](https://github.com/Antlera) Thanks for this very detailed analysis! It gives a good suggestion on what the default value should be. Maybe make `ds_core_num` bigger when there is an abundant number of cores...
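One way the default could scale with the core count (just a sketch of the heuristic; the name `default_ds_core_num` and the cap value are made up for illustration, not a tuned proposal):

```python
import psutil


def default_ds_core_num(local_world_size: int, cap: int = 8) -> int:
    # Split only the cores this process can actually see among the local ranks,
    # and cap the per-rank share so we do not grab the whole node by default.
    visible = len(psutil.Process().cpu_affinity())
    return max(1, min(cap, visible // max(1, local_world_size)))
```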
@delock Thanks for implementing the soft fallback in #7506. I’ll run a quick test on it soon.
@delock @sfc-gh-truwase Some thoughts on the auto-tuning feature. Personally, I’d lean toward a simple script that runs a dummy model to stress the CPU side. Since the main goal is...
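Something along these lines is what I have in mind for the dummy stress script (a sketch only; the workload and sweep values are placeholders): sweep the CPU thread count over an optimizer-like update and keep the smallest count past which the step time stops improving.

```python
import time

import torch


def cpu_step(params, grads, lr=1e-3):
    # Stand-in for the CPU-side optimizer work: a plain SGD update.
    for p, g in zip(params, grads):
        p.add_(g, alpha=-lr)


def sweep(thread_counts=(1, 2, 4, 8, 16), numel=50_000_000, iters=5):
    params = [torch.randn(numel)]
    grads = [torch.randn(numel)]
    timings = {}
    for n in thread_counts:
        torch.set_num_threads(n)
        start = time.perf_counter()
        for _ in range(iters):
            cpu_step(params, grads)
        timings[n] = (time.perf_counter() - start) / iters
        print(f"{n:>2} threads: {timings[n] * 1e3:.1f} ms per step")
    return timings
```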
@delock Did a very quick test in a Slurm setting. It looks like the current soft fallback still has issues under Slurm. For example, I requested 32 CPU cores, but...
Maybe for the fallback case it would be safer to base the core split on the CPUs visible to the current process (e.g. `num_cores = len(psutil.Process().cpu_affinity())`) instead of relying on...
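For illustration, a minimal sketch of that fallback (assuming `psutil` is available, with `os.sched_getaffinity` / `os.cpu_count` as further fallbacks where it is not):

```python
import os


def visible_core_count() -> int:
    # Count only the CPUs the current process is allowed to run on, which under
    # Slurm is the cgroup-restricted set rather than the whole node.
    try:
        import psutil
        return len(psutil.Process().cpu_affinity())
    except Exception:
        try:
            return len(os.sched_getaffinity(0))  # Linux only
        except AttributeError:
            return os.cpu_count() or 1
```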
@delock I used `deepspeed --num_gpus=$GPUS_PER_NODE --master_port $MASTER_PORT finetune_llama.py`. Let me double-check I am at the right branch head.
> [@delock](https://github.com/delock) I used `deepspeed --num_gpus=$GPUS_PER_NODE --master_port $MASTER_PORT finetune_llama.py`. Let me double-check I am at the right branch...

I am currently at commit `744399e` [Merge branch 'master' into gma/zenflow_affinity](https://github.com/deepspeedai/DeepSpeed/pull/7506/commits/744399e096313e7f0eb18026e503ce3a6cf81829).