Set CPU affinity and membind for better OOB performance
Sets the number of threads and the CPU affinity to the number of physical CPUs in a socket, and performs memory binding to a single NUMA node, for better out-of-box (OOB) performance.
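Roughly, the idea looks like the following minimal sketch (not the exact utility code; it assumes `psutil`, the Python `numa` bindings whose `set_membind` call appears later in this thread, a Linux host, and a dual-socket layout):

```python
import os

import psutil
import torch
import numa  # Python libnuma bindings; numa.set_membind is the call used in this PR

physical_cores = psutil.cpu_count(logical=False)  # e.g. 112 on 2 sockets x 56 cores
num_sockets = 2                                   # assumption for this sketch
cores_per_socket = physical_cores // num_sockets

# Respect user-provided settings; only apply defaults when unset.
if "OMP_NUM_THREADS" not in os.environ:
    os.environ["OMP_NUM_THREADS"] = str(cores_per_socket)
    torch.set_num_threads(cores_per_socket)

socket_id = 0  # NUMA node / socket chosen for this process
first_core = socket_id * cores_per_socket
# Pin this process to the physical cores of one socket...
os.sched_setaffinity(0, range(first_core, first_core + cores_per_socket))
# ...and allocate its memory on the matching NUMA node only.
numa.set_membind([socket_id])
```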
@sywangyi @jiqing-feng please review
Hi, generally I think such logic should not be put in model code; it should be in framework code like transformers, accelerate, deepspeed... Model code should also be adapted to variant usage like AutoTP for inference: if the IPEX model is running with TP=2, this logic is wrong in that case. @yao-matrix
Hi @rbrugaro. The issue still exists. I think you should check the world size (rank number) first to know how many processes you have, and then you can choose the best binding strategy for each rank.
We plan to integrate the best core-binding strategy into the IPEX backend, so other users can call this function to get the best performance of IPEXModel on CPU. cc @echarlaix
@jiqing-feng @sywangyi I updated the code to read world_size and rank from environment variables and to set good defaults only if they are unset by the user (a sketch of the detection logic follows the logs below).

- 50% OOB improvement in generation time on 4th Gen Intel(R) Xeon(R) for 1 process (gpt2)
- 27% OOB improvement in generation time on 4th Gen Intel(R) Xeon(R) for 2 processes (gpt2)

Note: no OOB improvement was seen when 2 processes were launched with the accelerate launcher.
1 process:
setting OMP_NUM_THREADS to 56
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55}, membind = {0}
2 processes:
setting OMP_NUM_THREADS to 56
setting OMP_NUM_THREADS to 56
affinity={56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111}, membind = {1}
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55}, membind = {0}
Note: when the multi-process run is launched with `accelerate launch text_gen2.py`, the accelerate launcher defaults the number of threads and the affinity to the number of logical cores, and I didn't see any improvement. Maybe the accelerate OOB defaults need to be updated?
$ accelerate launch text-gen2.py
2024-08-07 18:29:35,280 - launch.py - accelerate.commands.launch - WARNING - The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `112` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
My guessed rank = 1
My guessed rank = 0
OMP_NUM_THREADS already set to 112
OMP_NUM_THREADS already set to 112
affinity={56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223}, membind = {1}
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167}, membind = {0}
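For reference, the "guessed rank" above comes from probing launcher environment variables; a hedged sketch of that detection (the exact variable list in the utility may differ):

```python
import os

def guess_rank_and_world_size():
    # Different launchers export different variables: torchrun/accelerate set
    # RANK/WORLD_SIZE, MPI launchers set PMI_RANK/PMI_SIZE. Fall back to a
    # single-process default when none are present.
    rank = int(os.environ.get("RANK", os.environ.get("PMI_RANK", "0")))
    world_size = int(os.environ.get("WORLD_SIZE", os.environ.get("PMI_SIZE", "1")))
    return rank, world_size
```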
Usually we only bind physical cores, which are 0-55 and 56-111 in your case. Memory binding is also required. It's weird that accelerate launch shows no speed-up compared to the plain Python command. Do you have any clue about it? cc @sywangyi
@jiqing-feng @rbrugaro, this could only happen if local_world_size in accelerate is equal to 1: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/environment.py#L213 and https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L239. It seems MPI is not launched in the env. Have you followed https://github.com/huggingface/accelerate/blob/main/docs/source/usage_guides/ipex.md?
@sywangyi @jiqing-feng It turned out to be a bug in accelerate: it was setting the wrong `num_cpu_threads_per_process` when the script was launched with `accelerate launch myscript.py`, because it read an unset environment variable. It worked fine when launched with `mpirun -f hostfile -n 4 -ppn 2 accelerate launch myscript.py`.
I have opened accelerate PR #3009 for it.
Now I can see a 15% speed-up from the NUMA bindings.
@sywangyi @jiqing-feng @echarlaix I have addressed the comments and accelerate PR #3009 has been merged. Please review.
Hi @IlyasMoutawwakil. Could you merge this PR? Thx!
@sywangyi @jiqing-feng I pushed one commit that remained on my local machine; the results above correspond to it. I double-checked. Example: 2 nodes, 4 processes, rank_per_node=2:

rank_id=0 -> node_id = int((rank_id+1) / rank_per_node) = 0
rank_id=1 -> node_id = int((rank_id+1) / rank_per_node) = 1
numa.set_membind([node_id])
In the previous logic, with 2 NUMA nodes and 4 processes on one device: rank_id 0 and 1 -> node_id 0; rank_id 2 and 3 -> node_id 1.
In your modified logic, with 2 NUMA nodes and 4 processes on one device: rank 0 -> node 0, rank 1 -> node 1, rank 2 -> node 1, rank 3 -> node 2?
You are correct @sywangyi; I had missed it because I was testing with 2 machines and 2 processes per node. I've changed it so the NUMA node_id is calculated from local_size, not the world size across machines (a sketch of the corrected mapping is below). I have tested with 2 nodes and num_processes 2, 4, and 8, and the NUMA affinity and binding work as expected.
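For illustration, a sketch of the corrected mapping (the function and parameter names are assumptions; only `numa.set_membind` is taken from this thread): spread the local ranks evenly over the machine's NUMA nodes instead of deriving the node from the global rank.

```python
import numa  # numa.set_membind as used in this PR

def bind_rank_to_numa_node(local_rank: int, local_size: int, num_numa_nodes: int) -> int:
    """Map each local rank to a NUMA node, spreading ranks evenly.

    E.g. local_size=4, num_numa_nodes=2 -> ranks 0,1 on node 0; ranks 2,3 on node 1.
    """
    ranks_per_node = max(1, local_size // num_numa_nodes)
    node_id = min(local_rank // ranks_per_node, num_numa_nodes - 1)
    numa.set_membind([node_id])  # bind this rank's memory to its NUMA node
    return node_id
```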
@IlyasMoutawwakil thanks for the comments; please check whether I have addressed them properly.
This utility assigns a default system configuration that works well for the most common scenarios. However, it may not be optimal for every situation or machine configuration, since performance varies with the model, the task, and other processes running concurrently on the host.
Users who need more granular control can continue to use `numactl <numa-args> python script.py <args>` or `python -m torch.backends.xeon.run_cpu <xeon-args> script.py <args>`.
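For example (an illustrative invocation, not taken from this PR), `numactl --cpunodebind=0 --membind=0 python script.py` pins the process to NUMA node 0's cores and memory.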
What we find is that some users don't use numactl or configure the environment, and they miss out on performance. The utility is designed so that any environment/system configuration set by the user takes precedence; only if the user does not specify these settings does the utility apply a default configuration that improves OOB performance.
Specifically, the utility sets OMP_NUM_THREADS and torch.num_threads to an equal split of the available physical cores, divided across the ranks running on the machine, and performs NUMA affinity and memory binding to avoid the overhead of accessing memory on remote NUMA nodes.
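As a worked example with the numbers from the logs above (a sketch; the utility's exact formula may differ):

```python
# 4th Gen Xeon from the logs: 2 sockets x 56 physical cores
total_physical_cores = 112
local_ranks = 2                                    # processes launched on this machine
omp_threads = total_physical_cores // local_ranks  # -> 56, matching the logs
# rank 0 -> cores 0-55   with membind to node 0
# rank 1 -> cores 56-111 with membind to node 1
```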
@IlyasMoutawwakil I fixed the last style/quality issue and the checks are passing. If you think it's ready, please merge. Thanks for all the help!