Set CPU affinity and membind for better OOB performance

rbrugaro opened this pull request 1 year ago

Sets the number of threads and CPU affinity to the number of physical CPUs in a socket and performs memory binding to a single NUMA node for better out-of-the-box (OOB) performance.
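For context, the idea is roughly the following (a minimal sketch, not the PR code; it assumes the numa binding used for set_membind later in this thread, psutil for counting physical cores, and a 2-socket machine):

# Sketch: pin a single process to the physical cores of socket 0 and
# bind its memory to NUMA node 0. Illustrative only, not the PR code.
import os
import psutil
import torch
import numa  # same binding this PR uses for numa.set_membind

physical_cores = psutil.cpu_count(logical=False)  # e.g. 112 on the 2-socket machine in this thread
cores_per_socket = physical_cores // 2            # assumes 2 sockets

os.environ.setdefault("OMP_NUM_THREADS", str(cores_per_socket))
torch.set_num_threads(cores_per_socket)

os.sched_setaffinity(0, set(range(cores_per_socket)))  # physical cores of socket 0
numa.set_membind([0])                                  # memory on NUMA node 0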

rbrugaro avatar Jul 29 '24 22:07 rbrugaro

@sywangyi @jiqing-feng please review

rbrugaro avatar Jul 29 '24 22:07 rbrugaro

Hi, generally I think such logic should not be put in model code; it should live in framework code like transformers, accelerate, deepspeed... Also, model code should adapt to variant usage like AutoTP for inference. If the IPEX model is running with 2-way TP, this logic would be wrong in that case. @yao-matrix

sywangyi avatar Jul 30 '24 01:07 sywangyi

Hi @rbrugaro. The issue still exists. I think you should check the world size (number of ranks) first to know how many processes you have, and then you can choose the best binding strategy for each rank.

jiqing-feng avatar Aug 06 '24 02:08 jiqing-feng

We plan to integrate the best core-binding strategy into the ipex backend, so other users can call this function to get the best performance of IPEXModel on CPU. cc @echarlaix

jiqing-feng avatar Aug 06 '24 02:08 jiqing-feng

@jiqing-feng @sywangyi I updated the code to read world_size and rank from environment variables and to set good default settings only if the user has left them unset. Results on 4th Gen Intel(R) Xeon(R) with gpt2:

- 50% OOB improvement in generation time with 1 process
- 27% OOB improvement in generation time with 2 processes

Note: no OOB improvement was seen when the 2 processes were launched with the accelerate launcher.
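The pattern is roughly this (an illustrative sketch, not the PR code; RANK/WORLD_SIZE are assumed variable names, and 56 matches the physical cores per socket in the logs below):

# Sketch: respect user-set values, otherwise apply per-rank defaults.
import os

rank = int(os.environ.get("RANK", 0))             # assumed env var names
world_size = int(os.environ.get("WORLD_SIZE", 1))

if "OMP_NUM_THREADS" in os.environ:
    print("OMP_NUM_THREADS already set to ", os.environ["OMP_NUM_THREADS"])
else:
    cores_per_socket = 56                 # physical cores per socket on this machine
    threads = cores_per_socket            # one socket's worth per rank for 1-2 ranks
    os.environ["OMP_NUM_THREADS"] = str(threads)
    print(f"setting OMP_NUM_THREADS to {threads}")
    # pin this rank to its own contiguous block of physical cores
    os.sched_setaffinity(0, set(range(rank * threads, (rank + 1) * threads)))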

1 process:

setting OMP_NUM_THREADS to 56
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55}, membind = {0}

2 process:

setting OMP_NUM_THREADS to 56
setting OMP_NUM_THREADS to 56
affinity={56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111}, membind = {1}
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55}, membind = {0}

Note: if the multi-process run is launched with accelerate launch text_gen2.py, the accelerate launcher sets its defaults for the number of threads and affinity to the number of logical cores, and I didn't see any improvement. Maybe accelerate's OOB defaults need to be updated?

$ accelerate launch text-gen2.py 
2024-08-07 18:29:35,280 - launch.py - accelerate.commands.launch - WARNING - The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `112` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
My guessed rank = 1
My guessed rank = 0
OMP_NUM_THREADS already set to  112
OMP_NUM_THREADS already set to  112
affinity={56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223}, membind = {1}
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167}, membind = {0}

rbrugaro avatar Aug 07 '24 18:08 rbrugaro

Usually we only bind physical cores, which are 0-55 and 56-111 in your case. Memory binding is also required. It's weird that accelerate launch shows no speed-up compared to the plain Python command. Do you have any clue about it? cc @sywangyi
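For reference, one quick way to see the physical vs. logical core counts this binding is based on (illustrative; uses psutil, which the PR may or may not rely on):

# Physical vs. logical core counts on the 2-socket machine above.
import psutil

print(psutil.cpu_count(logical=False))  # 112 physical cores (0-55 and 56-111)
print(psutil.cpu_count(logical=True))   # 224 logical CPUs with hyper-threading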

jiqing-feng avatar Aug 08 '24 03:08 jiqing-feng

@jiqing-feng @rbrugaro, this could only happen if local_world_size in accelerate is equal to 1: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/environment.py#L213 and https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L239. It seems MPI is not launched in the env. Have you followed https://github.com/huggingface/accelerate/blob/main/docs/source/usage_guides/ipex.md?

sywangyi avatar Aug 12 '24 00:08 sywangyi

@sywangyi @jiqing-feng Turned out to be a bug in accelerate that set the wrong num_cpu_threads_per_process when the script was launched with accelerate launch myscript.py, because it was reading an unset env variable. It worked OK when launched with mpirun -f hostfile -n 4 -ppn 2 accelerate launch myscript.py. I have opened accelerate PR #3009 for it. Now I can see a 15% speed-up from the NUMA bindings.

rbrugaro avatar Aug 13 '24 07:08 rbrugaro

@sywangyi @jiqing-feng @echarlaix I have addressed the comments and accelerate PR #3009 has been merged. Please review.

rbrugaro avatar Aug 15 '24 16:08 rbrugaro

Hi @IlyasMoutawwakil, could you merge this PR? Thx!

jiqing-feng avatar Aug 20 '24 04:08 jiqing-feng

@sywangyi @jiqing-feng I pushed one commit that had remained in my local checkout; the results above correspond to it. I double checked. Example: 2 nodes, 4 processes, rank_per_node=2:

rank_id=0: node_id = int((rank_id+1) / rank_per_node) = 0
rank_id=1: node_id = int((rank_id+1) / rank_per_node) = 1

numa.set_membind([node_id])

rbrugaro avatar Aug 20 '24 05:08 rbrugaro

In the previous logic, with 2 nodes and 4 processes on one device: rank_id 0 and 1 go to node_id 0, and rank_id 2 and 3 go to node_id 1.

In your modified logic, with 2 nodes and 4 processes on one device: rank 0 goes to node 0, rank 1 to node 1, rank 2 to node 1, and rank 3 to node 2?

sywangyi avatar Aug 20 '24 06:08 sywangyi

You are correct @sywangyi, I had missed it because I was testing with 2 machines and 2 processes per node. I've changed it so the NUMA node_id is calculated from the local size rather than the world size across machines. I have tested with 2 nodes and num_processes 2, 4, and 8, and the NUMA affinity and binding work as expected.
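In other words, the mapping is along these lines (a sketch under assumed names; local_rank and local_size would come from the launcher's environment, and 2 NUMA nodes per machine is an assumption):

# Sketch: map each local rank to a NUMA node using the per-machine
# process count rather than the global world size. Illustrative only.
import numa  # binding used in this PR for set_membind

num_numa_nodes = 2            # NUMA nodes on this machine (one per socket)
local_rank = 0                # this process's rank on this machine
local_size = 4                # processes launched on this machine

ranks_per_node = max(local_size // num_numa_nodes, 1)
node_id = local_rank // ranks_per_node   # ranks 0,1 -> node 0; ranks 2,3 -> node 1
numa.set_membind([node_id])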

rbrugaro avatar Aug 20 '24 07:08 rbrugaro

@IlyasMoutawwakil thanks for the comments, please check whether I have addressed them properly.

This utility assigns a default system configuration that works well for most common scenarios. However, it may not be optimal for every situation or machine configuration, since performance can vary with the model, the task, and other concurrent processes on the host. Users who need more granular control can continue to use numactl <numa-args> python script.py <args> or python -m torch.backends.xeon.run_cpu <xeon-args> script.py <args>.

What we find is that some users don't use numactl or configure the environment, and they miss out on performance. The utility is designed so that any environment/system configuration set by the user takes precedence, but if the user does not specify the settings, the utility applies a default configuration that improves OOB performance.

Specifically, the utility sets OMP_NUM_THREADS and torch.num_threads to an equal split of the available physical cores across the ranks running on the machine, and it applies NUMA affinity and memory binding to avoid the overhead of accessing memory on remote NUMA nodes.
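Put together, the defaults described above amount to something like this per-rank setup (an illustrative sketch, not the merged implementation; local_rank, local_size, and the 2-NUMA-node layout are assumptions):

# Sketch of the described defaults for one rank: an equal slice of the
# physical cores plus memory binding to the matching NUMA node.
import os
import psutil
import torch
import numa

local_rank, local_size = 0, 2                  # this rank / ranks on this machine
physical = psutil.cpu_count(logical=False)     # e.g. 112
cores_per_rank = physical // local_size        # e.g. 56

# Equal split of physical cores across the ranks on this machine,
# only applied when the user hasn't already set OMP_NUM_THREADS.
os.environ.setdefault("OMP_NUM_THREADS", str(cores_per_rank))
torch.set_num_threads(cores_per_rank)
start = local_rank * cores_per_rank
os.sched_setaffinity(0, set(range(start, start + cores_per_rank)))

# Bind memory to the NUMA node that owns those cores.
num_numa_nodes = 2
node_id = local_rank // max(local_size // num_numa_nodes, 1)
numa.set_membind([node_id])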

rbrugaro avatar Aug 21 '24 22:08 rbrugaro

@IlyasMoutawwakil I fixed the last style/quality issue and the checks are passing. If you think it's ready, please merge. Thanks for all the help!

rbrugaro avatar Aug 27 '24 16:08 rbrugaro