[CPU support] Optionally bind each rank to different cores on host

Open delock opened this issue 2 years ago • 3 comments

This PR adds two command line options to the deepspeed command to help support the CPU as a virtual accelerator, utilizing the vector and tensor computation provided by processors with the AVX2/AVX512/AMX instruction sets. This is being experimented with at https://github.com/intel/intel-extension-for-deepspeed/tree/cpu-backend#cpu

This PR allows the user to bind each rank to a different set of CPU cores by adding the option --bind_cores_to_rank. When this command line option is supplied, all cores on the host are distributed evenly among the ranks. For example, suppose the host has 2 sockets with 20 cores each. When we launch two DeepSpeed instances with the deepspeed command as shown below, cores 0-19 will be assigned to the first instance and cores 20-39 to the second. In addition, the OMP_NUM_THREADS environment variable of each DeepSpeed instance will be set to the number of cores assigned to its rank.

deepspeed --num_gpus 2 --bind_cores_to_rank <launch command>
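
For illustration, here is a minimal Python sketch of the even distribution described above (the helper name split_cores_evenly is hypothetical, not the actual launcher code):

def split_cores_evenly(total_cores, num_ranks):
    # Give each rank a contiguous, equally sized slice of the host's cores.
    cores_per_rank = total_cores // num_ranks
    return [list(range(r * cores_per_rank, (r + 1) * cores_per_rank))
            for r in range(num_ranks)]

# Example: 2 sockets x 20 cores, 2 ranks -> cores 0-19 and cores 20-39.
for rank, cores in enumerate(split_cores_evenly(40, 2)):
    # The launcher would also export OMP_NUM_THREADS=len(cores) for this rank.
    print(f"rank {rank}: cores {cores[0]}-{cores[-1]}, OMP_NUM_THREADS={len(cores)}")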

This option is useful when running inference on a CPU host with multiple NUMA domains or sub-NUMA domains: by setting --num_gpus to the number of NUMA domains, each DeepSpeed instance accesses memory within its own NUMA domain and thus gets better latency to host memory. Moreover, setting OMP_NUM_THREADS to the core count per rank avoids over- or under-subscribing the OMP threads used by PyTorch.

The second option, --bind_core_list, allows the user to specify a subset of cores on the system as comma-separated numbers and ranges, as in the following example. This lets the user reserve some cores for other processes, avoiding a performance impact on them.

deepspeed --num_gpus 2 --bind_cores_to_rank --bind_core_list 0-18,20-38 <launch command>
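
A minimal sketch of how such a comma-separated list of numbers and ranges could be parsed (parse_core_list is a hypothetical helper for illustration, not necessarily DeepSpeed's implementation):

def parse_core_list(spec):
    # "0-18,20-38" -> [0, 1, ..., 18, 20, 21, ..., 38]
    cores = []
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cores.extend(range(int(lo), int(hi) + 1))
        else:
            cores.append(int(part))
    return cores

print(len(parse_core_list("0-18,20-38")))  # 38 cores; cores 19 and 39 stay free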

When using the CPU accelerator, these options can be combined with tensor parallelism in inference to accelerate workloads on multi-socket systems, utilizing the compute power of both sockets for LLM inference to improve performance.

By default, both options are turned off. The user can turn on the first when using the CPU accelerator, and the second when fine-grained control of core utilization is desired.

These options are placed in the common part of DeepSpeed, rather than in the CPU backend, for two reasons:

  1. Core binding needs to happen before each subprocess is created, so the launcher is the best place to implement it.
  2. Binding a workload process to a core list is a common technique to improve efficiency, reducing cache misses and scheduling cost, so we propose making these options available for GPU as well.

Without this PR, when using the CPU as a virtual accelerator, we might encounter the following issues:

  1. Different DeepSpeed instances cannot be bound to different sets of cores, so they compete for compute cores and performance is impacted.
  2. The user needs to know CPU-specific techniques, such as setting OMP_NUM_THREADS for PyTorch ops to get better performance on the CPU; now this is done automatically.

delock avatar Feb 23 '23 13:02 delock

--bind_cores_to_rank has been changed to 'store_true', so its behavior is the same as other boolean parameters.
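
For reference, a minimal argparse sketch of what 'store_true' implies (a generic illustration, not the exact launcher code): passing the flag sets the value to True, omitting it leaves the default False.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--bind_cores_to_rank', action='store_true')
print(parser.parse_args(['--bind_cores_to_rank']).bind_cores_to_rank)  # True
print(parser.parse_args([]).bind_cores_to_rank)  # False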

delock avatar Mar 02 '23 14:03 delock

When calling numactl, if all cores specified by -C belong to the same NUMA domain X, -m X is added to the numactl arguments to bind memory allocation as well. We observe a slight performance improvement for CPU inference on a 2-socket machine.
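
A minimal sketch of that decision (build_numactl_args and numa_domains are hypothetical names for illustration, not the actual implementation; numa_domains maps each domain id to its set of cores):

def build_numactl_args(rank_cores, numa_domains):
    args = ['numactl', '-C', ','.join(str(c) for c in rank_cores)]
    for domain, cores in numa_domains.items():
        if set(rank_cores) <= cores:
            # All of this rank's cores fall in one NUMA domain:
            # bind memory allocation to that domain as well.
            args += ['-m', str(domain)]
            break
    return args

# Example: cores 0-19 all live in NUMA domain 0 -> '-m 0' is appended.
print(build_numactl_args(range(20), {0: set(range(56)), 1: set(range(56, 112))}))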

We will have access to a machine with sub-NUMA clustering (SNC) and will run numactl --hardware there to check whether the NUMA parsing logic still applies. Currently we run numactl --hardware and parse the output to determine how many NUMA domains the machine has and which cores belong to each domain. The function get_numa_cores parses this output and returns a list of core lists, one per NUMA domain (a sketch of this parsing follows the sample output below).

(dscpu) $ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 0 size: 515532 MB
node 0 free: 152936 MB
node 1 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 1 size: 516017 MB
node 1 free: 135459 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
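
A minimal sketch of parsing logic along these lines (illustrative only; the actual get_numa_cores in DeepSpeed may differ, and running it requires numactl to be installed):

import subprocess

def get_numa_cores_sketch():
    # Collect the 'node N cpus: ...' lines from numactl --hardware into
    # one core list per NUMA domain.
    output = subprocess.check_output(['numactl', '--hardware']).decode()
    core_lists = []
    for line in output.splitlines():
        if line.startswith('node') and 'cpus:' in line:
            core_lists.append([int(c) for c in line.split('cpus:')[1].split()])
    return core_lists

# On the machine above, this returns two lists, one per NUMA node.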

delock avatar Mar 11 '23 03:03 delock

The behavior is the same on a machine with sub-NUMA clustering (SNC), so the -m X binding works both for multi-socket machines and for machines with SNC.

delock avatar Mar 14 '23 09:03 delock