[Aurora] GPU pipelining failure/performance degradation
When using mpich/opt/develop-git.6037a7a on Aurora, I notice that the following test crashes for large message sizes (~64 MB) when GPU pipelining is turned on. For smaller message sizes, there is a performance regression.
mpich/opt/4.2.3-intel does not seem to have this issue.
This is the performance up to 4 MB using the two versions:
| Message size (bytes) | GPU pipeline, develop-git.6037a7a (MB/s) | GPU pipeline, 4.2.3-intel (MB/s) |
|---|---|---|
| 1 | 1 | 1.61 |
| 2 | 1.59 | 3.23 |
| 4 | 3.19 | 6.45 |
| 8 | 6.38 | 12.96 |
| 16 | 12.78 | 25.89 |
| 32 | 25.57 | 51.75 |
| 64 | 50.78 | 100.14 |
| 128 | 36.33 | 123.29 |
| 256 | 94.03 | 121.6 |
| 512 | 98.78 | 134.72 |
| 1024 | 106.68 | 143.59 |
| 2048 | 111.51 | 732.25 |
| 4096 | 113.87 | 1472.69 |
| 8192 | 109.69 | 2957.42 |
| 16384 | 110.34 | 5861.81 |
| 32768 | 107.82 | 11590.16 |
| 65536 | 21586.53 | 21417.23 |
| 131072 | 28338.11 | 34983.17 |
| 262144 | 30891.85 | 43363.4 |
| 524288 | 32059.07 | 46077.84 |
| 1048576 | 28726.47 | 46744.3 |
| 2097152 | 29452.7 | 47071.55 |
| 4194304 | 29858.49 | 47247.48 |
This is the test:
```bash
export FI_CXI_RDZV_THRESHOLD=131072
export EnableImplicitScaling=0
export NEOReadDebugKeys=1
export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
export MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1
# Enable GPU pipelining
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_THRESHOLD=0
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_NUM_BUFFERS_PER_CHUNK=4
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_MAX_NUM_BUFFERS=4
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_D2H_ENGINE_TYPE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_H2D_ENGINE_TYPE=1

mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 ~/gpu_wrappers/2-2.sh $PATH_TO_OSU/pt2pt/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d ze D D
```
The wrapper script used here is:
```bash
#!/bin/bash
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE

# Map each local rank to a GPU tile and a preferred NIC
if [ $PALS_LOCAL_RANKID -eq 0 ]
then
    AFFINITY_MASK=0.0
    NIC_NUM=cxi0
elif [ $PALS_LOCAL_RANKID -eq 1 ]
then
    AFFINITY_MASK=1.0
    NIC_NUM=cxi1
fi

echo "[I am rank $PALS_RANKID] Localrank=$PALS_LOCAL_RANKID : Affinity mask = $AFFINITY_MASK, PREFERRED_NIC = $NIC_NUM"
export ZE_AFFINITY_MASK=$AFFINITY_MASK
export FI_CXI_DEVICE_NAME=$NIC_NUM

# Invoke the main program, preserving argument quoting
"$@"
```
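As an aside, the same per-rank mapping can be written table-driven so it extends to more ranks per node. This is only a sketch: the tile masks and NIC names below are illustrative placeholders, not a validated Aurora mapping.

```bash
#!/bin/bash
# Sketch: table-driven variant of the wrapper above. Map local rank -> GPU tile + NIC
# via arrays instead of an if/elif chain. Values are placeholders for illustration.
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE

GPU_MASKS=(0.0 1.0)   # one tile per local rank; extend for more ranks per node
NICS=(cxi0 cxi1)      # preferred NIC per local rank

lr=${PALS_LOCAL_RANKID:-0}
export ZE_AFFINITY_MASK=${GPU_MASKS[$lr]}
export FI_CXI_DEVICE_NAME=${NICS[$lr]}

echo "[I am rank $PALS_RANKID] Localrank=$lr : Affinity mask = $ZE_AFFINITY_MASK, PREFERRED_NIC = $FI_CXI_DEVICE_NAME"
exec "$@"
```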
@rithwiktom It will be impossible to suggest configurations or verify regressions if we have to use dozens of non-default env variables. Please open a PR documenting the formally supported Intel defaults for these Intel GPU knobs.
We should consider whether setting MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1 by default is the right thing to do, rather than requiring users (or modules) to set it on Aurora and other systems.
https://github.com/pmodels/mpich/pull/7516 should fix the crash.
Did you also check the performance?
> We should consider whether setting `MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1` by default is the right thing to do, rather than requiring users (or modules) to set it on Aurora and other systems.
It used to be the case that using immediate command lists caused some workloads to crash. It seemed to be an L0 issue, so this was left off by default. However, with the new UMD the bug may have been fixed, so we should re-evaluate it.
@zhenggb72 It would be good, if there are such cases where some of these values cannot be set by default, that specific JIRAs are opened and addressed. We won't be able to track regressions with non-standard options.
> Did you also check the performance?
I got ~15 GB/s. Haven't investigated further yet.
Yeah, this was many years ago. It was a very complicated bug involving the workload, and it took the L0 team probably almost a year to track down. On the other hand, a regular command list can buffer commands and submit them only once, which may actually reduce submission overhead, so there are pros and cons. Note that as L0 keeps optimizing immediate command lists, things may also have changed. That is why I suggest we re-evaluate the default.
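To support that re-evaluation, a simple A/B sweep over just this CVAR, on top of the reproducer environment at the top of this issue, could look like the sketch below ($PATH_TO_OSU and the wrapper path are the same assumptions as in the original report):

```bash
# Sketch: rerun the same OSU multi-bandwidth test with immediate command lists
# off and on, keeping every other setting identical, to isolate this one CVAR.
for icl in 0 1; do
    export MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=$icl
    echo "=== MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=$icl ==="
    mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 \
        ~/gpu_wrappers/2-2.sh $PATH_TO_OSU/pt2pt/osu_mbw_mr \
        -m 1:67108864 -i 100 -x 20 -d ze D D
done
```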
@rithwiktom The original pipeline algorithm will be replaced in https://github.com/pmodels/mpich/pull/7529. Could you test and evaluate the performance of PR 7529? You need to set MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD to enable the pipeline path in the new PR.
I suspect https://github.com/pmodels/mpich/pull/7168/commits/05883b6a6c652bb1cbdcd81f68fa34c9f27e0445 is the cause of the performance change, at least in the low-to-medium range. The jump at 32768 is indicative of the MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_D2H threshold added in that commit.
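One way to check that hypothesis, assuming the CVAR acts as a simple size cutoff, is to rerun the mid-range sizes with the cutoff moved and see whether the transition around 32-64 KiB moves with it (a sketch; the cutoff value is illustrative):

```bash
# Sketch: if the bandwidth transition near 32-64 KiB is caused by the D2H fast-copy
# cutoff added in that commit, changing the cutoff should shift the transition point.
MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_D2H=1048576 \
mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 \
    ~/gpu_wrappers/2-2.sh $PATH_TO_OSU/pt2pt/osu_mbw_mr \
    -m 128:1048576 -i 100 -x 20 -d ze D D
```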
Hi @hzhou, I'm leaving Intel. Please reach out to Maria/Gengbin if you need anything.
Thanks for letting us know. Best wishes!
> @rithwiktom The original pipeline algorithm will be replaced in #7529. Could you test and evaluate the performance of PR 7529? You need to set `MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD` to enable the pipeline path in the new PR.
Is this ready for testing?
Yes, the PR has been merged into main.
Do we need to set any env variables to enable this code path? Could you give us the command to run it? You can see the env variables that Rithwik used in the comment above.
Set MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 to make all messages of 1 MB and above use the new OFI RNDV path. There are 4 RNDV protocols:
- pipeline - the sender packs into genq chunks and sends each chunk using `fi_send`; the receiver receives each chunk and unpacks it into the receive buffer.
- RDMA read - only works when the sender does not require packing. The sender registers the send buffer; the receiver issues `fi_read` to get the data. If the receive buffer requires packing, it reads into genq chunk buffers and unpacks each chunk, similar to the pipeline receive.
- RDMA write - only works when the receiver does not require packing. The receiver registers the receive buffer and the sender issues `fi_write`. If the sender requires packing, it packs into genq buffers and writes from the chunks, similar to the pipeline send.
- direct - directly uses `fi_send`/`fi_recv` after the initial RTS/CTS. Only works when neither the sender nor the receiver requires packing.
Use MPIR_CVAR_CH4_OFI_RNDV_PROTOCOL to force the RNDV protocol (auto, pipeline, read, write, or direct). The default is auto, which selects the protocol based on whether the sender or receiver requires (or prefers) packing. If HMEM is not enabled, then all GPU buffers require packing, so for GPU-to-GPU transfers the pipeline protocol is used by default.
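For example, a run forcing each protocol in turn could look like the following sketch (paths and launch flags are reused from the reproducer above; note that read/write/direct only apply when the corresponding side does not require packing):

```bash
# Sketch: route >= 1 MB messages through the new RNDV path, then force each protocol.
# For D2D buffers without HMEM, all GPU buffers require packing, so only auto and
# pipeline are expected to be meaningful; the others are listed for completeness.
export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000
for proto in auto pipeline read write direct; do
    echo "=== MPIR_CVAR_CH4_OFI_RNDV_PROTOCOL=$proto ==="
    MPIR_CVAR_CH4_OFI_RNDV_PROTOCOL=$proto \
    mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 \
        ~/gpu_wrappers/2-2.sh $PATH_TO_OSU/pt2pt/osu_mbw_mr \
        -m 1048576:67108864 -i 100 -x 20 -d ze D D
done
```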
If multiple NICs are enabled (MPIR_CVAR_CH4_OFI_MAX_NICS), the pipeline will be distributed over all available NICs, achieving a striping effect. However, I have not been able to scale the multi-NIC performance on Aurora yet. Multi-bandwidth with multiple processes on each node scales perfectly.
MPIR_CVAR_CH4_OFI_PIPELINE_CHUNK_SZ and MPIR_CVAR_CH4_OFI_PIPELINE_NUM_CHUNKS control the pipelining. RDMA read/write uses the same CVARs when it does pipelined read/write.
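If tuning data would help, a sweep over these two CVARs could be scripted as below (a sketch; the chunk sizes and counts are placeholders, not recommendations):

```bash
# Sketch: sweep pipeline chunk size and chunk count to see where bandwidth saturates.
export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000
for sz in 65536 262144 1048576; do
    for n in 2 4 8; do
        echo "=== CHUNK_SZ=$sz NUM_CHUNKS=$n ==="
        MPIR_CVAR_CH4_OFI_PIPELINE_CHUNK_SZ=$sz \
        MPIR_CVAR_CH4_OFI_PIPELINE_NUM_CHUNKS=$n \
        mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 \
            ~/gpu_wrappers/2-2.sh $PATH_TO_OSU/pt2pt/osu_mbw_mr \
            -m 1048576:67108864 -i 100 -x 20 -d ze D D
    done
done
```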
MPIR_CVAR_CH4_OFI_GPU_{SEND,RECV}_ENGINE_TYPE controls the ZE copy engine (for pipeline packing). (Side question: what is the reason we differentiate send and recv?)
MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE{,_H2D,_D2H} control the thresholds for fast copy. The defaults definitely need adjustment for the current Aurora. See https://github.com/pmodels/mpich/pull/7541
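To keep the knobs from the last few paragraphs in one place, an illustrative environment for experimenting with the new path might look like the sketch below; every value is a placeholder to be tuned, not a recommended default:

```bash
# Sketch: tuning knobs for the new OFI RNDV/pipeline path, collected in one place.
# All values below are placeholders for experimentation.
export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000     # messages >= 1 MB take the RNDV path
export MPIR_CVAR_CH4_OFI_RNDV_PROTOCOL=auto          # auto | pipeline | read | write | direct
export MPIR_CVAR_CH4_OFI_PIPELINE_CHUNK_SZ=1048576   # pipeline chunk size
export MPIR_CVAR_CH4_OFI_PIPELINE_NUM_CHUNKS=8       # chunks in flight
export MPIR_CVAR_CH4_OFI_GPU_SEND_ENGINE_TYPE=1      # ZE copy engine for send-side packing
export MPIR_CVAR_CH4_OFI_GPU_RECV_ENGINE_TYPE=1      # ZE copy engine for recv-side unpacking
export MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_H2D=4096     # fast-copy cutoffs; defaults under review in PR 7541
export MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_D2H=4096
export MPIR_CVAR_CH4_OFI_MAX_NICS=8                  # > 1 stripes the pipeline over NICs
```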
@hzhou, I have two asks: a) I am going to ask Tom Musta to collect performance data. He is not an MPICH expert. Could you generate a binary for him so that he can test it, and let him know the location? b) We need the exact env variables to test. My head is spinning after looking at your description. If needed, we can have a quick call and decide what to test. Let us start small and add more testing if needed.
On Sunspot the branch (aurora_test) has been loaded as the default, so he can test it there directly. @servesh is building the module for Aurora as well. I think once he finishes, it will be the default on Aurora too. @servesh please confirm.
He only needs to set MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 to perform the tests described at the head of this issue; a concrete command is sketched below. @servesh, is the CVAR set in the lmod module? If so, he won't need to set anything.
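Concretely, the minimal run could be something like this sketch, reusing the wrapper and OSU paths from the top of the issue:

```bash
# Sketch: minimal test of the new RNDV path -- only one non-default CVAR is needed.
export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000
mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 \
    ~/gpu_wrappers/2-2.sh $PATH_TO_OSU/pt2pt/osu_mbw_mr \
    -m 1:67108864 -i 100 -x 20 -d ze D D
```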
The next PE update on Sunspot will have this set in the mpich lmod module. I am focused on testing things on Sunspot at the moment.
I do not have a Sunspot account (at least I've never logged in). Hopefully it will make its way to Aurora as a preview after some testing on Sunspot. Or I can see if Chris C is able to pull modules from Sunspot back to Borealis.
@hzhou, @tommusta re-ran the experiments on Aurora. Below is a summary and his comments.
As a summary, the old Intel-provided code performs as expected, and the immediate command list delivers better performance. The results from 5.0.0.aurora_test.06f012a (the new version) show lower bandwidth and never reach the peak of 47 GB/s. In addition, the middle range shows a significant difference: at 64 KiB, the Intel version delivers 17-21 GB/s while the new version delivers 2-3.1 GB/s, about a 10x gap.
OSU_MBW_MR measurements per yesterday's discussion. These were done on Aurora using 8 samples, reporting the median value (bandwidth in MB/s; ICL = MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST). The first measurement is from the mpich/opt/4.2.3-intel MPICH module (aka the AT-era MPICH from Intel):
```
+----------+----------+----------+
| Size | ICL=0 | ICL=1 |
+----------+----------+----------+
| 1 | 1.62 | 1.63 |
| 2 | 3.24 | 3.27 |
| 4 | 6.42 | 6.50 |
| 8 | 12.93 | 13.09 |
| 16 | 25.83 | 26.03 |
| 32 | 51.77 | 52.49 |
| 64 | 101.15 | 101.67 |
| 128 | 124.06 | 126.12 |
| 256 | 122.42 | 125.16 |
| 512 | 137.31 | 141.77 |
| 1024 | 146.12 | 151.05 |
| 2048 | 692.38 | 725.57 |
| 4096 | 1325.94 | 1448.42 |
| 8192 | 2631.66 | 2905.80 |
| 16384 | 5078.04 | 5723.12 |
| 32768 | 9537.53 | 11321.80 |
| 65536 | 17070.97 | 21104.67 |
| 131072 | 28321.90 | 34869.38 |
| 262144 | 42304.64 | 43309.90 |
| 524288 | 45953.12 | 46047.65 |
| 1048576 | 46717.84 | 46748.71 |
| 2097152 | 47068.70 | 47094.07 |
| 4194304 | 38219.88 | 47276.04 |
| 8388608 | 38280.85 | 47369.82 |
| 16777216 | 38313.15 | 47416.74 |
| 33554432 | 38324.17 | 47427.94 |
| 67108864 | 38318.54 | 47421.48 |
+----------+----------+----------+
```
These results were captured with these environment settings:
```bash
mpiexec -hostfile $PBS_NODEFILE -n $((NNODES * 2)) -ppn 2 --env FI_CXI_DEFAULT_CQ_SIZE=8192 --env ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 --env ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE --env MOSAPPS_TOPOLOGY_FILE=aurora-ecb.lscpu --env CMAKE_ROOT=/home/sys_seth/hpval/qmcpack/cmake --env FI_CXI_RDZV_THRESHOLD=131072 --env EnableImplicitScaling=0 --env NEOReadDebugKeys=1 --env ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 --env MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1 --env MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_THRESHOLD=0 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_NUM_BUFFERS_PER_CHUNK=4 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_MAX_NUM_BUFFERS=4 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_D2H_ENGINE_TYPE=1 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_H2D_ENGINE_TYPE=1 --env MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=${UICL} --env OMP_NUM_THREADS=1 -cpu-bind list:2:15 /home/tmusta/apps/osu/wrapper.sh /home/tmusta/apps/osu/binaries/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d sycl D D
```
Comment: These results are as expected. Next, we have data from 5.0.0.aurora_test.06f012a (same units and ICL meaning as above):
```
+----------+----------+----------+
| Size | ICL=0 | ICL=1 |
+----------+----------+----------+
| 1 | 1.53 | 1.52 |
| 2 | 3.07 | 3.05 |
| 4 | 6.11 | 6.07 |
| 8 | 12.30 | 12.20 |
| 16 | 20.02 | 19.75 |
| 32 | 40.01 | 39.47 |
| 64 | 80.55 | 80.41 |
| 128 | 154.89 | 155.65 |
| 256 | 322.99 | 322.50 |
| 512 | 583.43 | 592.40 |
| 1024 | 1049.90 | 1064.42 |
| 2048 | 1462.70 | 1473.45 |
| 4096 | 1028.30 | 1081.13 |
| 8192 | 902.06 | 873.67 |
| 16384 | 1015.62 | 996.09 |
| 32768 | 1098.14 | 1089.92 |
| 65536 | 3144.37 | 1980.47 |
| 131072 | 5097.83 | 2937.18 |
| 262144 | 5637.07 | 3847.45 |
| 524288 | 5721.43 | 4261.33 |
| 1048576 | 5157.14 | 4993.18 |
| 2097152 | 9363.45 | 9281.38 |
| 4194304 | 10915.48 | 11517.33 |
| 8388608 | 17778.07 | 18588.96 |
| 16777216 | 25815.44 | 26747.29 |
| 33554432 | 35346.85 | 35174.10 |
| 67108864 | 40698.43 | 40492.79 |
+----------+----------+----------+
```
These results were captured with the following settings:
```bash
mpiexec -hostfile $PBS_NODEFILE -n $((NNODES * 2)) -ppn 2 --env MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=${UICL} --env OMP_NUM_THREADS=1 -cpu-bind list:2:15 /home/tmusta/apps/osu/wrapper.sh /home/tmusta/apps/osu/binaries/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d sycl D D
```
@tommusta @garzaran, could you run a test with just 2 processes (i.e. osu_bw)? I am also not getting the same numbers I got in https://github.com/pmodels/mpich/pull/7529#issuecomment-3161832162; I am getting a maximum of 16.9 GB/s rather than 24 GB/s. We need to identify where we are spending the latency. Your reference numbers, especially with the mpich/opt/4.2.3-intel MPICH module, will be helpful.
For reference, here is what I get with `mpiexec -cpu-bind list:2 -n 2 ./tmusta/osu_mbw_mr -m 1:16777216 -d sycl D D`:
| Size (bytes) | MB/s | Messages/s |
|---|---|---|
| 1 | 0.86 | 855,239.15 |
| 2 | 1.72 | 858,624.41 |
| 4 | 3.44 | 859,601.43 |
| 8 | 6.88 | 860,321.44 |
| 16 | 13.75 | 859,169.73 |
| 32 | 27.55 | 860,911.19 |
| 64 | 36.52 | 570,676.76 |
| 128 | 42.17 | 329,418.01 |
| 256 | 45.62 | 178,194.31 |
| 512 | 48.83 | 95,363.93 |
| 1,024 | 39.42 | 38,498.89 |
| 2,048 | 472.22 | 230,578.57 |
| 4,096 | 602.04 | 146,983.52 |
| 8,192 | 752.62 | 91,872.32 |
| 16,384 | 1,192.92 | 72,810.11 |
| 32,768 | 2,230.35 | 68,064.74 |
| 65,536 | 3,798.41 | 57,959.09 |
| 131,072 | 5,951.31 | 45,404.93 |
| 262,144 | 8,765.53 | 33,437.86 |
| 524,288 | 11,157.01 | 21,280.30 |
| 1,048,576 | 12,846.50 | 12,251.38 |
| 2,097,152 | 12,149.80 | 5,793.48 |
| 4,194,304 | 11,347.05 | 2,705.35 |
| 8,388,608 | 10,902.85 | 1,299.72 |
| 16,777,216 | 10,880.37 | 648.52 |
@hzhou, which version of MPICH are you using? And do you want us to collect data for the command above with the same two MPICH versions that Tom used?
I was using the latest main branch. After switching to commit 06f012a, I got my original bandwidth back (24 GB/s at very large message sizes). Here is the data with `mpiexec -cpu-bind list:2 -n 2 ./tmusta/osu_mbw_mr -m 1024:16777216 -d sycl D D`:
| Size (bytes) | MB/s |
|---|---|
| 1,024 | 23.47 |
| 2,048 | 27.68 |
| 4,096 | 32.46 |
| 8,192 | 310.01 |
| 16,384 | 596.29 |
| 32,768 | 1,119.84 |
| 65,536 | 2,185.50 |
| 131,072 | 4,197.85 |
| 262,144 | 7,740.66 |
| 524,288 | 12,786.02 |
| 1,048,576 | 19,095.17 |
| 2,097,152 | 17,133.38 |
| 4,194,304 | 16,141.68 |
| 8,388,608 | 15,642.90 |
| 16,777,216 | 18,493.83 |
@garzaran If Tom could collect the same data using mpich/opt/4.2.3-intel MPICH module (aka AT era MPICH from Intel), that would be helpful.
@hzhou, I am not sure where you see the peak BW. The max I see is 18.5 GB/s. Shouldn't you get something around 23 GB/s?
I am getting the data with the bench tests in MPICH's testsuite. The binaries are here:
```
-rwxr-xr-x 1 hzhou users 152544 Nov 5 20:00 /home/hzhou/pull_requests/mpich7536/test/mpi/bench/p2p_bw
-rwxr-xr-x 1 hzhou users 152312 Nov 5 20:00 /home/hzhou/pull_requests/mpich7536/test/mpi/bench/p2p_one
```
The results with commit 06f012a are:
```
TEST p2p_one:
msgsize latency(sec) bandwidth(GB/s)
1000000000 0.064 15.560
1000000000 0.042 23.900
1000000000 0.042 23.947
1000000000 0.042 23.940
1000000000 0.042 23.948
No Errors
TEST p2p_bw:
msgsize latency(us) sigma(us) bandwidth(MB/s)
0 0.531 0.015 0.000
1 1.296 0.005 0.771
2 1.301 0.005 1.537
4 1.304 0.005 3.067
8 1.312 0.005 6.097
16 1.298 0.005 12.329
32 1.295 0.004 24.708
64 2.134 0.027 29.993
128 3.817 0.028 33.535
256 7.145 0.035 35.832
512 13.915 0.052 36.796
1024 37.176 0.090 27.544
2048 60.551 0.141 33.823
4096 110.509 0.323 37.065
8192 5.822 0.084 1406.994
16384 7.418 0.031 2208.559
32768 10.114 0.029 3239.973
65536 13.385 0.060 4896.348
131072 16.027 0.054 8178.205
262144 27.349 0.072 9585.310
524288 68.437 0.848 7660.832
1048576 112.309 0.147 9336.517
2097152 154.190 0.255 13601.113
4194304 240.641 0.304 17429.736
8388608 416.408 0.600 20145.186
16777216 765.278 1.553 21923.048
33554432 1474.806 2.132 22751.762
67108864 2875.220 3.307 23340.424
No Errors
```
EDIT: I was using MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000
EDIT2: commands I was using:
```bash
mpiexec --cpu-bind list:2 -n 2 ./p2p_one -sendmem=device -recvmem=device
mpiexec --cpu-bind list:2 -n 2 ./p2p_bw -sendmem=device -recvmem=device
```
EDIT3: The jump at the 4096 message size is due to the default MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE. Hmm, I need to check why that is. In theory, it should use MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_H2D and MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_D2H instead.
Never mind. That is because this is commit 06f012a. We fixed the copy direction in https://github.com/pmodels/mpich/pull/7541
EDIT4: Okay, with current main, I am able to get the same performance back with MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_H2D=4096. The default value of 1048576 is obviously bad, and PR #7541 activated the usage of that CVAR.
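For the record, combining the settings from the edits above, the runs would look roughly like this sketch (same bench binaries as in EDIT2):

```bash
# Sketch: EDIT2 commands plus the EDIT1/EDIT4 CVAR settings, on current main.
export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000         # as in EDIT1
export MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_H2D=4096      # as in EDIT4
mpiexec --cpu-bind list:2 -n 2 ./p2p_one -sendmem=device -recvmem=device
mpiexec --cpu-bind list:2 -n 2 ./p2p_bw -sendmem=device -recvmem=device
```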
I acknowledge there are some latency issues at smaller message sizes, which I am trying to troubleshoot.