
[Aurora] GPU pipelining failure/performance degradation

Open rithwiktom opened this issue 6 months ago • 28 comments

When using mpich/opt/develop-git.6037a7a on Aurora, I notice that the following test crashes for large message sizes (~64MB) when GPU pipelining is turned on. For smaller message sizes, there is a performance regression. mpich/opt/4.2.3-intel does not seem to have this issue.

This is the performance up to 4MB with the two versions:

+--------------+----------------------------+--------------------+
| Size (bytes) | GPU PIPLN /                | GPU PIPLN /        |
|              | develop-git.6037a7a (MB/s) | 4.2.3-intel (MB/s) |
+--------------+----------------------------+--------------------+
| 1            | 1                          | 1.61               |
| 2            | 1.59                       | 3.23               |
| 4            | 3.19                       | 6.45               |
| 8            | 6.38                       | 12.96              |
| 16           | 12.78                      | 25.89              |
| 32           | 25.57                      | 51.75              |
| 64           | 50.78                      | 100.14             |
| 128          | 36.33                      | 123.29             |
| 256          | 94.03                      | 121.6              |
| 512          | 98.78                      | 134.72             |
| 1024         | 106.68                     | 143.59             |
| 2048         | 111.51                     | 732.25             |
| 4096         | 113.87                     | 1472.69            |
| 8192         | 109.69                     | 2957.42            |
| 16384        | 110.34                     | 5861.81            |
| 32768        | 107.82                     | 11590.16           |
| 65536        | 21586.53                   | 21417.23           |
| 131072       | 28338.11                   | 34983.17           |
| 262144       | 30891.85                   | 43363.4            |
| 524288       | 32059.07                   | 46077.84           |
| 1048576      | 28726.47                   | 46744.3            |
| 2097152      | 29452.7                    | 47071.55           |
| 4194304      | 29858.49                   | 47247.48           |
+--------------+----------------------------+--------------------+

This is the test:

export FI_CXI_RDZV_THRESHOLD=131072
export EnableImplicitScaling=0
export NEOReadDebugKeys=1
export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
export MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1

# Enable GPU pipelining
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_THRESHOLD=0
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_NUM_BUFFERS_PER_CHUNK=4
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_MAX_NUM_BUFFERS=4
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_D2H_ENGINE_TYPE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_H2D_ENGINE_TYPE=1

mpiexec -np 4 -ppn 2  --cpu-bind list:2:15  ~/gpu_wrappers/2-2.sh  $PATH_TO_OSU/pt2pt/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d ze D D 

The wrapper script used here is

#!/bin/bash
 
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
 
if [ $PALS_LOCAL_RANKID -eq 0 ]
then
    AFFINITY_MASK=0.0
    NIC_NUM=cxi0
elif [ $PALS_LOCAL_RANKID -eq 1 ]
then
    AFFINITY_MASK=1.0
    NIC_NUM=cxi1
fi
 
echo "[I am rank $PALS_RANKID] Localrank=$PALS_LOCAL_RANKID : Affinity mask = $AFFINITY_MASK, PREFERRED_NIC =  $NIC_NUM"
 
export ZE_AFFINITY_MASK=$AFFINITY_MASK
export FI_CXI_DEVICE_NAME=$NIC_NUM
 
# Invoke the main program
"$@"

rithwiktom avatar Jun 18 '25 19:06 rithwiktom

@rithwiktom It will be impossible to suggest configurations or verify regressions if we have to use dozens of non-default env variables. Please open a PR documenting the formally Intel-supported default options for these knobs on Intel GPUs.

servesh avatar Jun 23 '25 17:06 servesh

We should consider if setting MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1 by default is the right thing to do rather than require users (or modules) to set it on Aurora and other systems.

raffenet avatar Jun 25 '25 19:06 raffenet

https://github.com/pmodels/mpich/pull/7516 should fix the crash.

hzhou avatar Jul 22 '25 16:07 hzhou

Did you also check the performance?

garzaran avatar Jul 28 '25 20:07 garzaran

We should consider if setting MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1 by default is the right thing to do rather than require users (or modules) to set it on Aurora and other systems.

It used to be the case that using the immediate command list caused some workloads to crash. It seemed to be an L0 issue, so this was left off by default. However, with the new UMD, the bug may have been fixed, so we should reevaluate it.

zhenggb72 avatar Jul 28 '25 21:07 zhenggb72

@zhenggb72 It would be good, if there are such cases where some of these values aren't set by default, that specific JIRAs are opened and addressed. We won't be able to track regressions with non-standard options.

servesh avatar Jul 28 '25 21:07 servesh

Did you also check the performance?

I got ~15 GB/sec. Haven't investigated further yet.

hzhou avatar Jul 28 '25 22:07 hzhou

Yeah, this was many years ago; it was a very complicated bug with the workload, and it took the L0 team probably almost a year to track down. On the other hand, a regular command list can buffer the commands and submit them only once, which may actually reduce submission overhead. So there are pros and cons. Note that as L0 keeps optimizing the immediate command list, things may also have changed. That is why I suggest we re-evaluate the default.

zhenggb72 avatar Jul 28 '25 22:07 zhenggb72

@rithwiktom The original pipeline algorithm will be replaced in https://github.com/pmodels/mpich/pull/7529. Could you test and evaluate the performance of PR #7529? You need to set MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD to enable the pipeline path in the new PR.

hzhou avatar Aug 11 '25 14:08 hzhou

I suspect https://github.com/pmodels/mpich/pull/7168/commits/05883b6a6c652bb1cbdcd81f68fa34c9f27e0445 is the cause of the performance change, at least in the low-to-medium range. The jump at 32768 is indicative of the MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_D2H threshold added in that commit.
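
One quick way to test that hypothesis (a sketch only; it reuses the osu_mbw_mr command from the top of this issue, and the threshold value here is purely illustrative):

# Move the suspected fast-copy D2H threshold and see whether the jump at 32768 follows it
export MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_D2H=32768   # illustrative value, not a recommendation
mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 ~/gpu_wrappers/2-2.sh \
    $PATH_TO_OSU/pt2pt/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d ze D D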

hzhou avatar Aug 12 '25 03:08 hzhou

Hi @hzhou, I'm leaving Intel. Please reach out to Maria/Gengbin if you need anything.

rithwiktom avatar Aug 21 '25 22:08 rithwiktom

Thanks for letting us know. Best wishes!

hzhou avatar Aug 21 '25 22:08 hzhou

@rithwiktom The original pipeline algorithm will be replaced in #7529. Could you test and evaluate the performance of PR #7529? You need to set MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD to enable the pipeline path in the new PR.

Is this ready for testing?

garzaran avatar Aug 26 '25 18:08 garzaran

Yes, the PR has been merged into main.

hzhou avatar Aug 26 '25 18:08 hzhou

Do we need to use any env variable to enable this code path? Could you give us the command to run this code path? You can see the env variables that Rithwik used in the comment above.

garzaran avatar Aug 26 '25 18:08 garzaran

Set MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 to make all messages of 1MB and above use the new OFI RNDV path. There are 4 RNDV protocols:

  1. pipeline - the sender packs into genq chunks and sends each chunk using fi_send; the receiver receives each chunk and unpacks it into the receive buffer.
  2. RDMA read - only works when the sender does not require packing. The sender registers the send buffer; the receiver issues fi_read to get the data. If the receive buffer requires packing, the receiver reads into genq chunk buffers and unpacks each chunk, similar to the pipeline receive.
  3. RDMA write - only works when the receiver does not require packing. The receiver registers the receive buffer and the sender issues fi_write. If the sender requires packing, it packs into genq buffers and writes from the chunks, similar to the pipeline send.
  4. direct - directly use fi_send/fi_recv after the initial RTS/CTS. Only works when neither the sender nor the receiver requires packing.

Use MPIR_CVAR_CH4_OFI_RNDV_PROTOCOL to force an RNDV protocol (auto, pipeline, read, write, or direct). The default is auto, which selects the protocol based on whether the sender or receiver requires (or prefers) packing. If HMEM is not enabled, all GPU buffers will require packing, so GPU-to-GPU transfers will use the pipeline protocol by default.

If multiple NICs are enabled (MPIR_CVAR_CH4_OFI_MAX_NICS), the pipeline will be distributed over all available NICs, achieving a striping effect. However, I have not been able to scale multi-NIC performance on Aurora yet. Multi-pair bandwidth with multiple processes on each node scales perfectly.

MPIR_CVAR_CH4_OFI_PIPELINE_CHUNK_SZ and MPIR_CVAR_CH4_OFI_PIPELINE_NUM_CHUNKS control the pipelining. RDMA read/write uses the same CVARs when it does pipelined read/write.

MPIR_CVAR_CH4_OFI_GPU_{SEND,RECV}_ENGINE_TYPE controls the ZE copy engine (for pipeline packing). (Side question: what is the reason we differentiate send and recv?)

MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE{,_H2D,_D2H} controls the threshold for fast copy. The default definitely needs adjustment for the current Aurora. See https://github.com/pmodels/mpich/pull/7541
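
For reference, here is a minimal sketch of how these CVARs might be combined to exercise the new RNDV pipeline path with the osu_mbw_mr test from the top of this issue (the chunk size/count values are placeholders for illustration, not tuned recommendations):

# Route messages of 1MB and above through the new OFI RNDV path
export MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000
# Force the pipeline protocol instead of letting "auto" choose
export MPIR_CVAR_CH4_OFI_RNDV_PROTOCOL=pipeline
# Pipelining knobs (placeholder values, not tuned)
export MPIR_CVAR_CH4_OFI_PIPELINE_CHUNK_SZ=1048576
export MPIR_CVAR_CH4_OFI_PIPELINE_NUM_CHUNKS=4

mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 ~/gpu_wrappers/2-2.sh \
    $PATH_TO_OSU/pt2pt/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d ze D D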

hzhou avatar Aug 26 '25 19:08 hzhou

@hzhou , I have two asks: a) I am going to ask Tom Musta to collect performance data. He is not an MPICH expert; could you generate a binary for him to test and let him know its location? b) We need the exact env variables to test. My head is spinning after looking at your description. If needed, we can have a quick call and decide what to test. Let us start small and add more testing if needed.

garzaran avatar Aug 26 '25 20:08 garzaran

On Sunspot the branch (aurora_test) has been loaded as the default, so he can test it there directly. @servesh is building the module for Aurora as well. I think once he finishes, it will be the default on Aurora too. @servesh please confirm.

He only needs to set MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 to perform the tests described at the head of this issue. @servesh, is the CVAR set in the lmod module? If so, he won't need to set anything.

hzhou avatar Aug 26 '25 20:08 hzhou

The next PE update on Sunspot will have this set in the mpich lmod module. Focused on testing things on Sunspot ATM.

servesh avatar Aug 26 '25 21:08 servesh

I do not have a Sunspot account (at least I've never logged in)... Hopefully it will make its way to Aurora as a preview after some testing on Sunspot. Or I can see if Chris C is able to pull modules from Sunspot back to Borealis.

tommusta avatar Aug 28 '25 16:08 tommusta

@hzhou

@tommusta re-ran the experiments on Aurora. Below is a summary and his comments.

In summary, the old Intel-provided code performs as expected, and the immediate command list delivers better performance. The results from 5.0.0.aurora_test.06f012a (the new version) show lower bandwidth -- they never reach the peak BW of 47 GB/s. In addition, there is a significant performance difference in the middle range. Take 64 KiB: the Intel version delivers 17-21 GB/s, while the new version delivers 2-3.1 GB/s, roughly a 10x difference.

OSU_MBW_MR measurements per yesterday's discussion. First, these were done on Aurora using 8 samples and reporting the median value. The first measurement is from the mpich/opt/4.2.3-intel MPICH module (aka AT-era MPICH from Intel):

+----------+----------+----------+
| Size (bytes) | ICL=0 (MB/s) | ICL=1 (MB/s) |
+----------+----------+----------+
| 1 | 1.62 | 1.63 |
| 2 | 3.24 | 3.27 |
| 4 | 6.42 | 6.50 |
| 8 | 12.93 | 13.09 |
| 16 | 25.83 | 26.03 |
| 32 | 51.77 | 52.49 |
| 64 | 101.15 | 101.67 |
| 128 | 124.06 | 126.12 |
| 256 | 122.42 | 125.16 |
| 512 | 137.31 | 141.77 |
| 1024 | 146.12 | 151.05 |
| 2048 | 692.38 | 725.57 |
| 4096 | 1325.94 | 1448.42 |
| 8192 | 2631.66 | 2905.80 |
| 16384 | 5078.04 | 5723.12 |
| 32768 | 9537.53 | 11321.80 |
| 65536 | 17070.97 | 21104.67 |
| 131072 | 28321.90 | 34869.38 |
| 262144 | 42304.64 | 43309.90 |
| 524288 | 45953.12 | 46047.65 |
| 1048576 | 46717.84 | 46748.71 |
| 2097152 | 47068.70 | 47094.07 |
| 4194304 | 38219.88 | 47276.04 |
| 8388608 | 38280.85 | 47369.82 |
| 16777216 | 38313.15 | 47416.74 |
| 33554432 | 38324.17 | 47427.94 |
| 67108864 | 38318.54 | 47421.48 |
+----------+----------+----------+

These results were captured with these environment settings:

mpiexec -hostfile $PBS_NODEFILE -n $((NNODES * 2)) -ppn 2 --env FI_CXI_DEFAULT_CQ_SIZE=8192 --env ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 --env ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE --env MOSAPPS_TOPOLOGY_FILE=aurora-ecb.lscpu --env CMAKE_ROOT=/home/sys_seth/hpval/qmcpack/cmake --env FI_CXI_RDZV_THRESHOLD=131072 --env EnableImplicitScaling=0 --env NEOReadDebugKeys=1 --env ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 --env MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1 --env MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_THRESHOLD=0 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_NUM_BUFFERS_PER_CHUNK=4 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_MAX_NUM_BUFFERS=4 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_D2H_ENGINE_TYPE=1 --env MPIR_CVAR_CH4_OFI_GPU_PIPELINE_H2D_ENGINE_TYPE=1 --env MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=${UICL} --env OMP_NUM_THREADS=1 -cpu-bind list:2:15 /home/tmusta/apps/osu/wrapper.sh /home/tmusta/apps/osu/binaries/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d sycl D D

Comment: These results are as expected. Next, we have data from 5.0.0.aurora_test.06f012a:

+----------+----------+----------+
| Size (bytes) | ICL=0 (MB/s) | ICL=1 (MB/s) |
+----------+----------+----------+
| 1 | 1.53 | 1.52 |
| 2 | 3.07 | 3.05 |
| 4 | 6.11 | 6.07 |
| 8 | 12.30 | 12.20 |
| 16 | 20.02 | 19.75 |
| 32 | 40.01 | 39.47 |
| 64 | 80.55 | 80.41 |
| 128 | 154.89 | 155.65 |
| 256 | 322.99 | 322.50 |
| 512 | 583.43 | 592.40 |
| 1024 | 1049.90 | 1064.42 |
| 2048 | 1462.70 | 1473.45 |
| 4096 | 1028.30 | 1081.13 |
| 8192 | 902.06 | 873.67 |
| 16384 | 1015.62 | 996.09 |
| 32768 | 1098.14 | 1089.92 |
| 65536 | 3144.37 | 1980.47 |
| 131072 | 5097.83 | 2937.18 |
| 262144 | 5637.07 | 3847.45 |
| 524288 | 5721.43 | 4261.33 |
| 1048576 | 5157.14 | 4993.18 |
| 2097152 | 9363.45 | 9281.38 |
| 4194304 | 10915.48 | 11517.33 |
| 8388608 | 17778.07 | 18588.96 |
| 16777216 | 25815.44 | 26747.29 |
| 33554432 | 35346.85 | 35174.10 |
| 67108864 | 40698.43 | 40492.79 |
+----------+----------+----------+

These results were captured with the following settings:

mpiexec -hostfile $PBS_NODEFILE -n $((NNODES * 2)) -ppn 2 --env MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=${UICL} --env OMP_NUM_THREADS=1 -cpu-bind list:2:15 /home/tmusta/apps/osu/wrapper.sh /home/tmusta/apps/osu/binaries/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d sycl D D

garzaran avatar Oct 29 '25 20:10 garzaran

@tommusta @garzaran , could you run a test with just 2 processes (i.e. osu_bw)? I am also not getting the same numbers as in https://github.com/pmodels/mpich/pull/7529#issuecomment-3161832162. I am getting a maximum of 16.9 GB/s rather than the maximum of 24 GB/s. We need to identify where we are spending the latency. Your reference numbers, especially with the mpich/opt/4.2.3-intel MPICH module, will be helpful.

hzhou avatar Nov 05 '25 17:11 hzhou

For reference, here is what I get with mpiexec -cpu-bind list:2 -n 2 ./tmusta/osu_mbw_mr -m 1:16777216 -d sycl D D

Size (bytes) MB/s Messages/s
1 0.86 855,239.15
2 1.72 858,624.41
4 3.44 859,601.43
8 6.88 860,321.44
16 13.75 859,169.73
32 27.55 860,911.19
64 36.52 570,676.76
128 42.17 329,418.01
256 45.62 178,194.31
512 48.83 95,363.93
1,024 39.42 38,498.89
2,048 472.22 230,578.57
4,096 602.04 146,983.52
8,192 752.62 91,872.32
16,384 1,192.92 72,810.11
32,768 2,230.35 68,064.74
65,536 3,798.41 57,959.09
131,072 5,951.31 45,404.93
262,144 8,765.53 33,437.86
524,288 11,157.01 21,280.30
1,048,576 12,846.50 12,251.38
2,097,152 12,149.80 5,793.48
4,194,304 11,347.05 2,705.35
8,388,608 10,902.85 1,299.72
16,777,216 10,880.37 648.52

hzhou avatar Nov 05 '25 17:11 hzhou

@hzhou , which version of MPICH are you using? So, you want us to collect data for the command above with the same two MPICH versions as what Tom had collected?

garzaran avatar Nov 05 '25 18:11 garzaran

I was using the latest main branch. Now switching to the commit 06f012a, I got my original bandwidth back (24 GB/sec at very large message sizes). Here is the data with mpiexec -cpu-bind list:2 -n 2 ./tmusta/osu_mbw_mr -m 1024:16777216 -d sycl D D:

Size (bytes) MB/s
1,024 23.47
2,048 27.68
4,096 32.46
8,192 310.01
16,384 596.29
32,768 1,119.84
65,536 2,185.50
131,072 4,197.85
262,144 7,740.66
524,288 12,786.02
1,048,576 19,095.17
2,097,152 17,133.38
4,194,304 16,141.68
8,388,608 15,642.90
16,777,216 18,493.83

@garzaran If Tom could collect the same data using mpich/opt/4.2.3-intel MPICH module (aka AT era MPICH from Intel), that would be helpful.

hzhou avatar Nov 05 '25 18:11 hzhou

@hzhou , not sure where you see the peak BW. The max I see is 18.5 GB/s. Shouldn't you get something around 23 GB/s?

garzaran avatar Nov 05 '25 19:11 garzaran

I am getting the data with the bench tests in MPICH's testsuite. The binaries are here:

-rwxr-xr-x 1 hzhou users 152544 Nov  5 20:00 /home/hzhou/pull_requests/mpich7536/test/mpi/bench/p2p_bw 
-rwxr-xr-x 1 hzhou users 152312 Nov  5 20:00 /home/hzhou/pull_requests/mpich7536/test/mpi/bench/p2p_one

The results with commit 06f012a are:

TEST p2p_one:                                            
     msgsize latency(sec) bandwidth(GB/s)                
  1000000000      0.064       15.560                     
  1000000000      0.042       23.900                     
  1000000000      0.042       23.947                     
  1000000000      0.042       23.940                     
  1000000000      0.042       23.948                     
                                                         
 No Errors                                               
TEST p2p_bw:                                             
     msgsize    latency(us)  sigma(us)    bandwidth(MB/s)
           0      0.531      0.015            0.000      
           1      1.296      0.005            0.771      
           2      1.301      0.005            1.537      
           4      1.304      0.005            3.067      
           8      1.312      0.005            6.097      
          16      1.298      0.005           12.329      
          32      1.295      0.004           24.708      
          64      2.134      0.027           29.993      
         128      3.817      0.028           33.535      
         256      7.145      0.035           35.832      
         512     13.915      0.052           36.796      
        1024     37.176      0.090           27.544      
        2048     60.551      0.141           33.823      
        4096    110.509      0.323           37.065      
        8192      5.822      0.084         1406.994      
       16384      7.418      0.031         2208.559      
       32768     10.114      0.029         3239.973      
       65536     13.385      0.060         4896.348      
      131072     16.027      0.054         8178.205      
      262144     27.349      0.072         9585.310      
      524288     68.437      0.848         7660.832      
     1048576    112.309      0.147         9336.517      
     2097152    154.190      0.255        13601.113      
     4194304    240.641      0.304        17429.736      
     8388608    416.408      0.600        20145.186      
    16777216    765.278      1.553        21923.048      
    33554432   1474.806      2.132        22751.762      
    67108864   2875.220      3.307        23340.424      
                                                         
 No Errors                                               

EDIT: I was using MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000.

EDIT2: The commands I was using:

mpiexec --cpu-bind list:2 -n 2 ./p2p_one -sendmem=device -recvmem=device
mpiexec --cpu-bind list:2 -n 2 ./p2p_bw -sendmem=device -recvmem=device

EDIT3: The jump at the 4096 message size is due to the default MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE. Hmm, I need to check why that is. In theory, it should use MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_H2D and MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_D2H instead.

Never mind. That is because it is commit 06f012a. We fixed the copy direction in https://github.com/pmodels/mpich/pull/7541

EDIT4: Okay, with current main, I am able to get the same performance back with MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_H2D=4096. The default value of 1048576 is clearly bad, and PR #7541 activated the usage of that CVAR.
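
For reference, reproducing that on current main would look roughly like this (same p2p_bw invocation and eager threshold as in the edits above, with the H2D fast-copy threshold from EDIT4):

MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000 \
MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE_H2D=4096 \
    mpiexec --cpu-bind list:2 -n 2 ./p2p_bw -sendmem=device -recvmem=device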

hzhou avatar Nov 05 '25 20:11 hzhou

I acknowledge there are some latency issues at smaller message sizes, which I am trying to troubleshoot.

hzhou avatar Nov 05 '25 20:11 hzhou