Coll/HAN and Coll/Adapt not default on 5.0.x
We never bumped the priority of the HAN and ADAPT collective components on the 5.0.x branch.
I'm not submitting a PR right now (bumping the priority should be easy) because, at least on EFA, Allgather and Allreduce got considerably slower when using the HAN components. Might be user error, but need to dig more.
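For anyone who wants to experiment before the defaults change, the priorities can be raised per run or persistently without rebuilding; a minimal sketch (treat $OMPI_PREFIX as a placeholder for your install prefix):

# One-off: raise HAN/ADAPT above tuned for a single run
mpirun --mca coll_han_priority 100 --mca coll_adapt_priority 100 -np 64 ./app

# The same thing via environment variables
export OMPI_MCA_coll_han_priority=100
export OMPI_MCA_coll_adapt_priority=100

# Persistently, for every run against this install
# (a user-level ~/.openmpi/mca-params.conf works as well)
cat >> $OMPI_PREFIX/etc/openmpi-mca-params.conf <<EOF
coll_han_priority = 100
coll_adapt_priority = 100
EOF

# Check what the build's current defaults are
ompi_info --param coll han --level 9 | grep -i priority
ompi_info --param coll adapt --level 9 | grep -i priority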
I don't recall if there was any discussion of what priority they should be. @bosilca @janjust @gpaulsen @hppritcha @jsquyres
That was definitely the plan: to "preview" han/adapt in the 4.x series and then make it the default to replace "tuned" in 5.x.
That was my understanding as well.
So the priority for both should be higher than tuned.
Did one of han/adapt need to be a higher priority than the other? Or should they be the same priority?
IIRC, HAN has more collectives implemented. We are going to do some performance testing on HAN/Adapt/Tuned.
Before we change the priorities, someone with a device other than EFA really needs to run and see if there is benefit close to that promised in George's paper.
@bosilca can you post the performance numbers or provide a link for reference?
Not sure what I'm expected to provide here?
- Are there any known performance data for han/adapt compared to the current ompi defaults?
- Is there a link to the paper we can view?
- Did you want to open the PR to raise the priorities? You probably have a better understanding of where these priorities should lie.
On EFA, we see essentially no performance difference between today's v5.0.x branch and running with --mca coll_han_priority 100 --mca coll_adapt_priority 100
on the OSU collective benchmarks. In some cases (allreduce in particular), it hurt performance. There are some oddities in EFA's performance on v5.0.x right now, so this may be because of EFA. My ask was that someone with a network that isn't EFA or TCP try the change and verify that the collectives actually do something good. Otherwise, we're advertising improvements that aren't going to be there.
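For anyone else volunteering a datapoint, the comparison being asked for is essentially an A/B run of the OSU collectives with and without the priority override; a minimal sketch (host file, rank counts, and paths are placeholders):

# Baseline: stock v5.0.x defaults (tuned et al.), e.g. 2 nodes x 36 ranks
mpirun -np 72 --hostfile ./hosts --map-by ppr:36:node \
    ./osu_allreduce > allreduce.default.txt

# Same run with HAN/ADAPT forced above the defaults
mpirun -np 72 --hostfile ./hosts --map-by ppr:36:node \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
    ./osu_allreduce > allreduce.han.txt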
I'm planning to investigate https://github.com/open-mpi/ompi/issues/9062 soon and will also look at the general performance of coll/adapt and coll/han on an IB system soon. Will report back once I have the numbers.
@bwbarrett is this with the OSU microbenchmarks collectives? IMB collectives? Other? Which versions?
We talked about this on the call today. @bwbarrett will be sending out some information to the devel list (and/or here) about what he ran for AWS.
A bunch of people on the call today agreed to run collective tests and see how HAN/ADAPT compared to tuned on their networks / environments. Bottom line: we need more data than just this single EFA datapoint:
- [ ] NVIDIA
- [ ] ORNL
- [ ] Cornelis
- [ ] IBM
- [ ] UTK
Cornelis OmniPath results, 2 nodes, 22 ranks per node. 1 run of each benchmark in each configuration so I haven't measured variance.
(ompi/v4.1.x, no coll/han coll/adapt MCA arguments):
+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
./mpi/collective/osu_allgather
# OSU MPI Allgather Latency Test v5.9
# Size Avg Latency(us)
1 9.54
2 9.73
4 9.95
8 11.39
16 11.91
32 12.65
64 12.58
128 15.75
256 22.78
512 37.90
1024 94.52
2048 176.84
4096 322.08
8192 642.53
16384 1251.69
32768 2312.39
65536 5103.28
131072 11283.70
262144 22618.81
524288 45343.99
1048576 90598.81
+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
./mpi/collective/osu_allreduce
# OSU MPI Allreduce Latency Test v5.9
# Size Avg Latency(us)
4 7.21
8 7.17
16 7.24
32 9.37
64 9.17
128 17.09
256 18.03
512 18.06
1024 19.10
2048 22.77
4096 29.90
8192 42.97
16384 80.84
32768 138.44
65536 242.87
131072 504.71
262144 974.78
524288 1922.23
1048576 3875.57
(ompi/v4.1.x, with coll/han coll/adapt MCA arguments):
+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
--mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allgather
# OSU MPI Allgather Latency Test v5.9
# Size Avg Latency(us)
1 10.67
2 10.34
4 11.25
8 11.57
16 10.35
32 11.39
64 12.60
128 15.61
256 21.30
512 40.22
1024 68.57
2048 128.42
4096 244.77
8192 463.33
16384 825.39
32768 1563.50
65536 3060.89
131072 7140.29
262144 16518.99
524288 32637.53
1048576 63536.21
+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
--mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allreduce
# OSU MPI Allreduce Latency Test v5.9
# Size Avg Latency(us)
4 8.44
8 8.29
16 9.22
32 11.92
64 11.42
128 11.76
256 12.04
512 12.64
1024 14.87
2048 18.22
4096 26.91
8192 43.80
16384 62.63
32768 115.12
65536 223.10
131072 432.17
262144 735.30
524288 1545.48
1048576 3173.03
(ompi/main, no coll/han coll/adapt MCA arguments):
+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
./mpi/collective/osu_allgather
# OSU MPI Allgather Latency Test v5.9
# Size Avg Latency(us)
1 7.70
2 7.80
4 8.08
8 9.19
16 10.07
32 11.41
64 12.90
128 15.88
256 23.07
512 38.38
1024 95.73
2048 166.95
4096 327.29
8192 652.92
16384 1263.14
32768 2332.43
65536 5140.10
131072 11231.33
262144 22839.35
524288 45512.69
1048576 90426.54
+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
./mpi/collective/osu_allreduce
# OSU MPI Allreduce Latency Test v5.9
# Size Avg Latency(us)
4 7.76
8 7.64
16 7.53
32 9.33
64 9.24
128 17.82
256 17.74
512 18.11
1024 18.94
2048 23.07
4096 29.85
8192 42.78
16384 81.19
32768 133.46
65536 248.76
131072 510.12
262144 981.97
524288 1989.39
1048576 3901.91
(ompi/main, with coll/han coll/adapt MCA arguments):
+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
--mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allgather
# OSU MPI Allgather Latency Test v5.9
# Size Avg Latency(us)
1 9.43
2 8.97
4 9.89
8 10.20
16 10.74
32 12.23
64 13.29
128 16.28
256 22.00
512 40.14
1024 67.05
2048 126.75
4096 259.95
8192 448.19
16384 823.10
32768 1589.41
65536 3067.19
131072 7037.30
262144 16388.66
524288 32337.04
1048576 63411.71
+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
-host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
--mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allreduce
# OSU MPI Allreduce Latency Test v5.9
# Size Avg Latency(us)
4 12.36
8 11.79
16 13.08
32 14.66
64 14.34
128 14.52
256 14.82
512 15.36
1024 17.52
2048 20.18
4096 25.30
8192 36.60
16384 67.08
32768 114.37
65536 236.39
131072 407.80
262144 668.37
524288 1473.07
1048576 2795.97
x86 ConnectX-6 cluster, 32 nodes, 40 PPN, OMPI v5.0.x.
Any value above 0% is Adapt/HAN outperforming Tuned.
mpirun -np 1280 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --mca pml ucx --mca coll tuned,libnbc,basic -x UCX_WARN_UNUSED_ENV_VARS=n -x LD_LIBRARY_PATH ${exe}
mpirun -np 1280 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --mca pml ucx --mca coll adapt,han,libnbc,basic --mca coll_adapt_priority 100 --mca coll_han_priority 100 -x UCX_WARN_UNUSED_ENV_VARS=n -x LD_LIBRARY_PATH ${exe}
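For anyone reproducing these numbers, the plotted percentage can be computed directly from two OSU logs; a sketch, assuming tuned.txt and han.txt come from matching runs and that a positive value means Adapt/HAN was faster:

# Drop the OSU header lines, then compute 100*(tuned - han)/tuned per message size
grep -v '^#' tuned.txt > tuned.dat
grep -v '^#' han.txt   > han.dat
paste tuned.dat han.dat | awk '{ printf "%8d  %6.1f%%\n", $1, 100*($2-$4)/$2 }'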
Here are some runs with han + adapt compared to the current collective defaults using ob1 on POWER9, on the v5.0.x branch.
6 nodes at 16 ppn. Testing at 40 ppn seems to have similar (or maybe better) results for han, but I did not aggregate them. A negative % indicates that han/adapt did better; a positive % indicates it did worse.
Running on these same machines with mofed 4.9 + ucx 1.10.1 showed little to no difference when comparing the defaults v. han/adapt, so I didn't bother posting the graphs.
I seem to be running into issues running with han/adapt with the imb benchmarks, for example (with --map-by node)
# Iscatter
#------------------------------------------------------------------------------------------
# Benchmarking Iscatter
# #processes = 640
#------------------------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%] defects
0 11 0.00 0.00 0.00 0.00 0.00
1 11 0.00 0.00 0.00 0.00 0.00
2 11 0.00 0.00 0.00 0.00 0.00
4 11 0.00 0.00 0.00 0.00 0.00
8 11 0.00 0.00 0.00 0.00 0.00
16 11 0.00 0.00 0.00 0.00 0.00
32 11 0.00 0.00 0.00 0.00 0.00
64 11 0.00 0.00 0.00 0.00 0.00
128 11 0.00 0.00 0.00 0.00 0.00
256 11 0.00 0.00 0.00 0.00 0.00
512 11 0.00 0.00 0.00 0.00 0.00
1024 11 0.00 0.00 0.00 0.00 0.00
which is slightly worrisome. Has anyone else run IMB with han/adapt and gotten actual numbers? I get real numbers with --map-by core; --map-by node is what triggers the above.
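For reference, the failing case can be reproduced with something along these lines (a sketch: IMB-NBC is the nonblocking-collectives binary from the Intel MPI Benchmarks, and the host setup is whatever provides the 640 ranks shown above):

# With --map-by core the Iscatter numbers look sane; with --map-by node they come back all zero
mpirun -np 640 --map-by node \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
    ./IMB-NBC Iscatter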
@awlauria, your graphs don't have x,y labels. Is your y the same as @janjust 's y scale (ie. anything above 0 is han/adapt performing better, and anything below is worse)?
Oh, foo. You are right, that isn't very clear...
It's actually the opposite. The Y axis measures the performance improvement of HAN/Adapt vs. the defaults, and the X axis is message size. So for my graphs, anything below 0 means that HAN/Adapt's time for the same test was X% lower than the default; below 0 represents an improvement for HAN/Adapt.
It's still on my to-do list to re-run these to confirm my findings.
So it looks like HAN/Adapt outperform Tuned except at the largest message sizes, at least for your 6 node tests.
Correct. I would like confirmation of that. I will re-run these numbers, perhaps at a slightly larger scale. Will try to do that by the end of this week.
I'm also attaching some benchmarks I performed at some point. I have only experimented with bcast, reduce, allreduce, barrier.
Settings
v5.0.x (rc7, I believe)
map by & bind to core
pml=ob1 btl=sm,uct smsc=xpmem
For the fine-tuned configurations:
tuned fine-tuned
use_dynamic_rules=true
bcast_algorithm=7 (knomial)
bcast_algorithm_segmentsize=128K
bcast_algorithm_knomial_radix=2
reduce_algorithm=6 (in-order binary)
reduce_algorithm_segmentsize=128K
allreduce_algorithm=2 (nonoverlapping (reduce+bcast))
adapt fine-tuned
bcast_algorithm=2 (in_order_binomial)
bcast_segment_size=128K
reduce_algorithm=2 (in_order_binomial)
reduce_segment_size=128K
The <component>+<component> configurations imply HAN (format: <up module>+<low module>).
Full collection: plots.tar.gz
Some of them:
FYI at the moment, these 3 issues can impact HAN:
#10335 #10456 #10458
So if you attempt to use the MCA parameters and adjust the chosen sub-components, I suggest applying the fixes and/or verifying that the expected components are actually used.
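Until those fixes land, one quick sanity check that the expected components were actually selected is the coll framework's verbosity parameter; a sketch (the exact messages vary between releases):

# Print the coll component availability/selection decisions at communicator creation
mpirun -np 8 --mca coll_base_verbose 10 \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
    ./osu_allreduce 2>&1 | grep -i 'coll:'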
Sorry for the delay but here are some measurements for HAN on Hawk. I measured several different configurations, including 8, 24, 48, and 64 processes per node. I also ran with coll/sm as the backend but the differences seem minor. I also increased the HAN segment size to 256k (and did the same for coll/sm), which seems to have a positive impact on larger data sizes.
Takeaway: there are certain configurations where coll/tuned is faster than coll/han:
- For small messages at 1k procs on 16 nodes. I assume there is some algorithm change that plays nicely with the way OSU measures time.
- coll/han seems to consistently get slower for larger data sizes. The increased segment size has helped here, but further increasing the segment size didn't have an impact for me. That needs some more investigation.
- Overall, coll/han shows consistent performance whereas coll/tuned has huge variations between configurations. I assume this is an artifact of the opaque tuning decisions.
Unfortunately, not all runs were successful (all runs in that job aborted).
8 Procs per node:
24 Procs per node:
48 Procs per node:
64 Procs per node:
Here is how I configured my runs:
- coll/tuned: mpirun --mca coll ^hcoll --rank-by ${rankby} $mapby -N $npn -n $nprocs --bind-to core --mca coll_han_priority 0 --mca coll_hcoll_enable 0
- coll/han: mpirun --mca coll ^hcoll --rank-by ${rankby} $mapby -N $npn -n $nprocs --bind-to core --mca coll_han_priority 100 --mca coll_han_reduce_segsize $((256*1024)) --mca coll_hcoll_enable 0
- coll/han with coll/sm: mpirun --rank-by ${rankby} $mapby -N $npn -n $((nprocs)) --bind-to core --mca coll_han_priority 100 --mca coll_hcoll_enable 0 --mca coll_han_reduce_segsize $((256*1024)) --mca coll_sm_priority 80 --mca coll_sm_fragment_size $((260*1024))
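For the segment-size question above, a simple sweep over coll_han_reduce_segsize might look like the following (sizes are illustrative; the rest of the command line reuses the coll/han configuration from the list above):

# Sweep the HAN reduce segment size, keeping one OSU log per setting
for seg in $((64*1024)) $((128*1024)) $((256*1024)) $((512*1024)); do
    mpirun --mca coll ^hcoll --rank-by ${rankby} $mapby -N $npn -n $nprocs --bind-to core \
        --mca coll_han_priority 100 --mca coll_han_reduce_segsize $seg \
        --mca coll_hcoll_enable 0 \
        ./osu_allreduce > allreduce.seg${seg}.txt
done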
Closed by #11362 and #11389