
Coll/HAN and Coll/Adapt not default on 5.0.x

bwbarrett opened this issue 2 years ago • 24 comments

We never bumped the priority of the HAN and ADAPT collective components on the 5.0.x branch.

I'm not submitting a PR right now (bumping the priority should be easy) because, at least on EFA, Allgather and Allreduce got considerably slower when using the HAN components. Might be user error, but need to dig more.
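
For reference, forcing the components on for a test run is just a pair of MCA parameters; a sketch, where ./my_app is a placeholder and the actual priority values to ship as defaults are exactly the open question:

    mpirun -np 64 --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./my_app

The same two parameters could also go into <prefix>/etc/openmpi-mca-params.conf to make them the installation-wide default.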

bwbarrett avatar May 03 '22 03:05 bwbarrett

I don't recall if there was any discussion of what priority they should be. @bosilca @janjust @gpaulsen @hppritcha @jsquyres

awlauria avatar May 10 '22 16:05 awlauria

That was definitely the plan: to "preview" han/adapt in the 4.x series and then make it the default to replace "tuned" in 5.x.

jsquyres avatar May 10 '22 16:05 jsquyres

That was my understanding as well.

hppritcha avatar May 10 '22 16:05 hppritcha

So the priority for both should be higher than tuned.

Did one of han/adapt need to be a higher priority than the other? Or should they be the same priority?
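
For context, the priorities currently in effect can be dumped with ompi_info, e.g.:

    ompi_info --param coll all --level 9 | grep -i priority

(IIRC tuned's default priority is 30, so anything above that would take over.)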

awlauria avatar May 10 '22 16:05 awlauria

IIRC, HAN has more collectives implemented. We are going to do some performance testing on HAN/Adapt/Tuned.

wckzhang avatar May 10 '22 17:05 wckzhang

Before we change the priorities, someone with a device other than EFA really needs to run and see if there is benefit close to that promised in George's paper.

bwbarrett avatar May 11 '22 21:05 bwbarrett

@bosilca can you post the performance numbers or provide a link for reference?

awlauria avatar May 12 '22 14:05 awlauria

Not sure what I'm expected to provide here?

bosilca avatar May 12 '22 14:05 bosilca

  1. Are there any known performance data for han/adapt compared to the current ompi defaults?
  2. Is there a link to the paper we can view?
  3. Did you want to open the PR to raise the priorities? You probably have a better understanding of where these priorities should lie.

awlauria avatar May 12 '22 15:05 awlauria

On EFA, we see essentially no performance difference between today's v5.0.x branch and running with --mca coll_han_priority 100 --mca coll_adapt_priority 100 on the OSU collective benchmarks. In some cases (allreduce in particular), it hurt performance. There are some oddities in EFA's performance on v5.0.x right now, so this may be because of EFA. My ask was that someone with a network that isn't EFA or TCP try the change and verify that the collectives actually do something good. Otherwise, we're advertising improvements that aren't going to be there.
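
In other words, the comparison was essentially an A/B run of the OSU binaries with and without the forced priorities, along these lines (process count, hosts, and paths are placeholders here):

    # stock v5.0.x defaults
    mpirun -np $NP ./osu_allreduce
    # HAN/ADAPT forced on
    mpirun -np $NP --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./osu_allreduce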

bwbarrett avatar May 13 '22 17:05 bwbarrett

I'm planning to investigate https://github.com/open-mpi/ompi/issues/9062 soon and will also look at the general performance of coll/adapt and coll/han on an IB system soon. Will report back once I have the numbers.

devreal avatar May 13 '22 17:05 devreal

@bwbarrett is this with the OSU microbenchmark collectives? IMB collectives? Something else? Which versions?

BrendanCunningham avatar May 24 '22 15:05 BrendanCunningham

We talked about this on the call today. @bwbarrett will be sending out some information to the devel list (and/or here) about what he ran for AWS.

A bunch of people on the call today agreed to run collective tests and see how HAN/ADAPT compared to tuned on their networks / environments. Bottom line: we need more data than just this single EFA datapoint:

  • [ ] NVIDIA
  • [ ] ORNL
  • [ ] Cornelis
  • [ ] IBM
  • [ ] UTK

jsquyres avatar May 24 '22 15:05 jsquyres

Cornelis OmniPath results, 2 nodes, 22 ranks per node. 1 run of each benchmark in each configuration so I haven't measured variance.

(ompi/v4.1.x, no coll/han or coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                       9.54
2                       9.73
4                       9.95
8                      11.39
16                     11.91
32                     12.65
64                     12.58
128                    15.75
256                    22.78
512                    37.90
1024                   94.52
2048                  176.84
4096                  322.08
8192                  642.53
16384                1251.69
32768                2312.39
65536                5103.28
131072              11283.70
262144              22618.81
524288              45343.99
1048576             90598.81

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                       7.21
8                       7.17
16                      7.24
32                      9.37
64                      9.17
128                    17.09
256                    18.03
512                    18.06
1024                   19.10
2048                   22.77
4096                   29.90
8192                   42.97
16384                  80.84
32768                 138.44
65536                 242.87
131072                504.71
262144                974.78
524288               1922.23
1048576              3875.57

(ompi/v4.1.x, with coll/han and coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                      10.67
2                      10.34
4                      11.25
8                      11.57
16                     10.35
32                     11.39
64                     12.60
128                    15.61
256                    21.30
512                    40.22
1024                   68.57
2048                  128.42
4096                  244.77
8192                  463.33
16384                 825.39
32768                1563.50
65536                3060.89
131072               7140.29
262144              16518.99
524288              32637.53
1048576             63536.21

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                       8.44
8                       8.29
16                      9.22
32                     11.92
64                     11.42
128                    11.76
256                    12.04
512                    12.64
1024                   14.87
2048                   18.22
4096                   26.91
8192                   43.80
16384                  62.63
32768                 115.12
65536                 223.10
131072                432.17
262144                735.30
524288               1545.48
1048576              3173.03

(ompi/main, no coll/han or coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                       7.70
2                       7.80
4                       8.08
8                       9.19
16                     10.07
32                     11.41
64                     12.90
128                    15.88
256                    23.07
512                    38.38
1024                   95.73
2048                  166.95
4096                  327.29
8192                  652.92
16384                1263.14
32768                2332.43
65536                5140.10
131072              11231.33
262144              22839.35
524288              45512.69
1048576             90426.54

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                       7.76
8                       7.64
16                      7.53
32                      9.33
64                      9.24
128                    17.82
256                    17.74
512                    18.11
1024                   18.94
2048                   23.07
4096                   29.85
8192                   42.78
16384                  81.19
32768                 133.46
65536                 248.76
131072                510.12
262144                981.97
524288               1989.39
1048576              3901.91

(ompi/main, with coll/han and coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                       9.43
2                       8.97
4                       9.89
8                      10.20
16                     10.74
32                     12.23
64                     13.29
128                    16.28
256                    22.00
512                    40.14
1024                   67.05
2048                  126.75
4096                  259.95
8192                  448.19
16384                 823.10
32768                1589.41
65536                3067.19
131072               7037.30
262144              16388.66
524288              32337.04
1048576             63411.71

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                      12.36
8                      11.79
16                     13.08
32                     14.66
64                     14.34
128                    14.52
256                    14.82
512                    15.36
1024                   17.52
2048                   20.18
4096                   25.30
8192                   36.60
16384                  67.08
32768                 114.37
65536                 236.39
131072                407.80
262144                668.37
524288               1473.07
1048576              2795.97

BrendanCunningham avatar Jun 10 '22 20:06 BrendanCunningham

x86 ConnectX-6 cluster, 32 nodes, 40 PPN, OMPI v5.0.x.

Any value above 0% is Adapt/HAN outperforming Tuned.

mpirun -np 1280 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --mca pml ucx --mca coll tuned,libnbc,basic -x UCX_WARN_UNUSED_ENV_VARS=n -x LD_LIBRARY_PATH ${exe}

mpirun -np 1280 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --mca pml ucx --mca coll adapt,han,libnbc,basic --mca coll_adapt_priority 100 --mca coll_han_priority 100 -x UCX_WARN_UNUSED_ENV_VARS=n -x LD_LIBRARY_PATH ${exe}
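
For anyone reproducing the comparison, the percentages plotted below can be derived from two such OSU runs with something along these lines (a sketch; tuned.txt and han.txt are assumed to hold the respective benchmark outputs):

    # positive % == Adapt/HAN faster than Tuned, matching the convention above
    paste <(grep -E '^[0-9]' tuned.txt) <(grep -E '^[0-9]' han.txt) | \
        awk '{ printf "%8d  %+6.1f%%\n", $1, 100 * ($2 - $4) / $2 }'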

(Comparison graphs attached; not reproduced here.)

janjust avatar Jun 13 '22 19:06 janjust

Here are some runs with HAN + ADAPT compared to the current collective defaults, using ob1 on POWER9 with the v5.0.x branch.

6 nodes at 16 ppn. Testing at 40 ppn seems to give similar (or maybe better) results for HAN, but I did not aggregate those. A negative % indicates that HAN/ADAPT did better; a higher percentage means it did worse.

Running on these same machines with MOFED 4.9 + UCX 1.10.1 showed little to no difference between the defaults and HAN/ADAPT, so I didn't bother posting the graphs.

(Comparison graphs attached; not reproduced here.)

awlauria avatar Jun 14 '22 14:06 awlauria

I seem to be running into issues with han/adapt on the IMB benchmarks, for example (with --map-by node):

# Iscatter

#------------------------------------------------------------------------------------------
# Benchmarking Iscatter 
# #processes = 640 
#------------------------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]      defects
            0           11         0.00         0.00         0.00         0.00         0.00
            1           11         0.00         0.00         0.00         0.00         0.00
            2           11         0.00         0.00         0.00         0.00         0.00
            4           11         0.00         0.00         0.00         0.00         0.00
            8           11         0.00         0.00         0.00         0.00         0.00
           16           11         0.00         0.00         0.00         0.00         0.00
           32           11         0.00         0.00         0.00         0.00         0.00
           64           11         0.00         0.00         0.00         0.00         0.00
          128           11         0.00         0.00         0.00         0.00         0.00
          256           11         0.00         0.00         0.00         0.00         0.00
          512           11         0.00         0.00         0.00         0.00         0.00
         1024           11         0.00         0.00         0.00         0.00         0.00

which is slightly worrisome. Has anyone else run IMB with han/adapt and gotten actual numbers? I get real numbers with --map-by core; --map-by node is what triggers the above.

awlauria avatar Jun 15 '22 14:06 awlauria

@awlauria, your graphs don't have x/y labels. Is your y axis the same as @janjust's y scale (i.e., anything above 0 is han/adapt performing better, and anything below is worse)?

wckzhang avatar Jul 06 '22 15:07 wckzhang

Oh, foo. You are right, that isn't very clear...

It's actually the opposite. The Y axis measures the performance improvement of HAN/ADAPT vs. the defaults, and the X axis is message size. So for my graphs, anything below 0 means HAN/ADAPT's time for the same test was X% lower than the default's; below 0 represents an improvement for HAN/ADAPT.

It's still on my to-do list to re-run these to confirm my findings.

awlauria avatar Jul 06 '22 15:07 awlauria

So it looks like HAN/Adapt outperform Tuned except at the largest message sizes, at least for your 6 node tests.

wckzhang avatar Jul 06 '22 16:07 wckzhang

Correct. I would like confirmation of that. I will re-run these numbers, perhaps at a slightly larger scale. Will try to do that by the end of this week.

awlauria avatar Jul 06 '22 16:07 awlauria

I'm also attaching some benchmarks I performed at some point. I have only experimented with bcast, reduce, allreduce, barrier.

Settings

v5.0.x (rc7, I believe)
map by & bind to core
pml=ob1 btl=sm,uct smsc=xpmem

For the fine-tuned configurations:

tuned fine-tuned

use_dynamic_rules=true

bcast_algorithm=7 (knomial)
bcast_algorithm_segmentsize=128K
bcast_algorithm_knomial_radix=2

reduce_algorithm=6 (in-order binary)
reduce_algorithm_segmentsize=128K

allreduce_algorithm=2 (nonoverlapping (reduce+bcast))

adapt fine-tuned

bcast_algorithm=2 (in_order_binomial)
bcast_segment_size=128K

reduce_algorithm=2 (in_order_binomial)
reduce_segment_size=128K

The <component>+<component> configurations imply HAN (format: <up module>+<low module>)
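
For reference, assuming these map onto the usual coll_tuned_* MCA parameter names, the "tuned fine-tuned" configuration would translate to a command line roughly like this (process count and benchmark binary are placeholders):

    mpirun -np $NP --map-by core --bind-to core \
        --mca pml ob1 --mca btl sm,uct --mca smsc xpmem \
        --mca coll_tuned_use_dynamic_rules true \
        --mca coll_tuned_bcast_algorithm 7 \
        --mca coll_tuned_bcast_algorithm_segmentsize $((128*1024)) \
        --mca coll_tuned_bcast_algorithm_knomial_radix 2 \
        --mca coll_tuned_reduce_algorithm 6 \
        --mca coll_tuned_reduce_algorithm_segmentsize $((128*1024)) \
        --mca coll_tuned_allreduce_algorithm 2 \
        ./osu_bcast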

Full collection: plots.tar.gz

Some of them:

(Attached plots: dp-dam-bcast-8x, tie-allreduce-5x, tie-reduce-5x, dp-dam-barrier-10x.)

gkatev avatar Aug 11 '22 10:08 gkatev

FYI at the moment, these 3 issues can impact HAN:

#10335 #10456 #10458

So if you attempt to use the MCA parameters and adjust the chosen sub-components, I suggest applying the fixes and/or verifying that the expected components are actually used.
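
A quick way to do that verification (a sketch; the verbosity level is arbitrary and the benchmark binary is a placeholder):

    # print the coll framework's per-communicator component selection
    mpirun -np 2 --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
        --mca coll_base_verbose 10 ./osu_allreduce 2>&1 | grep -i 'coll:base'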

gkatev avatar Aug 18 '22 21:08 gkatev

Sorry for the delay but here are some measurements for HAN on Hawk. I measured several different configurations, including 8, 24, 48, and 64 processes per node. I also ran with coll/sm as the backend but the differences seem minor. I also increased the HAN segment size to 256k (and did the same for coll/sm), which seems to have a positive impact on larger data sizes.

Takeaway: there are certain configurations where coll/tuned is faster than coll/han:

  1. For small messages at 1k procs on 16 nodes. I assume there is some algorithm change that plays nicely with the way OSU measures time.
  2. coll/han seems to consistently get slower for larger data sizes. The increased segment size has helped here but further increasing the segment size didn't have an impact for me. That needs some more investigation.
  3. Overall, coll/han shows consistent performance whereas coll/tuned has huge variations between configurations. I assume this is an artifact of the opaque tuning decisions.

Unfortunately, not all runs were successful (all runs in that job aborted).

8 Procs per node:

(Plots attached: reduce_8_osu_han_8_8_1752645 hawk-pbs5, pdf 1–3.)

24 Procs per node:

(Plots attached: reduce_24_osu_han_8_24_1752649 hawk-pbs5, pdf 01–04.)

48 Procs per node:

(Plots attached: reduce_48_osu_han_8_48_1752653 hawk-pbs5, pdf 1–3.)

64 Procs per node:

(Plots attached: reduce_64_osu_han_8_64_1752657 hawk-pbs5, pdf 1–3; reduce_64_osu_han_8_64_1752774 hawk-pbs5, pdf 04.)

Here is how I configured my runs:

  • coll/tuned: mpirun --mca coll ^hcoll --rank-by ${rankby} $mapby -N $npn -n $nprocs --bind-to core --mca coll_han_priority 0 --mca coll_hcoll_enable 0
  • coll/han: mpirun --mca coll ^hcoll --rank-by ${rankby} $mapby -N $npn -n $nprocs --bind-to core --mca coll_han_priority 100 --mca coll_han_reduce_segsize $((256*1024)) --mca coll_hcoll_enable 0
  • coll/han with coll/sm: mpirun --rank-by ${rankby} $mapby -N $npn -n $((nprocs)) --bind-to core --mca coll_han_priority 100 --mca coll_hcoll_enable 0 --mca coll_han_reduce_segsize $((256*1024)) --mca coll_sm_priority 80 --mca coll_sm_fragment_size $((260*1024))

devreal avatar Aug 25 '22 03:08 devreal

Closed by #11362 and #11389

gpaulsen avatar Feb 14 '23 14:02 gpaulsen