ompi
Offload reduction operations to accelerator devices
This PR is an attempt to offload reduction operations in MPI_Allreduce to accelerator devices if the input buffer is located on a device.
A few notes:
- There is a heuristic to determine when to launch a kernel for the reduction and when to pull the data to the host and perform the reduction there. This has not been well tested, and the parameters probably need to be determined at startup (a rough sketch of this dispatch, together with the stream handling, is shown after this list).
- We need to pass streams through the call hierarchy so that copies and kernel launches can be stream-ordered. This requires changes to the operation API.
- Data movement on the device is expensive, so I adjusted the algorithms to use the 3buff variants of the ops whenever possible instead of moving data explicitly (see the second sketch below). That turned out to be beneficial on the host as well, especially for larger reductions.
- This is still WIP, but I'm hoping it can serve as a starting point for others working on device integration, as I have currently run out of time. I will try to rebase and fix conflicts soon.
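For illustration, here is a minimal sketch of the kind of dispatch the first two notes describe: pick between a device kernel and a host-staged reduction based on a size threshold, with all device work ordered on the caller's stream. The names (`reduce_sum_kernel`, `device_reduce_threshold`) and the threshold value are illustrative assumptions, not the API or heuristic actually introduced by this PR.

```cuda
/* Hypothetical sketch (not the PR's code): dispatch a SUM reduction either
 * to a device kernel or to a host-staged fallback, with all device work
 * ordered on the caller's stream. */
#include <cuda_runtime.h>
#include <stdlib.h>

/* crossover point; would ideally be measured/tuned at startup */
static size_t device_reduce_threshold = 4096;

__global__ void reduce_sum_kernel(const float *in, float *inout, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) inout[i] += in[i];
}

void reduce_sum(const float *in, float *inout, size_t n,
                int buffers_on_device, cudaStream_t stream)
{
    if (buffers_on_device && n >= device_reduce_threshold) {
        /* large message: reduce on the device, stream-ordered */
        int threads = 256;
        int blocks  = (int)((n + threads - 1) / threads);
        reduce_sum_kernel<<<blocks, threads, 0, stream>>>(in, inout, n);
    } else if (buffers_on_device) {
        /* small message: stage to the host, reduce there, copy the result back */
        float *h_in    = (float *)malloc(n * sizeof(float));
        float *h_inout = (float *)malloc(n * sizeof(float));
        cudaMemcpyAsync(h_in, in, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaMemcpyAsync(h_inout, inout, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        for (size_t i = 0; i < n; i++) h_inout[i] += h_in[i];
        cudaMemcpyAsync(inout, h_inout, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
        free(h_in);
        free(h_inout);
    } else {
        /* host buffers: plain host reduction */
        for (size_t i = 0; i < n; i++) inout[i] += in[i];
    }
}
```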
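And a rough illustration of the 3buff point: with the conventional two-buffer form (inout = inout op in), an algorithm that needs the result in a third buffer has to copy one operand into it first, which on a GPU means an extra device-to-device copy; the three-buffer form writes the result directly. The function names below are illustrative, not the actual op interface.

```cuda
#include <stddef.h>

/* two-buffer form: inout = inout + in; the destination must already hold
 * one of the operands, so a separate copy may be needed beforehand */
void op_sum_2buff(const float *in, float *inout, size_t n)
{
    for (size_t i = 0; i < n; i++) inout[i] += in[i];
}

/* three-buffer form: out = in1 + in2; the copy is fused into the reduction,
 * so no explicit data movement into the destination buffer is required */
void op_sum_3buff(const float *in1, const float *in2, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = in1[i] + in2[i];
}
```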
@devreal thank you for all of this work! Can I make a suggestion? This is a massive PR as it stands. Could we try to break it down into multiple smaller, more manageable pieces? E.g.
- a pr that contains the changes to the accelerator framework components
- a pr that contains the changes to the op framework
- ...
- the last one probably being the changes required to pull everything together and use the code
Ideally, even if a new feature in one of the components is not used initially, it can be reviewed and resolved independently, and if we do it right it shouldn't cause any issues as long as it's not used. I am more than happy to help with that process if you want.
@devreal Can you share how you configured the build? It seems that the C++ dependency is wrong when I build it.
@edgargabriel I agree, this should be split up. I will start with the accelerator framework.
@devreal We built with libfabric and collected some performance data from osu-micro-benchmarks on GPU instances (p4d.24xlarge).
- osu_reduce latency on a single node with 96 processes per node
$ mpirun -np 96 --use-hwthread-cpus --mca pml ob1 -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 50.89 18.27 89.23 1000
2 50.20 18.24 88.37 1000
4 49.70 17.48 105.89 1000
8 46.01 16.58 91.06 1000
16 44.55 16.75 79.57 1000
32 45.75 16.70 81.72 1000
64 45.17 16.56 81.88 1000
128 45.51 16.62 81.47 1000
256 45.46 16.61 81.15 1000
512 46.62 17.25 83.92 1000
1024 46.39 17.17 82.71 1000
2048 48.14 18.45 87.07 1000
4096 59.19 23.05 97.78 1000
8192 72.53 26.28 133.33 1000
16384 91.32 33.93 265.90 100
32768 115.79 42.75 200.79 100
65536 183.01 86.72 319.37 100
131072 337.60 153.89 597.25 100
262144 653.80 328.93 1140.21 100
524288 1413.32 883.30 2473.55 100
1048576 4315.48 1318.93 8312.18 100
$ mpirun -np 96 --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 52.15 23.16 92.76 1000
2 52.96 22.55 96.52 1000
4 51.36 21.01 103.20 1000
8 47.36 20.21 85.74 1000
16 45.91 20.14 78.97 1000
32 45.37 20.05 78.47 1000
64 45.63 17.62 79.62 1000
128 46.00 17.89 81.53 1000
256 46.86 18.01 84.94 1000
512 46.27 18.25 82.34 1000
1024 54.11 18.39 113.24 1000
2048 48.34 18.96 87.42 1000
4096 50.72 19.41 92.88 1000
8192 68.85 42.03 119.58 1000
16384 81.37 48.91 145.23 100
32768 114.14 65.15 215.77 100
65536 182.49 109.55 336.56 100
131072 331.24 182.98 600.06 100
262144 652.47 374.06 1201.86 100
524288 1373.64 862.05 2473.60 100
1048576 3890.68 1835.44 7693.43 100
- osu_reduce latency on 2 nodes with 96 processes per node
$ mpirun -np 192 --hostfile /home/ec2-user/PortaFiducia/hostfile --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 53.34 24.80 158.96 1000
2 52.71 23.64 127.55 1000
4 52.65 23.31 125.60 1000
8 52.79 22.93 128.73 1000
16 52.97 23.85 129.14 1000
32 53.07 23.99 131.40 1000
64 53.17 23.27 131.36 1000
128 53.91 23.22 160.13 1000
256 53.96 23.56 131.35 1000
512 54.57 23.78 136.56 1000
1024 55.48 24.05 134.61 1000
2048 58.17 25.93 167.34 1000
4096 60.06 28.02 144.41 1000
8192 80.41 53.82 175.23 1000
16384 96.45 64.75 188.37 100
32768 130.03 81.92 257.46 100
65536 216.15 137.95 670.17 100
131072 458.93 312.95 1157.73 100
262144 1099.14 807.35 2219.35 100
524288 3153.31 2125.70 4677.10 100
1048576 7281.71 6291.95 8827.05 100
- osu_allreduce latency on a single node with 96 processes per node
$ mpirun -np 96 --use-hwthread-cpus --mca pml ob1 -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1901.97 1895.70 1909.23 1000
2 1866.46 1856.51 1890.17 1000
4 1849.15 1846.66 1864.29 1000
8 1865.48 1857.94 1869.82 1000
16 1889.69 1887.12 1892.54 1000
32 1887.29 1874.57 1892.87 1000
64 1866.39 1862.39 1878.10 1000
128 1902.69 1881.43 1950.76 1000
256 6846.33 6153.07 7516.33 1000
512 4960.26 4586.52 5442.11 1000
1024 1869.71 1867.50 1879.32 1000
2048 1860.47 1850.71 1869.47 1000
4096 1920.09 1914.88 1925.71 1000
8192 1946.90 1940.31 1954.75 1000
16384 1954.01 1941.61 1966.76 100
32768 2169.78 2095.39 2231.91 100
65536 2208.15 2149.11 2267.01 100
131072 1294.21 1152.11 1401.04 100
262144 2070.86 1986.83 2133.47 100
524288 4630.58 4475.43 4727.98 100
1048576 9268.92 8822.77 9553.14 100
$ mpirun -np 96 --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1888.06 1874.91 1902.12 1000
2 1870.06 1867.39 1881.06 1000
4 1886.02 1881.17 1890.68 1000
8 1885.91 1878.96 1909.19 1000
16 1873.81 1852.46 1890.93 1000
32 1866.67 1857.62 1887.08 1000
64 1854.75 1848.20 1867.71 1000
128 1856.85 1851.18 1865.18 1000
256 1886.99 1879.35 1894.61 1000
512 1879.01 1875.44 1888.82 1000
1024 1879.15 1867.80 1897.18 1000
2048 1886.56 1876.13 1905.26 1000
4096 1891.85 1885.14 1909.16 1000
8192 1964.46 1946.55 1980.99 1000
16384 1996.35 1977.00 2010.00 100
32768 2081.54 2046.57 2107.62 100
65536 2267.33 2206.32 2321.61 100
131072 1600.58 1421.01 1698.15 100
262144 2057.70 1933.97 2141.18 100
524288 4341.36 4102.91 4491.09 100
1048576 9190.83 8568.45 9572.64 100
- osu_allreduce latency on 2 nodes with 96 processes per node
$ mpirun -np 192 --hostfile /home/ec2-user/PortaFiducia/hostfile --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1918.58 1898.68 1950.81 1000
2 1923.21 1902.49 1956.78 1000
4 1919.62 1898.32 1954.29 1000
8 1966.44 1945.62 1986.83 1000
16 1932.10 1911.17 1967.35 1000
32 1905.98 1885.25 1945.11 1000
64 1938.28 1916.89 1980.00 1000
128 1922.34 1898.96 1957.66 1000
256 1923.65 1900.70 1959.54 1000
512 2026.75 2003.67 2056.36 1000
1024 1939.01 1916.19 1975.81 1000
2048 1931.89 1907.90 1965.60 1000
4096 1944.01 1919.52 1970.58 1000
8192 1978.66 1956.17 2020.18 1000
16384 1972.88 1946.22 1994.44 100
32768 2315.21 2258.95 2487.19 100
65536 2493.70 2352.68 2578.54 100
131072 1861.39 1687.25 1978.94 100
262144 3510.34 3260.06 3757.74 100
524288 6721.11 6278.76 6963.19 100
1048576 14090.28 13764.27 14250.96 100
We found that ireduce and iallreduce segfault with:
[ip-172-31-26-243.ec2.internal:06486] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.ip-172-31-26-243.1000/jf.0/1369899008/shared_mem_cuda_pool.ip-172-31-26-243 could be created.
[ip-172-31-26-243.ec2.internal:06486] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
ERROR: No suitable module for op MPI_SUM on type MPI_CHAR found for device memory!
On a single node with UCX:
$ mpirun -np 96 --use-hwthread-cpus --mca pml ucx /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 49.54 41.59 57.18 1000
2 49.45 41.00 58.08 1000
4 49.67 41.47 57.88 1000
8 50.98 40.16 69.13 1000
16 49.80 40.49 59.02 1000
32 50.35 42.51 58.11 1000
64 52.15 42.93 66.98 1000
128 52.04 42.93 59.65 1000
256 51.89 42.68 60.94 1000
512 51.82 42.38 60.71 1000
1024 53.97 42.73 72.85 1000
2048 54.96 45.69 71.26 1000
4096 56.78 48.69 66.27 1000
8192 63.23 54.73 84.31 1000
16384 77.46 65.07 89.25 100
32768 104.21 93.89 117.12 100
65536 168.33 148.43 191.16 100
131072 321.52 283.51 366.13 100
262144 713.00 670.85 759.71 100
524288 1485.87 1429.74 1556.93 100
1048576 3981.97 3750.12 4288.07 100
$ mpirun -n 96 --use-hwthread-cpus --mca pml ucx /home/ec2-user/osu-micro-benchmarks/install/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1100.07 643.62 1575.28 1000
2 1117.94 645.60 1569.19 1000
4 1115.97 666.85 1574.60 1000
8 1105.40 658.64 1532.47 1000
16 1142.66 699.44 1610.27 1000
32 1102.44 773.93 1468.40 1000
64 1085.40 766.52 1442.15 1000
128 1115.23 851.01 1473.89 1000
256 1098.70 839.22 1431.71 1000
512 1104.72 814.51 1441.95 1000
1024 1112.75 857.42 1431.18 1000
2048 1102.28 790.27 1490.52 1000
4096 1126.28 767.91 1545.33 1000
8192 1170.84 749.98 1673.32 1000
16384 1403.39 737.39 1912.97 100
32768 1371.41 728.90 2004.60 100
65536 1659.78 879.67 2435.50 100
131072 1407.15 1239.39 1499.39 100
262144 2490.19 2254.67 2633.81 100
524288 4649.12 4118.11 5064.59 100
1048576 10059.62 8996.23 10887.01 100