ompi
Offload reduction operations to accelerator devices
This PR is an attempt to offload reduction operations in MPI_Allreduce to accelerator devices if the input buffer is located on a device.
A few notes:
- There is a heuristic to determine when to launch a kernel for the reduction and when to pull the data to the host and perform the reduction there. This has not been well tested, and the parameters probably need to be determined at startup (a rough sketch of this dispatch, together with the stream handling, is shown after this list).
- We need to pass streams through the call hierarchy so that copies and kernel launches can be stream-ordered. This requires changes to the operation API.
- Data movement on the device is expensive, so I adjusted the algorithms to use the 3buff variants of the ops whenever possible instead of moving data explicitly (see the second sketch below). That turned out to be beneficial on the host as well, especially for larger reductions.
- This is still WIP, but I'm hoping it can serve as a starting point for others working on device integration, as I have currently run out of time. I will try to rebase and fix conflicts soon.
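For illustration, here is a minimal sketch of the kind of dispatch the first two notes describe: pick between a device kernel and a host-staged reduction based on a size threshold, with all device work ordered on the caller's stream. The names (`reduce_sum_kernel`, `device_reduce_threshold`) and the threshold value are illustrative assumptions, not the API or heuristic actually introduced by this PR.

```cuda
/* Hypothetical sketch (not the PR's code): dispatch a SUM reduction either
 * to a device kernel or to a host-staged fallback, with all device work
 * ordered on the caller's stream. */
#include <cuda_runtime.h>
#include <stdlib.h>

/* crossover point; would ideally be measured/tuned at startup */
static size_t device_reduce_threshold = 4096;

__global__ void reduce_sum_kernel(const float *in, float *inout, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) inout[i] += in[i];
}

void reduce_sum(const float *in, float *inout, size_t n,
                int buffers_on_device, cudaStream_t stream)
{
    if (buffers_on_device && n >= device_reduce_threshold) {
        /* large message: reduce on the device, stream-ordered */
        int threads = 256;
        int blocks  = (int)((n + threads - 1) / threads);
        reduce_sum_kernel<<<blocks, threads, 0, stream>>>(in, inout, n);
    } else if (buffers_on_device) {
        /* small message: stage to the host, reduce there, copy the result back */
        float *h_in    = (float *)malloc(n * sizeof(float));
        float *h_inout = (float *)malloc(n * sizeof(float));
        cudaMemcpyAsync(h_in, in, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaMemcpyAsync(h_inout, inout, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        for (size_t i = 0; i < n; i++) h_inout[i] += h_in[i];
        cudaMemcpyAsync(inout, h_inout, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
        free(h_in);
        free(h_inout);
    } else {
        /* host buffers: plain host reduction */
        for (size_t i = 0; i < n; i++) inout[i] += in[i];
    }
}
```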
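And a rough illustration of the 3buff point: with the conventional two-buffer form (inout = inout op in), an algorithm that needs the result in a third buffer has to copy one operand into it first, which on a GPU means an extra device-to-device copy; the three-buffer form writes the result directly. The function names below are illustrative, not the actual op interface.

```cuda
#include <stddef.h>

/* two-buffer form: inout = inout + in; the destination must already hold
 * one of the operands, so a separate copy may be needed beforehand */
void op_sum_2buff(const float *in, float *inout, size_t n)
{
    for (size_t i = 0; i < n; i++) inout[i] += in[i];
}

/* three-buffer form: out = in1 + in2; the copy is fused into the reduction,
 * so no explicit data movement into the destination buffer is required */
void op_sum_3buff(const float *in1, const float *in2, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = in1[i] + in2[i];
}
```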
@devreal thank you for all of this work! Can I make a suggestion? This is a massive PR as it stands. Could we try to break it down into multiple smaller, more manageable pieces? E.g.
- a pr that contains the changes to the accelerator framework components
- a pr that contains the changes to the op framework
- ...
- the last one probably being the changes required to pull everything together and use the code
Ideally, even if a new feature in one of the components is not used initially, it can be reviewed and resolved independently, and if we do it right it shouldn't cause any issues as long as it's not used. I am more than happy to help with that process if you want.
@devreal Can you share how you configured the build? It seems that the C++ dependency is wrong when I build it.
@edgargabriel I agree, this should be split up. I will start with the accelerator framework.
@devreal We built with libfabric and collected some performance data from osu-micro-benchmarks on GPU instances (p4d.24xlarge).
- osu_reduce latency on a single node with 96 processes per node
$ mpirun -np 96 --use-hwthread-cpus --mca pml ob1 -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 50.89 18.27 89.23 1000
2 50.20 18.24 88.37 1000
4 49.70 17.48 105.89 1000
8 46.01 16.58 91.06 1000
16 44.55 16.75 79.57 1000
32 45.75 16.70 81.72 1000
64 45.17 16.56 81.88 1000
128 45.51 16.62 81.47 1000
256 45.46 16.61 81.15 1000
512 46.62 17.25 83.92 1000
1024 46.39 17.17 82.71 1000
2048 48.14 18.45 87.07 1000
4096 59.19 23.05 97.78 1000
8192 72.53 26.28 133.33 1000
16384 91.32 33.93 265.90 100
32768 115.79 42.75 200.79 100
65536 183.01 86.72 319.37 100
131072 337.60 153.89 597.25 100
262144 653.80 328.93 1140.21 100
524288 1413.32 883.30 2473.55 100
1048576 4315.48 1318.93 8312.18 100
$ mpirun -np 96 --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 52.15 23.16 92.76 1000
2 52.96 22.55 96.52 1000
4 51.36 21.01 103.20 1000
8 47.36 20.21 85.74 1000
16 45.91 20.14 78.97 1000
32 45.37 20.05 78.47 1000
64 45.63 17.62 79.62 1000
128 46.00 17.89 81.53 1000
256 46.86 18.01 84.94 1000
512 46.27 18.25 82.34 1000
1024 54.11 18.39 113.24 1000
2048 48.34 18.96 87.42 1000
4096 50.72 19.41 92.88 1000
8192 68.85 42.03 119.58 1000
16384 81.37 48.91 145.23 100
32768 114.14 65.15 215.77 100
65536 182.49 109.55 336.56 100
131072 331.24 182.98 600.06 100
262144 652.47 374.06 1201.86 100
524288 1373.64 862.05 2473.60 100
1048576 3890.68 1835.44 7693.43 100
- osu_reduce latency on 2 nodes with 96 processes per node
$ mpirun -np 192 --hostfile /home/ec2-user/PortaFiducia/hostfile --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 53.34 24.80 158.96 1000
2 52.71 23.64 127.55 1000
4 52.65 23.31 125.60 1000
8 52.79 22.93 128.73 1000
16 52.97 23.85 129.14 1000
32 53.07 23.99 131.40 1000
64 53.17 23.27 131.36 1000
128 53.91 23.22 160.13 1000
256 53.96 23.56 131.35 1000
512 54.57 23.78 136.56 1000
1024 55.48 24.05 134.61 1000
2048 58.17 25.93 167.34 1000
4096 60.06 28.02 144.41 1000
8192 80.41 53.82 175.23 1000
16384 96.45 64.75 188.37 100
32768 130.03 81.92 257.46 100
65536 216.15 137.95 670.17 100
131072 458.93 312.95 1157.73 100
262144 1099.14 807.35 2219.35 100
524288 3153.31 2125.70 4677.10 100
1048576 7281.71 6291.95 8827.05 100
- osu_allreduce latency on a single node with 96 processes per node
$ mpirun -np 96 --use-hwthread-cpus --mca pml ob1 -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1901.97 1895.70 1909.23 1000
2 1866.46 1856.51 1890.17 1000
4 1849.15 1846.66 1864.29 1000
8 1865.48 1857.94 1869.82 1000
16 1889.69 1887.12 1892.54 1000
32 1887.29 1874.57 1892.87 1000
64 1866.39 1862.39 1878.10 1000
128 1902.69 1881.43 1950.76 1000
256 6846.33 6153.07 7516.33 1000
512 4960.26 4586.52 5442.11 1000
1024 1869.71 1867.50 1879.32 1000
2048 1860.47 1850.71 1869.47 1000
4096 1920.09 1914.88 1925.71 1000
8192 1946.90 1940.31 1954.75 1000
16384 1954.01 1941.61 1966.76 100
32768 2169.78 2095.39 2231.91 100
65536 2208.15 2149.11 2267.01 100
131072 1294.21 1152.11 1401.04 100
262144 2070.86 1986.83 2133.47 100
524288 4630.58 4475.43 4727.98 100
1048576 9268.92 8822.77 9553.14 100
$ mpirun -np 96 --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1888.06 1874.91 1902.12 1000
2 1870.06 1867.39 1881.06 1000
4 1886.02 1881.17 1890.68 1000
8 1885.91 1878.96 1909.19 1000
16 1873.81 1852.46 1890.93 1000
32 1866.67 1857.62 1887.08 1000
64 1854.75 1848.20 1867.71 1000
128 1856.85 1851.18 1865.18 1000
256 1886.99 1879.35 1894.61 1000
512 1879.01 1875.44 1888.82 1000
1024 1879.15 1867.80 1897.18 1000
2048 1886.56 1876.13 1905.26 1000
4096 1891.85 1885.14 1909.16 1000
8192 1964.46 1946.55 1980.99 1000
16384 1996.35 1977.00 2010.00 100
32768 2081.54 2046.57 2107.62 100
65536 2267.33 2206.32 2321.61 100
131072 1600.58 1421.01 1698.15 100
262144 2057.70 1933.97 2141.18 100
524288 4341.36 4102.91 4491.09 100
1048576 9190.83 8568.45 9572.64 100
- osu_allreduce latency on 2 nodes with 96 processes per node
$ mpirun -np 192 --hostfile /home/ec2-user/PortaFiducia/hostfile --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1918.58 1898.68 1950.81 1000
2 1923.21 1902.49 1956.78 1000
4 1919.62 1898.32 1954.29 1000
8 1966.44 1945.62 1986.83 1000
16 1932.10 1911.17 1967.35 1000
32 1905.98 1885.25 1945.11 1000
64 1938.28 1916.89 1980.00 1000
128 1922.34 1898.96 1957.66 1000
256 1923.65 1900.70 1959.54 1000
512 2026.75 2003.67 2056.36 1000
1024 1939.01 1916.19 1975.81 1000
2048 1931.89 1907.90 1965.60 1000
4096 1944.01 1919.52 1970.58 1000
8192 1978.66 1956.17 2020.18 1000
16384 1972.88 1946.22 1994.44 100
32768 2315.21 2258.95 2487.19 100
65536 2493.70 2352.68 2578.54 100
131072 1861.39 1687.25 1978.94 100
262144 3510.34 3260.06 3757.74 100
524288 6721.11 6278.76 6963.19 100
1048576 14090.28 13764.27 14250.96 100
We found that ireduce and iallreduce segfault with:
[ip-172-31-26-243.ec2.internal:06486] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.ip-172-31-26-243.1000/jf.0/1369899008/shared_mem_cuda_pool.ip-172-31-26-243 could be created.
[ip-172-31-26-243.ec2.internal:06486] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
ERROR: No suitable module for op MPI_SUM on type MPI_CHAR found for device memory!
On a single node with UCX:
$ mpirun -np 96 --use-hwthread-cpus --mca pml ucx /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f
# OSU MPI-CUDA Reduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 49.54 41.59 57.18 1000
2 49.45 41.00 58.08 1000
4 49.67 41.47 57.88 1000
8 50.98 40.16 69.13 1000
16 49.80 40.49 59.02 1000
32 50.35 42.51 58.11 1000
64 52.15 42.93 66.98 1000
128 52.04 42.93 59.65 1000
256 51.89 42.68 60.94 1000
512 51.82 42.38 60.71 1000
1024 53.97 42.73 72.85 1000
2048 54.96 45.69 71.26 1000
4096 56.78 48.69 66.27 1000
8192 63.23 54.73 84.31 1000
16384 77.46 65.07 89.25 100
32768 104.21 93.89 117.12 100
65536 168.33 148.43 191.16 100
131072 321.52 283.51 366.13 100
262144 713.00 670.85 759.71 100
524288 1485.87 1429.74 1556.93 100
1048576 3981.97 3750.12 4288.07 100
$ mpirun -n 96 --use-hwthread-cpus --mca pml ucx /home/ec2-user/osu-micro-benchmarks/install/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f
# OSU MPI-CUDA Allreduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 1100.07 643.62 1575.28 1000
2 1117.94 645.60 1569.19 1000
4 1115.97 666.85 1574.60 1000
8 1105.40 658.64 1532.47 1000
16 1142.66 699.44 1610.27 1000
32 1102.44 773.93 1468.40 1000
64 1085.40 766.52 1442.15 1000
128 1115.23 851.01 1473.89 1000
256 1098.70 839.22 1431.71 1000
512 1104.72 814.51 1441.95 1000
1024 1112.75 857.42 1431.18 1000
2048 1102.28 790.27 1490.52 1000
4096 1126.28 767.91 1545.33 1000
8192 1170.84 749.98 1673.32 1000
16384 1403.39 737.39 1912.97 100
32768 1371.41 728.90 2004.60 100
65536 1659.78 879.67 2435.50 100
131072 1407.15 1239.39 1499.39 100
262144 2490.19 2254.67 2633.81 100
524288 4649.12 4118.11 5064.59 100
1048576 10059.62 8996.23 10887.01 100