TL/UCP: Add Sliding Window allreduce implementation
This PR is the second in a series of two that add Sliding Window allreduce to UCC. It implements the function stubs left by the first PR (https://github.com/openucx/ucc/pull/902).
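For context, here is a loose, self-contained sketch of the data flow the name refers to, not the PR's actual code: the message is split into fragments, each fragment is pulled from every peer into a small set of reusable window buffers, reduced locally, and the reduced result is written back. All sizes, buffers, and helper names (`peer_buf`, `remote_get`, `remote_put`) are invented for illustration; the real implementation presumably issues nonblocking one-sided operations and bounds how many fragments are in flight at once, which is where the sliding window comes in.

```c
/* Illustrative sketch only (not UCC code): a synchronous, single-process model
 * of the get/reduce/put data flow. Remote memory is stood in for by peer_buf,
 * and remote_get/remote_put are plain memcpy stubs. */
#include <stdio.h>
#include <string.h>

#define NUM_PEERS   4   /* assumed team size                    */
#define NUM_FRAGS   8   /* fragments the message is split into  */
#define FRAG_LEN    16  /* elements per fragment                */
#define WINDOW_SIZE 2   /* reusable local window buffers        */

static double peer_buf[NUM_PEERS][NUM_FRAGS][FRAG_LEN]; /* stand-in for remote memory   */
static double window[WINDOW_SIZE][FRAG_LEN];            /* local sliding-window buffers */

static void remote_get(int peer, int frag, double *dst)
{
    memcpy(dst, peer_buf[peer][frag], sizeof(double) * FRAG_LEN);
}

static void remote_put(int peer, int frag, const double *src)
{
    memcpy(peer_buf[peer][frag], src, sizeof(double) * FRAG_LEN);
}

int main(void)
{
    /* every peer contributes the value (rank + 1) for each element */
    for (int p = 0; p < NUM_PEERS; p++)
        for (int f = 0; f < NUM_FRAGS; f++)
            for (int i = 0; i < FRAG_LEN; i++)
                peer_buf[p][f][i] = p + 1;

    for (int frag = 0; frag < NUM_FRAGS; frag++) {
        int    slot = frag % WINDOW_SIZE;   /* window buffers reused round-robin */
        double acc[FRAG_LEN] = {0};

        /* get/reduce: pull this fragment from every peer and accumulate */
        for (int p = 0; p < NUM_PEERS; p++) {
            remote_get(p, frag, window[slot]);
            for (int i = 0; i < FRAG_LEN; i++)
                acc[i] += window[slot][i];
        }
        /* put: write the reduced fragment back to every peer */
        for (int p = 0; p < NUM_PEERS; p++)
            remote_put(p, frag, acc);
    }

    printf("element 0 of fragment 0 = %g (expected %d)\n",
           peer_buf[0][0][0], 1 + 2 + 3 + 4);
    return 0;
}
```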
Can one of the admins verify this patch?
ok to test
@Sergei-Lebedev It seems like the same ucc test is failing as in the active_set PR. However, this one fails because of a CUDA version mismatch: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown
@artemry-nv Seems like we have CI issues on this PR that are unrelated to the change itself.
@B-a-S please take a look
I've rerun the job. Take a look: http://hpc-master.lab.mtl.com:8080/job/ucc/3282/
The ucc gtest failed on:
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[2024-05-16T18:13:52.173Z] [swx-clx01:196 :0:196] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
On this PR, I logged into the CI node and opened a shell inside the container that was running this gtest. When I ran it myself, it passed:
swx-jenkins@swx-clx01:/opt/nvidia/src/ucc/build/test/gtest$ stdbuf -e0 -o0 /opt/nvidia/src/ucc/build/test/gtest/gtest --gtest_filter=*test_tl_mlx5_dm.MemcpyToDeviceMemory*
Note: Google Test filter = *test_tl_mlx5_dm.MemcpyToDeviceMemory*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from test_tl_mlx5_dm
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0
[ SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0 (63 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1 (36 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2
[ SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2 (35 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3 (35 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4 (38 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5
[ SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5 (37 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6 (48 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7 (59 ms)
[----------] 8 tests from test_tl_mlx5_dm (351 ms total)
[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (351 ms total)
[ PASSED ] 5 tests.
[ SKIPPED ] 3 tests.
I updated the PR. The get/reduce/put phase and the barrier part of the algorithm now run via a schedule. I left the allgather phase as it was, inside the get/reduce/put phase, because once Ferrol's PR goes in I will remove the allgather and use that instead for key exchange.
Also, I moved some code to two new files, tl_ucp_dpu_offload.c and tl_ucp_dpu_offload.h. These contain common code that will be used for the rest of the collectives we're planning to implement.
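To make the schedule-based structure described above concrete, here is a minimal, hypothetical model: each phase becomes a task, and the barrier task only runs once the get/reduce/put task has completed. The `task_t` type and the progress loop are invented for this sketch and are not UCC's schedule API (in UCC the equivalent plumbing lives around ucc_schedule/ucc_coll_task, which this sketch does not use).

```c
/* Hypothetical illustration of a two-task schedule with a dependency:
 * the barrier depends on the get/reduce/put phase having completed. */
#include <stdio.h>
#include <stddef.h>

typedef struct task {
    const char  *name;
    int          done;
    struct task *dep;                 /* task that must finish first, if any */
    void (*run)(struct task *t);
} task_t;

static void run_get_reduce_put(task_t *t) { printf("phase: %s\n", t->name); t->done = 1; }
static void run_barrier(task_t *t)        { printf("phase: %s\n", t->name); t->done = 1; }

int main(void)
{
    task_t grp     = { "get/reduce/put", 0, NULL, run_get_reduce_put };
    task_t barrier = { "barrier",        0, &grp, run_barrier        };
    task_t *schedule[] = { &grp, &barrier };

    /* naive progress loop: run a task once its dependency has completed */
    for (size_t i = 0; i < sizeof(schedule) / sizeof(schedule[0]); i++) {
        task_t *t = schedule[i];
        if (t->dep && !t->dep->done)
            continue;                 /* a real progress engine would re-poll */
        t->run(t);
    }
    return 0;
}
```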
Please wait to review; there are some failures I should fix first.
bot:retest
@samnordmann The PR is ready for review, thank you. Please note that the allgather task is still part of the algorithm; once Ferrol's PR goes in, I will convert the code to use that for the import/allgather phase. I also left the reduction as part of the algorithm, since splitting it out would mean splitting the gets and puts into tasks of their own as well, and there would be thousands of these tasks for large message sizes.
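As a rough, back-of-the-envelope illustration of the task-count concern (all numbers below are assumed for the example, not taken from the PR):

```c
/* Illustrative only: if every get and put were its own schedule task, the
 * task count would grow with (message size / fragment size) * peers. */
#include <stdio.h>

int main(void)
{
    const long long msg_bytes  = 1LL << 30;   /* assumed 1 GiB allreduce buffer */
    const long long frag_bytes = 64LL << 10;  /* assumed 64 KiB fragment size   */
    const long long num_peers  = 8;           /* assumed team size              */

    long long frags = msg_bytes / frag_bytes;     /* fragments per peer          */
    long long tasks = frags * num_peers * 2;      /* one get task + one put task */
    printf("%lld fragments/peer -> %lld get/put tasks\n", frags, tasks);
    return 0;
}
```

With these assumed numbers, a 1 GiB allreduce over 8 peers would turn into 16384 fragments per peer and 262144 get/put tasks, which is why the reduction stays fused with the gets and puts for now.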
@Sergei-Lebedev When you say follow-up changes, do you mean a separate PR after this one is merged? If so, I'll open an issue and tag this PR so we don't forget.
@Sergei-Lebedev @samnordmann The PR is ready to be merged