TL/UCP: Add Sliding Window allreduce implementation
This PR is the second in a series of two that add Sliding Window allreduce to UCC. It implements the function stubs left by the first PR (https://github.com/openucx/ucc/pull/902).
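For context, here is a loose, self-contained sketch of the data flow the name refers to, not the PR's actual code: the message is split into fragments, each fragment is pulled from every peer into a small set of reusable window buffers, reduced locally, and the reduced result is written back. All sizes, buffers, and helper names (`peer_buf`, `remote_get`, `remote_put`) are invented for illustration; the real implementation presumably issues nonblocking one-sided operations and bounds how many fragments are in flight at once, which is where the sliding window comes in.

```c
/* Illustrative sketch only (not UCC code): a synchronous, single-process model
 * of the get/reduce/put data flow. Remote memory is stood in for by peer_buf,
 * and remote_get/remote_put are plain memcpy stubs. */
#include <stdio.h>
#include <string.h>

#define NUM_PEERS   4   /* assumed team size                    */
#define NUM_FRAGS   8   /* fragments the message is split into  */
#define FRAG_LEN    16  /* elements per fragment                */
#define WINDOW_SIZE 2   /* reusable local window buffers        */

static double peer_buf[NUM_PEERS][NUM_FRAGS][FRAG_LEN]; /* stand-in for remote memory   */
static double window[WINDOW_SIZE][FRAG_LEN];            /* local sliding-window buffers */

static void remote_get(int peer, int frag, double *dst)
{
    memcpy(dst, peer_buf[peer][frag], sizeof(double) * FRAG_LEN);
}

static void remote_put(int peer, int frag, const double *src)
{
    memcpy(peer_buf[peer][frag], src, sizeof(double) * FRAG_LEN);
}

int main(void)
{
    /* every peer contributes the value (rank + 1) for each element */
    for (int p = 0; p < NUM_PEERS; p++)
        for (int f = 0; f < NUM_FRAGS; f++)
            for (int i = 0; i < FRAG_LEN; i++)
                peer_buf[p][f][i] = p + 1;

    for (int frag = 0; frag < NUM_FRAGS; frag++) {
        int    slot = frag % WINDOW_SIZE;   /* window buffers reused round-robin */
        double acc[FRAG_LEN] = {0};

        /* get/reduce: pull this fragment from every peer and accumulate */
        for (int p = 0; p < NUM_PEERS; p++) {
            remote_get(p, frag, window[slot]);
            for (int i = 0; i < FRAG_LEN; i++)
                acc[i] += window[slot][i];
        }
        /* put: write the reduced fragment back to every peer */
        for (int p = 0; p < NUM_PEERS; p++)
            remote_put(p, frag, acc);
    }

    printf("element 0 of fragment 0 = %g (expected %d)\n",
           peer_buf[0][0][0], 1 + 2 + 3 + 4);
    return 0;
}
```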
Can one of the admins verify this patch?
ok to test
@Sergei-Lebedev It seems like the same ucc test is failing as in the active_set PR. However, this one fails because of a CUDA version mismatch: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown
@artemry-nv Seems like we have CI issues on this PR that are unrelated to the change itself.
@B-a-S please take a look
I've rerun the job. Take a look: http://hpc-master.lab.mtl.com:8080/job/ucc/3282/
The ucc gtest failed on:
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[2024-05-16T18:13:52.173Z] [swx-clx01:196 :0:196] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
On this PR, I logged into the CI node and opened a shell inside the container that was running this gtest. When I ran it myself, it passed:
swx-jenkins@swx-clx01:/opt/nvidia/src/ucc/build/test/gtest$ stdbuf -e0 -o0 /opt/nvidia/src/ucc/build/test/gtest/gtest --gtest_filter=*test_tl_mlx5_dm.MemcpyToDeviceMemory*
Note: Google Test filter = *test_tl_mlx5_dm.MemcpyToDeviceMemory*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from test_tl_mlx5_dm
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0
[ SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0 (63 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1 (36 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2
[ SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2 (35 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3 (35 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4 (38 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5
[ SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5 (37 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6 (48 ms)
[ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7
[ OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7 (59 ms)
[----------] 8 tests from test_tl_mlx5_dm (351 ms total)
[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (351 ms total)
[ PASSED ] 5 tests.
[ SKIPPED ] 3 tests.
I updated the PR. The get/reduce/put phase and the barrier part of the algorithm now run via a schedule. I left the allgather phase as it was, inside the get/reduce/put phase, because once Ferrol's PR goes in I will remove the allgather and use that instead for key exchange.
Also, I moved some code to two new files, tl_ucp_dpu_offload.c and tl_ucp_dpu_offload.h. These contain common code that will be used for the rest of the collectives we're planning to implement.
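To make the schedule-based structure described above concrete, here is a minimal, hypothetical model: each phase becomes a task, and the barrier task only runs once the get/reduce/put task has completed. The `task_t` type and the progress loop are invented for this sketch and are not UCC's schedule API (in UCC the equivalent plumbing lives around ucc_schedule/ucc_coll_task, which this sketch does not use).

```c
/* Hypothetical illustration of a two-task schedule with a dependency:
 * the barrier depends on the get/reduce/put phase having completed. */
#include <stdio.h>
#include <stddef.h>

typedef struct task {
    const char  *name;
    int          done;
    struct task *dep;                 /* task that must finish first, if any */
    void (*run)(struct task *t);
} task_t;

static void run_get_reduce_put(task_t *t) { printf("phase: %s\n", t->name); t->done = 1; }
static void run_barrier(task_t *t)        { printf("phase: %s\n", t->name); t->done = 1; }

int main(void)
{
    task_t grp     = { "get/reduce/put", 0, NULL, run_get_reduce_put };
    task_t barrier = { "barrier",        0, &grp, run_barrier        };
    task_t *schedule[] = { &grp, &barrier };

    /* naive progress loop: run a task once its dependency has completed */
    for (size_t i = 0; i < sizeof(schedule) / sizeof(schedule[0]); i++) {
        task_t *t = schedule[i];
        if (t->dep && !t->dep->done)
            continue;                 /* a real progress engine would re-poll */
        t->run(t);
    }
    return 0;
}
```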
Please wait to review; there are some failures I should fix first.
bot:retest
@samnordmann The PR is ready for review, thank you. Please note that the allgather task is still part of the algorithm; once Ferrol's PR goes in, I will convert the code to use that for the import/allgather phase. I also left the reduction as part of the algorithm, since splitting it out would mean splitting the gets and puts into tasks of their own as well, and there would be thousands of these tasks for large message sizes.
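As a rough, back-of-the-envelope illustration of the task-count concern (all numbers below are assumed for the example, not taken from the PR):

```c
/* Illustrative only: if every get and put were its own schedule task, the
 * task count would grow with (message size / fragment size) * peers. */
#include <stdio.h>

int main(void)
{
    const long long msg_bytes  = 1LL << 30;   /* assumed 1 GiB allreduce buffer */
    const long long frag_bytes = 64LL << 10;  /* assumed 64 KiB fragment size   */
    const long long num_peers  = 8;           /* assumed team size              */

    long long frags = msg_bytes / frag_bytes;     /* fragments per peer          */
    long long tasks = frags * num_peers * 2;      /* one get task + one put task */
    printf("%lld fragments/peer -> %lld get/put tasks\n", frags, tasks);
    return 0;
}
```

With these assumed numbers, a 1 GiB allreduce over 8 peers would turn into 16384 fragments per peer and 262144 get/put tasks, which is why the reduction stays fused with the gets and puts for now.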
@Sergei-Lebedev When you say follow-up changes, do you mean a separate PR after this one is merged? If so, I'll open an issue and tag this PR so we don't forget.
@Sergei-Lebedev @samnordmann The PR is ready to be merged