
TL/UCP: Add Sliding Window allreduce implementation

Open · nsarka opened this pull request 10 months ago · 7 comments

This PR is the second in a series of two that add Sliding Window allreduce to UCC. It implements the function stubs left by the first PR (https://github.com/openucx/ucc/pull/902).
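
Roughly, the idea is to pipeline the reduction over fixed-size windows of the buffer so that only a bounded amount of data is in flight at once: each window is fetched from the peers (get), reduced locally, and written back out (put). The sketch below is a minimal single-process illustration of that loop; the sizes and names (NRANKS, COUNT, WINDOW) are made up for the example, and the real implementation drives UCP one-sided gets/puts and runs as UCC tasks.

/* Minimal single-process sketch of the sliding-window get/reduce/put loop.
 * All sizes and names here are illustrative, not the PR's actual code. */
#include <stdio.h>
#include <string.h>

#define NRANKS 4
#define COUNT  16  /* elements per rank */
#define WINDOW 4   /* elements in flight per pipeline step */

int main(void) {
    double src[NRANKS][COUNT], dst[COUNT];

    for (int r = 0; r < NRANKS; r++)
        for (int i = 0; i < COUNT; i++)
            src[r][i] = r + 1; /* rank r contributes (r + 1) everywhere */

    /* slide a window over the buffer: emulate a "get" from each peer,
     * reduce into an accumulator, then "put" the reduced chunk */
    for (int off = 0; off < COUNT; off += WINDOW) {
        double acc[WINDOW];
        memcpy(acc, &src[0][off], sizeof(acc));   /* local contribution */
        for (int peer = 1; peer < NRANKS; peer++) /* "get" + reduce phase */
            for (int i = 0; i < WINDOW; i++)
                acc[i] += src[peer][off + i];
        memcpy(&dst[off], acc, sizeof(acc));      /* "put" phase */
    }

    printf("dst[0] = %g (expected %d)\n", dst[0], NRANKS * (NRANKS + 1) / 2);
    return 0;
}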

nsarka avatar Apr 10 '24 14:04 nsarka

Can one of the admins verify this patch?

swx-jenkins3 avatar Apr 10 '24 15:04 swx-jenkins3

ok to test

Sergei-Lebedev avatar Apr 11 '24 17:04 Sergei-Lebedev

@Sergei-Lebedev It seems the same ucc test is failing here as in the active_set PR. However, this one fails because of a CUDA version mismatch:

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown
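
For reference, a quick way to confirm the mismatch from inside the container is to compare the CUDA version the driver supports against the runtime the container ships. This is just an illustrative diagnostic using the standard CUDA runtime API, not part of the PR:

/* Illustrative diagnostic only: prints the CUDA version supported by the
 * installed driver vs. the runtime this binary was built against.
 * Build with nvcc, or with gcc plus the CUDA include/lib paths and -lcudart. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int drv = 0, rt = 0;
    cudaDriverGetVersion(&drv);  /* highest CUDA version the driver supports */
    cudaRuntimeGetVersion(&rt);  /* CUDA runtime version linked in */
    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           drv / 1000, (drv % 1000) / 10, rt / 1000, (rt % 1000) / 10);
    return 0;
}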

nsarka avatar May 08 '24 19:05 nsarka

@artemry-nv Seems like we have CI issues on this PR, unrelated to the change itself.

janjust avatar May 09 '24 17:05 janjust

@B-a-S please take a look

artemry-nv avatar May 09 '24 17:05 artemry-nv

> @B-a-S please take a look

I've rerun the job. Take a look: http://hpc-master.lab.mtl.com:8080/job/ucc/3282/

B-a-S avatar May 10 '24 01:05 B-a-S

The ucc gtest failed on:

[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[2024-05-16T18:13:52.173Z] [swx-clx01:196  :0:196] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

nsarka avatar May 16 '24 18:05 nsarka

On this PR, I logged into the CI node and opened a shell inside the container that ran this gtest. When I ran the test myself, it passed:

swx-jenkins@swx-clx01:/opt/nvidia/src/ucc/build/test/gtest$ stdbuf -e0 -o0 /opt/nvidia/src/ucc/build/test/gtest/gtest --gtest_filter=*test_tl_mlx5_dm.MemcpyToDeviceMemory*
Note: Google Test filter = *test_tl_mlx5_dm.MemcpyToDeviceMemory*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from test_tl_mlx5_dm
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0
[  SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0 (63 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1 (36 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2
[  SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2 (35 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3 (35 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4 (38 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5
[  SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5 (37 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6 (48 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7 (59 ms)
[----------] 8 tests from test_tl_mlx5_dm (351 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (351 ms total)
[  PASSED  ] 5 tests.
[  SKIPPED ] 3 tests.

nsarka avatar May 23 '24 18:05 nsarka

I updated the PR. The get/reduce/put phase and the barrier part of the algorithm now run via a schedule. I left the allgather phase inside the get/reduce/put phase for now: once Ferrol's PR goes in, I will remove the allgather and use that PR's mechanism for key exchange instead.
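
To illustrate the structure (not the actual tl_ucp code), the schedule now looks roughly like a chain of dependent tasks driven by a progress loop: the barrier task starts only after the get/reduce/put task completes. The toy task type and event loop below are stand-ins for UCC's internal schedule machinery (e.g. ucc_schedule_init / ucc_schedule_add_task):

/* Self-contained sketch of two dependent tasks driven by a progress loop.
 * Names and the toy event loop are illustrative assumptions. */
#include <stdio.h>
#include <stdbool.h>

typedef struct task {
    const char  *name;
    int          steps_left;    /* fake work: progress() calls until done */
    struct task *depends_on;    /* start only after this task completes */
} task_t;

static bool task_done(const task_t *t) { return t->steps_left == 0; }

static void task_progress(task_t *t) {
    if (t->depends_on && !task_done(t->depends_on))
        return;                 /* dependency not satisfied yet */
    if (!task_done(t)) {
        t->steps_left--;
        if (task_done(t))
            printf("%s completed\n", t->name);
    }
}

int main(void) {
    task_t grp      = { "get/reduce/put", 3, NULL };
    task_t barrier  = { "barrier",        1, &grp };
    task_t *sched[] = { &grp, &barrier }; /* the "schedule" */

    /* progress loop, as a UCC worker would drive posted tasks */
    while (!task_done(&grp) || !task_done(&barrier))
        for (int i = 0; i < 2; i++)
            task_progress(sched[i]);
    return 0;
}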

Also, I moved some code into two new files, tl_ucp_dpu_offload.c and tl_ucp_dpu_offload.h. These hold common code that will be reused by the rest of the collectives we plan to implement.

nsarka avatar May 29 '24 16:05 nsarka

Please wait to review, there are some failures I should fix first.

nsarka avatar May 29 '24 16:05 nsarka

bot:retest

artemry-nv avatar May 30 '24 00:05 artemry-nv

@samnordmann The PR is ready for review, thank you. Please note that the allgather task is still part of the algorithm; once Ferrol's PR goes in, I will convert the code to use it for the import/allgather phase. I also left the reduction as part of the algorithm, since splitting it out would mean turning the gets and puts into tasks of their own as well, and there would be thousands of those tasks for large message sizes.
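
To put a rough number on the task-count concern, here is the arithmetic under assumed sizes; the 1 GiB message and 64 KiB fragments are illustrative values, not the PR's actual tuning:

/* Illustrative arithmetic only (sizes are assumptions, not the PR's tuning):
 * one task per get and per put at a 64 KiB fragment size turns a 1 GiB
 * allreduce into tens of thousands of tasks. */
#include <stdio.h>

int main(void) {
    const long long msg    = 1LL << 30;  /* 1 GiB message, assumed */
    const long long frag   = 64LL << 10; /* 64 KiB fragment, assumed */
    const long long chunks = msg / frag; /* fragments per phase */
    printf("%lld get tasks + %lld put tasks = %lld tasks\n",
           chunks, chunks, 2 * chunks);  /* 16384 + 16384 = 32768 */
    return 0;
}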

nsarka avatar May 31 '24 21:05 nsarka

@Sergei-Lebedev When you say follow-up changes, do you mean a separate PR after this one is merged? If so, I'll open an issue and tag this PR so we don't forget.

janjust avatar Jun 10 '24 13:06 janjust

@Sergei-Lebedev @samnordmann The PR is ready to be merged

nsarka avatar Jun 10 '24 18:06 nsarka