TL/UCP: Add support for active_set broadcast with knomial and dbt for size greater than 2
Active set is a sort of lightweight subcommunicator for TL/UCP and TL/NCCL. It was originally added as a hack to enable point-to-point communication. This PR:
- Removes the restriction that the active set comm size must be 2
- Fixes a bug where the user tag was not being properly read
- Adds support for active set double binary tree broadcast in TL/UCP
- Updates the active set gtest to more thoroughly test all of the above

A hedged usage sketch of the active set API is shown below the list.
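For context, here is a minimal sketch of posting a broadcast over an active set larger than 2. The field names (coll_args.active_set with start/stride/size and the UCC_COLL_ARGS_FIELD_ACTIVE_SET mask bit) follow ucc.h, but double-check them against your UCC version; this is an illustration, not code taken from this PR.

```c
#include <ucc/api/ucc.h>

/* Minimal sketch: broadcast over an active set of 4 consecutive ranks.
 * Field names follow ucc.h; verify against your UCC version, this is an
 * illustration rather than code from this PR. */
ucc_status_t bcast_active_set(ucc_team_h team, ucc_context_h ctx,
                              void *buf, size_t count, uint64_t root)
{
    ucc_coll_args_t args = {0};
    ucc_coll_req_h  req;
    ucc_status_t    st;

    args.mask              = UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args.coll_type         = UCC_COLL_TYPE_BCAST;
    args.root              = root;               /* rank within the ucc_team */
    args.src.info.buffer   = buf;
    args.src.info.count    = count;
    args.src.info.datatype = UCC_DT_FLOAT32;
    args.src.info.mem_type = UCC_MEMORY_TYPE_HOST;
    /* Active set = ranks start, start+stride, ..., start+(size-1)*stride.
     * With this PR, size is no longer restricted to 2. */
    args.active_set.start  = 0;
    args.active_set.stride = 1;
    args.active_set.size   = 4;

    st = ucc_collective_init(&args, &req, team);
    if (st != UCC_OK) {
        return st;
    }
    st = ucc_collective_post(req);
    if (st == UCC_OK) {
        while ((st = ucc_collective_test(req)) == UCC_INPROGRESS) {
            ucc_context_progress(ctx);
        }
    }
    ucc_collective_finalize(req);
    return st;
}
```

Which TL/UCP algorithm (knomial or dbt) actually runs would typically be selected through the UCC_TL_UCP_TUNE environment variable (for example UCC_TL_UCP_TUNE=bcast:@dbt), though the exact algorithm names and syntax depend on the build.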
Can one of the admins verify this patch?
ok to test
@nsarka What is the use case for enabling this? If removing the restriction - why only limit to broadcast?
This was requested by Almog from the cuBLAS team. I'm not sure the details after that. @Sergei-Lebedev do you know what Almog wanted with this?
The use case here is to be able to perform a "broadcast" on row/col comms without having to create those comms. For linear algebra functionality, when you use a 2DBC data layout, using row/col communicators simplifies the code and improves performance. The problem is that NCCL comm creation is very expensive. This way, we can use the active_set broadcast and the stride to perform a "row/col bcast" without paying the comm-split price, while also enjoying the ncclGroupStart/End functionality that is not exposed through UCC.
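To make the stride idea concrete, here is a hedged sketch assuming (for illustration only) a row-major P x Q process grid where rank = row * Q + col; the active_set start/stride/size fields then select one row or one column of the grid without ever creating a row/col sub-communicator.

```c
#include <ucc/api/ucc.h>

/* Hedged sketch of the row/col use case: a row-major P x Q grid with
 * rank = row * Q + col is assumed for illustration. */
static void set_row_bcast(ucc_coll_args_t *args, uint64_t row, uint64_t Q)
{
    args->mask             |= UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args->active_set.start  = row * Q; /* first rank of the row */
    args->active_set.stride = 1;       /* consecutive ranks     */
    args->active_set.size   = Q;       /* one rank per column   */
}

static void set_col_bcast(ucc_coll_args_t *args, uint64_t col,
                          uint64_t P, uint64_t Q)
{
    args->mask             |= UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args->active_set.start  = col;     /* first rank of the column */
    args->active_set.stride = Q;       /* jump a full row each hop */
    args->active_set.size   = P;       /* one rank per row         */
}
```

In both cases args->root would still name the root by its rank in the full ucc_team, which is the interpretation the root-handling fix further down settles on.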
@manjugv @Sergei-Lebedev @samnordmann Tested on @almogsegal's container - works as intended
bot:retest
@nsarka can you please check why gtest failed in CI?
@Sergei-Lebedev It seems the gtest passed. However, the UCC test failed with "Cancelling nested steps due to timeout": it ran for 2 hours, passed building and codestyle, and then hung for 4 hours in the UCC / Torch-UCC test until the 6-hour timeout.
@Sergei-Lebedev, it seems we're facing the same issue (something with the containers) on several PRs.
@janjust I think we see two different problems. Here it's
[2024-05-07T10:35:28.799Z] [----------] 3 tests from test_context
[2024-05-07T10:35:28.799Z] [ RUN ] test_context.create_destroy
[2024-05-07T10:35:28.799Z] [ OK ] test_context.create_destroy (236 ms)
[2024-05-07T10:35:28.799Z] [ RUN ] test_context.init_multiple
[2024-05-07T10:35:29.727Z] [ OK ] test_context.init_multiple (983 ms)
[2024-05-07T10:35:29.727Z] [ RUN ] test_context.global
[2024-05-07T11:13:33.528Z] For more details, please look at:
[2024-05-07T11:13:33.528Z] /scrap/jenkins/workspace/ucc/.ci/scripts/cov-build/build-log.txt
[2024-05-07T11:13:33.528Z] Running Coverity analysis...
[2024-05-07T11:13:33.528Z] Coverity Static Analysis version 2021.12.1 on Linux 5.4.0-107-generic x86_64
[2024-05-07T11:13:33.528Z] Internal version numbers: 5269ff0e19 p-2021.12-push-625
[2024-05-07T11:13:33.528Z]
Note that the test_context.global test hangs. In another PR it's different (probably a CI issue):
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown
Hi Sergey, in the other PR (https://github.com/openucx/ucc/pull/958), @B-a-S fixed and reran the failing pipeline. He posted a link to the new log (http://hpc-master.lab.mtl.com:8080/job/ucc/3282/), which is hanging in the same test_context.global test that this PR is hanging in.
@Sergei-Lebedev I ran the hanging gtest manually with my nsarka/active-set-broadcast branch and it passed, which suggests this is a CI issue:
[ RUN ] test_context.global
[ OK ] test_context.global (2061 ms)
@Sergei-Lebedev Hey Sergey, I updated the PR so that the root is interpreted in terms of the ucc_team only if the active_set flag was set on coll_args. This fixed the hang in tl/mlx5's ucc_service_coll_bcast, which is used for broadcasting protection domain details and hardcodes the root to 0 in terms of the node-level subset rather than that rank's number in the ucc team. tl/sharp and, I think, some others do this too.
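In case it helps readers, here is an illustrative sketch (not the PR's actual code) of the root handling described above, assuming a positive stride and the active_set fields shown earlier.

```c
#include <ucc/api/ucc.h>

/* Illustrative only: how a bcast root could be translated depending on
 * whether the caller set the active-set fields. Positive stride assumed. */
static uint64_t bcast_root_in_subset(const ucc_coll_args_t *args)
{
    if (args->mask & UCC_COLL_ARGS_FIELD_ACTIVE_SET) {
        /* User-facing active-set bcast: args->root is given as a ucc_team
         * rank, so map it to its index within start/stride/size. */
        return (args->root - args->active_set.start) /
               (uint64_t)args->active_set.stride;
    }
    /* Internal service collectives (e.g. tl/mlx5's bcast of protection
     * domain details) already pass the root relative to their node-level
     * subset, typically hardcoded to 0, so use it as-is. */
    return args->root;
}
```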