TL/UCP: Add support for active_set broadcast with knomial and dbt for size greater than 2
Active set is a sort of lightweight subcommunicator for TL/UCP and TL/NCCL. It was originally added as a hack to enable point-to-point communication. This PR:
- Removes the restriction that the active set comm size must be 2
- Fixes a bug where the user tag was not being properly read
- Adds support for active set double binary tree broadcast in TL/UCP
- Updates the active set gtest to more thoroughly test all of the above

A hedged usage sketch of the active set API is shown below the list.
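For context, here is a minimal sketch of posting a broadcast over an active set larger than 2. The field names (coll_args.active_set with start/stride/size and the UCC_COLL_ARGS_FIELD_ACTIVE_SET mask bit) follow ucc.h, but double-check them against your UCC version; this is an illustration, not code taken from this PR.

```c
#include <ucc/api/ucc.h>

/* Minimal sketch: broadcast over an active set of 4 consecutive ranks.
 * Field names follow ucc.h; verify against your UCC version, this is an
 * illustration rather than code from this PR. */
ucc_status_t bcast_active_set(ucc_team_h team, ucc_context_h ctx,
                              void *buf, size_t count, uint64_t root)
{
    ucc_coll_args_t args = {0};
    ucc_coll_req_h  req;
    ucc_status_t    st;

    args.mask              = UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args.coll_type         = UCC_COLL_TYPE_BCAST;
    args.root              = root;               /* rank within the ucc_team */
    args.src.info.buffer   = buf;
    args.src.info.count    = count;
    args.src.info.datatype = UCC_DT_FLOAT32;
    args.src.info.mem_type = UCC_MEMORY_TYPE_HOST;
    /* Active set = ranks start, start+stride, ..., start+(size-1)*stride.
     * With this PR, size is no longer restricted to 2. */
    args.active_set.start  = 0;
    args.active_set.stride = 1;
    args.active_set.size   = 4;

    st = ucc_collective_init(&args, &req, team);
    if (st != UCC_OK) {
        return st;
    }
    st = ucc_collective_post(req);
    if (st == UCC_OK) {
        while ((st = ucc_collective_test(req)) == UCC_INPROGRESS) {
            ucc_context_progress(ctx);
        }
    }
    ucc_collective_finalize(req);
    return st;
}
```

Which TL/UCP algorithm (knomial or dbt) actually runs would typically be selected through the UCC_TL_UCP_TUNE environment variable (for example UCC_TL_UCP_TUNE=bcast:@dbt), though the exact algorithm names and syntax depend on the build.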
Can one of the admins verify this patch?
ok to test
@nsarka What is the use case for enabling this? If removing the restriction - why only limit to broadcast?
This was requested by Almog from the cuBLAS team. I'm not sure the details after that. @Sergei-Lebedev do you know what Almog wanted with this?
The use case here is to be able to perform a "broadcast" on row/col comms without having to create those comms. For linear algebra functionality, when you use a 2DBC data layout, using row/col communicators simplifies the code and improves performance. The problem is that NCCL comm creation is very expensive. This way, we can use the active_set broadcast and the stride to perform a "row/col bcast" without paying the comm-split price, while also enjoying the ncclGroupStart/End functionality that is not exposed through UCC.
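To make the stride idea concrete, here is a hedged sketch assuming (for illustration only) a row-major P x Q process grid where rank = row * Q + col; the active_set start/stride/size fields then select one row or one column of the grid without ever creating a row/col sub-communicator.

```c
#include <ucc/api/ucc.h>

/* Hedged sketch of the row/col use case: a row-major P x Q grid with
 * rank = row * Q + col is assumed for illustration. */
static void set_row_bcast(ucc_coll_args_t *args, uint64_t row, uint64_t Q)
{
    args->mask             |= UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args->active_set.start  = row * Q; /* first rank of the row */
    args->active_set.stride = 1;       /* consecutive ranks     */
    args->active_set.size   = Q;       /* one rank per column   */
}

static void set_col_bcast(ucc_coll_args_t *args, uint64_t col,
                          uint64_t P, uint64_t Q)
{
    args->mask             |= UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args->active_set.start  = col;     /* first rank of the column */
    args->active_set.stride = Q;       /* jump a full row each hop */
    args->active_set.size   = P;       /* one rank per row         */
}
```

In both cases args->root would still name the root by its rank in the full ucc_team, which is the interpretation the root-handling fix further down settles on.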
@manjugv @Sergei-Lebedev @samnordmann Tested on @almogsegal's container - works as intended
bot:retest
@nsarka can you please check why gtest failed in CI?
@Sergei-Lebedev It seems the gtest passed. However, the UCC test failed with "Cancelling nested steps due to timeout": it ran for 2 hours, passed building and codestyle, and then hung for 4 hours in the UCC / Torch-UCC test until the 6-hour timeout.
@Sergei-Lebedev, it seems we're facing the same issue (something with the containers) on several PRs.
@janjust I think we see two different problems. Here it's
[2024-05-07T10:35:28.799Z] [----------] 3 tests from test_context
[2024-05-07T10:35:28.799Z] [ RUN ] test_context.create_destroy
[2024-05-07T10:35:28.799Z] [ OK ] test_context.create_destroy (236 ms)
[2024-05-07T10:35:28.799Z] [ RUN ] test_context.init_multiple
[2024-05-07T10:35:29.727Z] [ OK ] test_context.init_multiple (983 ms)
[2024-05-07T10:35:29.727Z] [ RUN ] test_context.global
[2024-05-07T11:13:33.528Z] For more details, please look at:
[2024-05-07T11:13:33.528Z] /scrap/jenkins/workspace/ucc/.ci/scripts/cov-build/build-log.txt
[2024-05-07T11:13:33.528Z] Running Coverity analysis...
[2024-05-07T11:13:33.528Z] Coverity Static Analysis version 2021.12.1 on Linux 5.4.0-107-generic x86_64
[2024-05-07T11:13:33.528Z] Internal version numbers: 5269ff0e19 p-2021.12-push-625
[2024-05-07T11:13:33.528Z]
Note that the test_context.global test hangs. In another PR it's different (probably a CI issue):
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown
Hi Sergey, in the other PR (https://github.com/openucx/ucc/pull/958), @B-a-S fixed and reran the failing pipeline. He posted a link to the new log (http://hpc-master.lab.mtl.com:8080/job/ucc/3282/), which is hanging in the same test_context.global test that this PR is hanging in.
@Sergei-Lebedev I ran the hanging gtest manually with my nsarka/active-set-broadcast branch and it passed, which suggests this is a CI issue:
[ RUN ] test_context.global
[ OK ] test_context.global (2061 ms)
@Sergei-Lebedev Hey Sergey, I updated the PR so that the root is interpreted in terms of the ucc_team only if the active_set flag was set on coll_args. This fixed the hang in tl/mlx5's ucc_service_coll_bcast, which is used for broadcasting protection domain details and hardcodes the root to 0 in terms of the node-level subset rather than that rank's number in the ucc team. tl/sharp and, I think, some others do this too.
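In case it helps readers, here is an illustrative sketch (not the PR's actual code) of the root handling described above, assuming a positive stride and the active_set fields shown earlier.

```c
#include <ucc/api/ucc.h>

/* Illustrative only: how a bcast root could be translated depending on
 * whether the caller set the active-set fields. Positive stride assumed. */
static uint64_t bcast_root_in_subset(const ucc_coll_args_t *args)
{
    if (args->mask & UCC_COLL_ARGS_FIELD_ACTIVE_SET) {
        /* User-facing active-set bcast: args->root is given as a ucc_team
         * rank, so map it to its index within start/stride/size. */
        return (args->root - args->active_set.start) /
               (uint64_t)args->active_set.stride;
    }
    /* Internal service collectives (e.g. tl/mlx5's bcast of protection
     * domain details) already pass the root relative to their node-level
     * subset, typically hardcoded to 0, so use it as-is. */
    return args->root;
}
```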