UCC returns an error when modifying non-existing TLs in context config
UCC returns an error when modifying non-existing TLs in the context config. I understand the reasoning for failing in such a case, but the caller has no way to know in advance whether a given TL exists. I'd expect one of the following behaviors:
- UCC doesn't fail if the TL doesn't exist in the context.
- UCC allows querying the context for existing TLs, so the user can modify only those and avoid the failure. For example:
UCC_CHECK(ucc_context_config_query(ctx_config, "tl/cuda", "VALUE", &has_cuda));
if (has_cuda) {
    UCC_CHECK(ucc_context_config_modify(ctx_config, "tl/cuda", "TUNE", "0"));
}
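To make the requested query-then-modify behavior concrete, here is a minimal sketch using a mock config. Everything prefixed with `mock_` is a hypothetical stand-in written for illustration, not part of the UCC API; the only assumption taken from the report is that modifying a TL that is not in the config returns an error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a context config (NOT the UCC API).
 * It only records which TL components are present. */
typedef struct {
    const char *const *tls;   /* e.g. "tl/ucp", "tl/cuda" */
    size_t             n_tls;
} mock_ctx_config_t;

typedef enum { MOCK_OK = 0, MOCK_ERR_NOT_FOUND = -1 } mock_status_t;

/* Hypothetical query: true if `component` is part of the config. */
bool mock_config_has_tl(const mock_ctx_config_t *cfg, const char *component)
{
    for (size_t i = 0; i < cfg->n_tls; i++) {
        if (strcmp(cfg->tls[i], component) == 0) {
            return true;
        }
    }
    return false;
}

/* Mimics the reported behavior: modifying a missing TL is an error.
 * The mock does not store anything, so name/value are ignored. */
mock_status_t mock_config_modify(const mock_ctx_config_t *cfg,
                                 const char *component,
                                 const char *name, const char *value)
{
    (void)name;
    (void)value;
    return mock_config_has_tl(cfg, component) ? MOCK_OK : MOCK_ERR_NOT_FOUND;
}

/* Requested behavior: query first, and treat a missing TL as a no-op. */
mock_status_t mock_config_modify_if_present(const mock_ctx_config_t *cfg,
                                            const char *component,
                                            const char *name,
                                            const char *value)
{
    if (!mock_config_has_tl(cfg, component)) {
        return MOCK_OK; /* nothing to tune, but not a failure */
    }
    return mock_config_modify(cfg, component, name, value);
}
```

With a config that contains only `tl/ucp` and `tl/cuda`, calling `mock_config_modify_if_present(&cfg, "tl/nccl", "TUNE", "0")` succeeds silently, while the plain `mock_config_modify` call returns `MOCK_ERR_NOT_FOUND`.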
@manjugv please review.
Below is what my output looks like when I run with 16 processes. The repeated errors make it hard to find what I need, and it gets worse as the process count grows.
[1667117403.755480] [luna-0062:1724877:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.839909] [luna-0062:1724878:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.842918] [luna-0063:551408:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.857859] [luna-0062:1724865:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.862395] [luna-0062:1724874:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.869124] [luna-0062:1724863:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.876273] [luna-0062:1724870:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.878267] [luna-0063:551407:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.891668] [luna-0063:551413:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.893574] [luna-0063:551415:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.900817] [luna-0062:1724852:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.902945] [luna-0063:551410:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.908191] [luna-0062:1724876:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.910438] [luna-0063:551414:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.920437] [luna-0063:551411:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117403.929928] [luna-0063:551409:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465478] [luna-0063:551408:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465504] [luna-0063:551409:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465552] [luna-0063:551413:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465626] [luna-0063:551414:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465313] [luna-0062:1724876:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465598] [luna-0063:551411:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465696] [luna-0063:551410:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465348] [luna-0062:1724877:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465342] [luna-0062:1724874:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465360] [luna-0062:1724865:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465739] [luna-0063:551407:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465417] [luna-0062:1724878:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465449] [luna-0062:1724870:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.466063] [luna-0063:551415:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465776] [luna-0062:1724863:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117404.465772] [luna-0062:1724852:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.608562] [luna-0062:1724852:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609375] [luna-0063:551407:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609399] [luna-0063:551415:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609489] [luna-0063:551413:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609213] [luna-0062:1724870:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609576] [luna-0063:551410:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609332] [luna-0062:1724878:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609749] [luna-0063:551408:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609793] [luna-0063:551409:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609545] [luna-0062:1724863:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609594] [luna-0062:1724876:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609811] [luna-0062:1724877:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.610080] [luna-0063:551411:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.609831] [luna-0062:1724874:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.610269] [luna-0063:551414:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117405.610001] [luna-0062:1724865:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.252270] [luna-0062:1724878:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.252486] [luna-0062:1724870:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.252972] [luna-0063:551407:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.252988] [luna-0063:551410:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.253180] [luna-0063:551408:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.253213] [luna-0063:551414:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.252935] [luna-0062:1724877:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.252946] [luna-0062:1724865:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.275740] [luna-0062:1724876:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.276116] [luna-0063:551411:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.276354] [luna-0063:551413:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.276087] [luna-0062:1724874:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.278249] [luna-0063:551409:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.277894] [luna-0062:1724852:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.278279] [luna-0063:551415:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.278038] [luna-0062:1724863:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.835123] [luna-0063:551414:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.835124] [luna-0063:551410:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.835127] [luna-0063:551411:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.835126] [luna-0063:551409:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.836288] [luna-0063:551408:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.836295] [luna-0063:551407:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.836293] [luna-0063:551415:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.836305] [luna-0063:551413:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.853934] [luna-0062:1724865:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.853936] [luna-0062:1724870:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.853941] [luna-0062:1724876:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.853936] [luna-0062:1724852:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.857540] [luna-0062:1724877:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.857543] [luna-0062:1724878:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.857547] [luna-0062:1724874:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
[1667117406.857540] [luna-0062:1724863:0] ucc_context.c:240 UCC ERROR required TL nccl is not part of the context
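Until the error is silenced at the source, output like the above can be collapsed on the reader's side. The following is a rough sketch, assuming every line starts with the two bracketed fields seen in this log (`[timestamp] [host:pid:tid]`); the helper names `log_body` and `count_distinct_bodies` are made up for this example and are not UCC utilities.

```c
#include <stddef.h>
#include <string.h>

/* Skip up to two leading "[...]" fields and return the rest of the line.
 * Lines that do not match the assumed layout are returned from the point
 * where parsing stopped. */
const char *log_body(const char *line)
{
    const char *p = line;
    for (int field = 0; field < 2; field++) {
        while (*p == ' ') {
            p++;
        }
        if (*p != '[') {
            return p;
        }
        const char *close = strchr(p, ']');
        if (close == NULL) {
            return p;
        }
        p = close + 1;
    }
    while (*p == ' ') {
        p++;
    }
    return p;
}

/* Count distinct message bodies among n lines.
 * Quadratic in n, which is fine for a few hundred log lines. */
size_t count_distinct_bodies(const char *const *lines, size_t n)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < i; j++) {
            if (strcmp(log_body(lines[i]), log_body(lines[j])) == 0) {
                break;
            }
        }
        if (j == i) {
            distinct++; /* first time this body is seen */
        }
    }
    return distinct;
}
```

Applied to the dump above, every line reduces to the same body once the timestamp and process fields are stripped, so the whole block amounts to a single distinct message.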
@almogsegal does https://github.com/openucx/ucc/pull/667 work for you?
@manjugv it definitely does. Thank you! In the long term, I think it would be useful to be able to query the context as I suggested, so that libraries and other consumers of UCC can give their users performance hints. For example:
UCC was not compiled with NCCL support. To achieve better performance, consider recompiling with NCCL.
@almogsegal fyi just merged #667