Remove NCCL_NET_GDR_LEVEL and NCCL_NET_GDR_C2C environment variables.
These variables were a workaround for NCCL versions older than 2.27.x and are no longer needed in this context.
- Removed setting of these variables from scripts/performance/executors.py
Can we add conditional checks instead of removing? We can do a torch NCCL version check.
Can we add conditional checks instead of removing? We can do a torch NCCL version check.
@sanandaraj5597 We can, I removed it outright since the latest NeMo container releases have a new enough NCCL version to not be needed.
@sanandaraj5597 Actually how would that work? The executor script is run during job launch on the local environment and outside the container env. We won't know what the NCCL version is until after the job starts.
Ack on your point about executor script launch. I also agree that's a problem. Why do we want to remove this? I don't want someone in the future to run an older container to debug a regression and this change causing the debug to be harder.
We recently had an internal team run into issues with Nemotron4 and checkpointing when those flags were set when using a Nemo container with NCCL 2.27.x+ (can link slack thread if interested)
In general I'm leery about leaving unneeded vars around given potential unintended consequences later. I also don't recall right now how big of an impact those settings actually were on perf.
Is there a way with the current structure of Nemo to do a run time check and inject the settings there?
We recently had an internal team run into issues with Nemotron4 and checkpointing when those flags were set when using a Nemo container with NCCL 2.27.x+
Shouldn't this be an issue NCCL should fix for backward compatibility? Not sure if M-Bridge is the right place to fix this. But anyways if this is blocking your work, I'm okay removing this but I would fix the actual problem in NCCL than to workaround by patching M-Bridge.