NeMo icon indicating copy to clipboard operation
NeMo copied to clipboard

Remove NCCL_NET_GDR_LEVEL and NCCL_NET_GDR_C2C environment variables.

Open sudostock opened this issue 1 month ago • 5 comments

These variables were a workaround for NCCL versions older than 2.27.x and are no longer needed in this context.

  • Removed setting of these variables from scripts/performance/executors.py

sudostock avatar Dec 08 '25 17:12 sudostock

Can we add conditional checks instead of removing? We can do a torch NCCL version check.

sanandaraj5597 avatar Dec 08 '25 18:12 sanandaraj5597

Can we add conditional checks instead of removing? We can do a torch NCCL version check.

@sanandaraj5597 We can, I removed it outright since the latest NeMo container releases have a new enough NCCL version to not be needed.

sudostock avatar Dec 08 '25 21:12 sudostock

@sanandaraj5597 Actually how would that work? The executor script is run during job launch on the local environment and outside the container env. We won't know what the NCCL version is until after the job starts.

sudostock avatar Dec 08 '25 22:12 sudostock

Ack on your point about executor script launch. I also agree that's a problem. Why do we want to remove this? I don't want someone in the future to run an older container to debug a regression and this change causing the debug to be harder.

sanandaraj5597 avatar Dec 09 '25 02:12 sanandaraj5597

We recently had an internal team run into issues with Nemotron4 and checkpointing when those flags were set when using a Nemo container with NCCL 2.27.x+ (can link slack thread if interested)

In general I'm leery about leaving unneeded vars around given potential unintended consequences later. I also don't recall right now how big of an impact those settings actually were on perf.

Is there a way with the current structure of Nemo to do a run time check and inject the settings there?

sudostock avatar Dec 09 '25 23:12 sudostock

We recently had an internal team run into issues with Nemotron4 and checkpointing when those flags were set when using a Nemo container with NCCL 2.27.x+

Shouldn't this be an issue NCCL should fix for backward compatibility? Not sure if M-Bridge is the right place to fix this. But anyways if this is blocking your work, I'm okay removing this but I would fix the actual problem in NCCL than to workaround by patching M-Bridge.

sanandaraj5597 avatar Dec 10 '25 17:12 sanandaraj5597