[BUG]: Unable to figure out how to pass env variables for each node
Is there an existing issue for this bug?
- [x] I have searched the existing issues
The bug has not been fixed in the latest main branch
- [x] I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
I am trying to set up a multi-node run. I want to understand how I can set a different NCCL_SOCKET_IFNAME for each node. I tried setting it in each node's environment, but that doesn't seem to work. I also tried setting it as a system-wide OS environment variable, but that didn't work either. Is there any way to pass it separately for each node?
Environment
No response
@Gautam-Rajeev I think the master node's env vars are synced to the rest of the nodes:
https://github.com/hpcaitech/ColossalAI/blob/6d676ee0e95d54df90b4ee640dee0e0a198ab8f3/colossalai/cli/launcher/run.py#L280-L287
https://github.com/hpcaitech/ColossalAI/blob/6d676ee0e95d54df90b4ee640dee0e0a198ab8f3/colossalai/cli/launcher/multinode_runner.py#L47-L53
You might want to change the code a bit to allow a different NCCL_SOCKET_IFNAME per node, or simply clear the env dict and control everything on your own by setting the env vars on each node.
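If you would rather not patch the launcher, another option might be to override the variable inside the training script itself, keyed on the hostname. This is only a rough sketch, not something the launcher provides: `IFNAME_BY_HOST`, `select_ifname`, and `configure_nccl_ifname` are made-up names, and the hostnames and interface names are placeholders for your cluster. It also assumes the override runs before the distributed process group is initialized, since NCCL should pick up the value when communication is set up.

```python
import os
import socket

# Hypothetical mapping from hostname to network interface.
# Replace the keys and values with the ones from your own cluster.
IFNAME_BY_HOST = {
    "node-0": "eth0",
    "node-1": "ens5",
}


def select_ifname(hostname, mapping, default=None):
    """Return the interface configured for this host, or `default` if unknown."""
    return mapping.get(hostname, default)


def configure_nccl_ifname(mapping, hostname=None, environ=os.environ):
    """Set NCCL_SOCKET_IFNAME for the current host.

    Overrides any value already present (for example one synced from the
    master node). Leaves the environment untouched if the host is not
    in the mapping. Returns the interface that was set, or None.
    """
    hostname = hostname or socket.gethostname()
    ifname = select_ifname(hostname, mapping)
    if ifname is not None:
        environ["NCCL_SOCKET_IFNAME"] = ifname
    return ifname


if __name__ == "__main__":
    # Call this at the very top of the training script, before any
    # distributed initialization happens.
    chosen = configure_nccl_ifname(IFNAME_BY_HOST)
    print(f"{socket.gethostname()}: NCCL_SOCKET_IFNAME={chosen}")
```

Since every rank runs the same script, each node ends up with its own interface regardless of what was synced from the master node. Hosts missing from the mapping keep whatever value they already had.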