MPI instructions missing local rank?

Open daire-byrne opened this issue 2 weeks ago • 0 comments

I tried the MPI code changes described in 03-job-launchers/README.md, but soon realised that the local rank was missing. I see that you added it as a command-line argument, but wouldn't it be better to read it from the OMPI_COMM_WORLD_LOCAL_RANK environment variable?

I made these changes:

-    rank = int(os.getenv("RANK", "0"))
-    local_rank = rank % torch.cuda.device_count()
-    world_size = int(os.getenv("WORLD_SIZE", "1"))
+    rank = int(os.getenv("OMPI_COMM_WORLD_RANK", "0"))
+    local_rank = int(os.getenv("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
+    world_size = int(os.getenv("OMPI_COMM_WORLD_SIZE", "1"))

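If it helps, the same lookup could also fall back to the torchrun-style variables so that one script works under both launchers. This is only a rough sketch: `resolve_ranks` is a hypothetical helper name (not something from the guide), and it assumes Open MPI's `OMPI_COMM_WORLD_*` variables should take priority over `RANK`, `LOCAL_RANK` and `WORLD_SIZE` when both are set.

```python
import os


def resolve_ranks(env=None):
    """Return (rank, local_rank, world_size) from launcher environment variables.

    Hypothetical helper: prefers Open MPI's OMPI_COMM_WORLD_* variables and
    falls back to the torchrun-style RANK / LOCAL_RANK / WORLD_SIZE.
    """
    env = os.environ if env is None else env

    def first_int(names, default):
        # Use the first variable that is set; otherwise the single-process default.
        for name in names:
            value = env.get(name)
            if value is not None:
                return int(value)
        return default

    rank = first_int(("OMPI_COMM_WORLD_RANK", "RANK"), 0)
    local_rank = first_int(("OMPI_COMM_WORLD_LOCAL_RANK", "LOCAL_RANK"), 0)
    world_size = first_int(("OMPI_COMM_WORLD_SIZE", "WORLD_SIZE"), 1)
    return rank, local_rank, world_size


if __name__ == "__main__":
    rank, local_rank, world_size = resolve_ranks()
    print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

With neither set of variables present it falls back to a single-process setup (rank 0, local rank 0, world size 1), which matches the defaults in my diff above.
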
daire-byrne, Nov 18 '25 10:11