pytorch-lightning Multiple GPU per node could fail silently with KubeflowEnvironment

🐛 Bug

If a user submits a DDP job to a Kubeflow environment with multiple GPUs per node, following the multi-GPU docs and passing the right arguments (num_nodes and devices), one of the following happens:

1. WORLD_SIZE and RANK are set based on the total number of processes -> the job hangs, because creates_processes_externally=True prevents DDP from launching the other processes.
2. WORLD_SIZE and RANK are set based on the total number of nodes -> the job starts with only local rank 0 of each node participating in distributed training. The major issue here, apart from the idle GPUs, is that DDPStrategy still behaves as if every process exists and passes the full number of replicas to the distributed sampler:

...
        self.cluster_environment.set_global_rank(self.node_rank * self.num_processes + self.local_rank)
        self.cluster_environment.set_world_size(self.num_nodes * self.num_processes)

So each local rank 0 GPU gets only 1/num_processes of its node's share of the data, on the assumption that the other (idle) GPUs are processing the rest. Training therefore runs on only the subset of the dataset that was assigned to local rank 0 of each node. The user is unaware of this, since they assume they passed devices/gpus and num_nodes to the trainer correctly.
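To make the data loss concrete, here is a minimal sketch in plain Python. It does not call Lightning or torch; shard_indices is a hypothetical helper that emulates the strided split a distributed sampler performs (no shuffling, no padding), and the rank/world-size arithmetic mirrors the two DDPStrategy lines quoted above. The dataset size and cluster shape are made-up numbers.

```python
def shard_indices(dataset_len, world_size, rank):
    """Hypothetical helper: strided split, one shard per global rank."""
    return list(range(rank, dataset_len, world_size))


def covered_fraction(dataset_len, num_nodes, num_processes, active_local_ranks):
    """Fraction of the dataset seen when only some local ranks actually run."""
    # Same arithmetic as the DDPStrategy snippet quoted in this issue.
    world_size = num_nodes * num_processes
    seen = set()
    for node_rank in range(num_nodes):
        for local_rank in active_local_ranks:
            global_rank = node_rank * num_processes + local_rank
            seen.update(shard_indices(dataset_len, world_size, global_rank))
    return len(seen) / dataset_len


# Assumed setup: 2 nodes, 4 GPUs each, 1024 samples.
healthy = covered_fraction(1024, num_nodes=2, num_processes=4,
                           active_local_ranks=range(4))
silent_failure = covered_fraction(1024, num_nodes=2, num_processes=4,
                                  active_local_ranks=[0])
print(healthy, silent_failure)
```

With these assumed numbers, the run where only local rank 0 participates covers a quarter of the dataset, and nothing in the job output indicates that.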

To Reproduce

N/A (it's how KubeflowEnvironment works)

Expected behavior

I'm not sure if this is the expected behavior. I am using Google Vertex AI, which runs Kubeflow under the hood. When a PyTorch Lightning job is submitted to Vertex, PyTorch Lightning automatically selects KubeflowEnvironment as the cluster environment.

Please let me know if the expectation is to have a separate cluster environment class for something like Vertex AI. I'd be happy to create a PR to add the new environment. But the reasons I decided to report this as a bug are:

1. KubeflowEnvironment has two very specific requirements: (a) nodes with a single GPU and (b) manual creation of the processes. Neither of these requirements is related to or enforced by Kubeflow. They are also not mentioned in the docs, so the user wouldn't know about them without reading the code.
2. The detect method of KubeflowEnvironment can match any Kubernetes environment, and the rest of its methods basically implement a special case of LightningEnvironment where the user has to launch the processes manually.
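As an illustration only, the two failure modes described in this issue can be told apart by comparing WORLD_SIZE with the num_nodes and devices passed to the trainer. The function below is a hypothetical sketch, not part of Lightning's API; the name diagnose_world_size and the returned labels are assumptions made for this example.

```python
import os


def diagnose_world_size(num_nodes, devices, environ=os.environ):
    """Hypothetical check: classify WORLD_SIZE against the trainer arguments."""
    world_size = int(environ["WORLD_SIZE"])
    if devices == 1 and world_size == num_nodes:
        # The single-GPU-per-node case that KubeflowEnvironment assumes.
        return "consistent"
    if world_size == num_nodes * devices:
        # One process per GPU is expected; if they are not launched
        # externally, the job hangs waiting for the missing ranks.
        return "per_process"
    if world_size == num_nodes:
        # Only local rank 0 of each node runs; the sampler still
        # shards for num_nodes * devices replicas.
        return "per_node"
    return "mismatch"


# Assumed setup: 2 nodes with 4 GPUs each.
print(diagnose_world_size(2, 4, {"WORLD_SIZE": "8"}))
print(diagnose_world_size(2, 4, {"WORLD_SIZE": "2"}))
```

A check along these lines, wherever it lived, would let the "per_node" case raise an error or warning instead of training silently on a subset of the data.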

cc @awaelchli

Jun 22 '22 21:06 RamtinRassoli
