RFC: Remove `num_nodes` Trainer argument and infer world size from cluster environment directly

Open awaelchli opened this issue 3 years ago • 4 comments

🚀 Feature

Remove the redundant num_nodes Trainer argument. Knowing the number of nodes is not required, and the world size is provided by the cluster environment anyway.

Motivation

Users have always struggled to get multi-node training working because there are many points of failure in the configuration. One of them is setting the number of devices and nodes: in Lightning you are required to set devices=n and num_nodes=m such that they match the world_size=n*m set by the cluster or launcher (torchrun, SLURM, etc.). When num_nodes is not set correctly, too few processes join the process group and the program gets stuck.
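To make the failure mode concrete, here is a minimal sketch of the current situation under SLURM (the job script values are hypothetical): the product devices * num_nodes passed to the Trainer has to match the world size implied by the job script, and forgetting num_nodes leaves it at the default of 1, so the process group waits for processes that never join.

```python
# Hypothetical SLURM job requesting 2 nodes x 4 GPUs, i.e. world_size = 8:
#   #SBATCH --nodes=2
#   #SBATCH --ntasks-per-node=4
#   #SBATCH --gpus-per-node=4

import pytorch_lightning as pl

# Today, both values must be duplicated in code and kept in sync with the job script.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,     # per-node process/GPU count, must match --ntasks-per-node
    num_nodes=2,   # must match --nodes; if left at the default 1, init hangs
    strategy="ddp",
)
```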

We periodically receive issues and questions where it becomes clear that users forgot to set these values correctly (#13804, #14022, #10098, #8707, #7429, #6206, to name a few). But technically, Lightning never needs to know the number of nodes, only the world size, and that value can be read from the ClusterEnvironment directly.
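As a rough sketch of what "read from the ClusterEnvironment directly" means (using the existing SLURMEnvironment plugin as an example; the exact wiring inside the Trainer is not specified here):

```python
from pytorch_lightning.plugins.environments import SLURMEnvironment

env = SLURMEnvironment()

# The scheduler already knows the total number of processes in the job
# (for SLURM this comes from variables like SLURM_NTASKS), so the world
# size can be obtained without ever multiplying devices * num_nodes.
world_size = env.world_size()
global_rank = env.global_rank()
node_rank = env.node_rank()
```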

The conclusion: num_nodes in the Trainer is redundant.

Another benefit of not having to specify the number of nodes is that the cluster can dynamically allocate nodes and devices based on available resources. In the end, a user who wants to train on 8 GPUs does not care whether they are spread over 2 nodes with 4 GPUs each or over 4 nodes with 2 GPUs each.

Historically, num_nodes was introduced before the ClusterEnvironment abstraction became the primary way of collecting information about the cluster.

Pitch

Deprecate the redundant num_nodes Trainer argument.
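A sketch of what this would mean from the user's perspective (the "after" form is illustrative of the proposal, not a finalized API):

```python
import pytorch_lightning as pl

# Today: devices * num_nodes must be kept in sync with the cluster job.
trainer = pl.Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy="ddp")

# Proposed: only the per-node device count is given; the world size is
# inferred from the ClusterEnvironment (SLURM, torchrun, ...).
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")
```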

Pros:

  • One less thing for users to get wrong
  • One less step to explain in the docs
  • Code simplifies
  • It is no longer required to know the number of nodes ahead of time

Cons:

  • None currently known

Alternatives

Keep it, but validate the num_nodes setting against the cluster environment (#10107)
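For comparison, a minimal sketch of what such a validation could look like (the helper below is hypothetical; #10107 tracks the actual proposal):

```python
from pytorch_lightning.plugins.environments import ClusterEnvironment


def validate_num_nodes(num_devices: int, num_nodes: int, env: ClusterEnvironment) -> None:
    # Hypothetical check: fail fast with a clear message instead of hanging
    # when devices * num_nodes does not match the cluster's world size.
    expected = env.world_size()
    if num_devices * num_nodes != expected:
        raise ValueError(
            f"`devices * num_nodes` is {num_devices * num_nodes}, but the cluster "
            f"environment reports a world size of {expected}. Adjust `num_nodes` "
            "(or `devices`) to match the job configuration."
        )
```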

Additional context

#13506, #7361, #13605


cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta @ananthsub @borda

awaelchli · Aug 07 '22

+1000 for this :D I think if there are setups where this doesn't work with the cluster env, we should rather think about implementing those missing cluster envs :)

justusschock · Aug 08 '22

Noob q:

Is it possible that num_nodes < world_size / gpus_per_node, i.e. the user has access to more nodes in their cluster env than they actually need for their experiment?

rohitgr7 · Aug 08 '22

@rohitgr7 The ClusterEnvironment here refers only to the environment the current "job" or "experiment" runs in; it does not provide information about the entire cluster. When we talk about the world size or the number of nodes, it is always about how many resources are used for the current job, not the total resources available in the cluster.

awaelchli · Aug 08 '22

got it!

haven't trained on multi-node, still learning!! :)

rohitgr7 · Aug 08 '22

Why should devices be equal to --ntasks-per-node? devices is the number of GPUs, while --ntasks-per-node determines the number of processes on the current node. Then how is data parallelism achieved when each GPU has only one process, and that one process has to work with a batch_size like 32 or 64, especially for image datasets?

x423xu · Jan 06 '23

@x423xu That's how data-parallel training works: one process per GPU, with each process training the model on a different portion of the data.

This is different from data loading. In each process, you can set DataLoader(num_workers=y) to whatever value you find works well. The two settings, devices=x and num_workers=y, are independent of each other; they are two different concepts. Sorry if this is confusing!
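To make the distinction concrete, a small sketch (the numbers are arbitrary): devices controls how many training processes/GPUs are launched, while num_workers controls how many data-loading workers each of those processes spawns.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# 4 GPUs -> 4 training processes, each working on a different shard of the batches.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")

# Placeholder dataset; each of the 4 training processes starts its own
# 8 data-loading workers, independently of the device count above.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
train_loader = DataLoader(dataset, batch_size=32, num_workers=8)
```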

I think this topic is a bit off from what is discussed here on the issue. But if you have any further questions, feel free to ping me on our community slack or in our forum.

awaelchli · Jan 06 '23

@awaelchli Thanks, I do have some questions and will move them to Slack.

x423xu · Jan 06 '23