
how to set `num_processes` in multi-node training

Open · lxww302 opened this issue 1 year ago

Is it the total number of GPUs or the number of GPUs on a single node? I have seen contradictory signals in the code.

https://github.com/huggingface/accelerate/blob/ee004674b9560976688e1a701b6d3650a09b2100/docs/source/usage_guides/ipex.md?plain=1#L139 https://github.com/huggingface/accelerate/blob/ee004674b9560976688e1a701b6d3650a09b2100/src/accelerate/state.py#L154 Here, it seems to be the total number of GPUs.

https://github.com/huggingface/accelerate/blob/ee004674b9560976688e1a701b6d3650a09b2100/examples/slurm/submit_multigpu.sh#L27 Here, it seems to be the number of GPUs per node.

lxww302 avatar Mar 04 '24 13:03 lxww302

It is the total number of GPUs; we then reduce it by num_machines. (That SLURM example may be wrong.)

muellerzr avatar Mar 04 '24 14:03 muellerzr
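
For concreteness, a sketch of what this looks like for a 4-node, 8-GPU-per-node job (the hostname, port, and script name are placeholders): the same command is run on every node, changing only `--machine_rank`.

```bash
# Run on every node, changing only --machine_rank (0..3).
# --num_processes is the TOTAL number of GPUs (4 nodes x 8 GPUs = 32);
# the launcher derives the 8 processes per node from --num_machines.
accelerate launch \
  --multi_gpu \
  --num_processes 32 \
  --num_machines 4 \
  --machine_rank 0 \
  --main_process_ip node0.example.com \
  --main_process_port 29500 \
  train.py
```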

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 03 '24 15:04 github-actions[bot]

> It is the total number of GPUs; we then reduce it by num_machines. (That SLURM example may be wrong.)

Given 4 nodes and 8 GPUs per node, do you mean `--num_processes` in the bash script should be 32, but in the Python code it will then be reduced to 32/4 = 8? Why should it be reduced?

ygtxr1997 avatar May 09 '24 13:05 ygtxr1997

I'm stating that the launcher will reduce it. `--num_processes` is the total number of GPUs and assumes each node has the same number of GPUs. So rather than `--nproc_per_node=2 --nnodes=2`, you just set `--num_processes=4` plus the multi-node setup in this situation.

muellerzr avatar May 09 '24 13:05 muellerzr
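
To spell out that comparison for 2 nodes with 2 GPUs each (hostname, port, and script name are placeholders):

```bash
# torchrun counts processes *per node*:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=node0.example.com --master_port=29500 train.py

# accelerate launch counts the *total* number of processes (2 nodes x 2 GPUs = 4):
accelerate launch --multi_gpu --num_processes 4 --num_machines 2 --machine_rank 0 \
    --main_process_ip node0.example.com --main_process_port 29500 train.py
```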

> I'm stating that the launcher will reduce it. `--num_processes` is the total number of GPUs and assumes each node has the same number of GPUs. So rather than `--nproc_per_node=2 --nnodes=2`, you just set `--num_processes=4` plus the multi-node setup in this situation.

OK, I got it. Furthermore, in the above case of 4 nodes and 8 GPUs per node, how do I get the global rank and world size? I think the expected values are global_rank $\in$ [0, 31] and world_size = 32. But when I use `os.environ['RANK']` and `os.environ['WORLD_SIZE']`, they are $\in$ [0, 31] and equal to 8, respectively. Besides, the code inside the condition `if accelerator.is_main_process:` would still run 4 times (once on each node). Is this the expected behavior?

ygtxr1997 avatar May 09 '24 14:05 ygtxr1997

`num_processes` and `process_index` get their information from `torch.distributed.get_world_size()` and `torch.distributed.get_rank()`.

`if accelerator.is_main_process` should only run on the main node, and only in its first process. `is_local_main_process` would run 4 times, once on each node.

muellerzr avatar May 09 '24 15:05 muellerzr
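
A minimal sketch of how these values look from inside the training script, assuming the 4-node, 8-GPU-per-node job discussed above:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# With 4 nodes x 8 GPUs: num_processes == 32, process_index runs over [0, 31],
# and local_process_index runs over [0, 7] on each node.
print(accelerator.num_processes, accelerator.process_index, accelerator.local_process_index)

if accelerator.is_main_process:
    # Runs exactly once in the whole job (global rank 0 only).
    print("global main process")

if accelerator.is_local_main_process:
    # Runs once per node (4 times in this example).
    print("local main process")
```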

@muellerzr sorry to piggyback on this thread. I'm running a setup with two nodes: one node has 4 GPUs and the other has 1. I'd like to utilize this mixed setup. Can I provide something like `--nproc-per-node` to override Accelerate's default setting, which assumes the number of GPUs is equal across nodes? It's currently causing the session to fail because it attempts to launch more than 1 process on the node with a single GPU.

iantbutler01 avatar May 14 '24 07:05 iantbutler01

@iantbutler01 Do you have any updates on that? I'd also like to specify a different number of processes per node.

francescotaioli avatar Apr 25 '25 19:04 francescotaioli

I am also unable to use accelerate with a variable number of GPUs per node. I would love to know if there is a way to use accelerate in this setting.

pradyumnaym avatar Aug 02 '25 17:08 pradyumnaym