how to set `num_processes` in multi-node training
Is it the total number of GPUs or the number of GPUs on a single node? I have seen contradictory signals in the code.
https://github.com/huggingface/accelerate/blob/ee004674b9560976688e1a701b6d3650a09b2100/docs/source/usage_guides/ipex.md?plain=1#L139 https://github.com/huggingface/accelerate/blob/ee004674b9560976688e1a701b6d3650a09b2100/src/accelerate/state.py#L154 Here, it seems to be the total number of GPUs.
https://github.com/huggingface/accelerate/blob/ee004674b9560976688e1a701b6d3650a09b2100/examples/slurm/submit_multigpu.sh#L27 Here, it seems to be the number of GPUs per node.
It is the total number of GPUs; the launcher then divides it by num_machines. (That SLURM example is possibly wrong.)
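Roughly speaking, the idea is the following (a sketch of the arithmetic only, not the actual launcher code):

```python
# Hypothetical illustration: with --num_processes=32 and --num_machines=4,
# each node ends up running 32 // 4 = 8 processes (one per local GPU).
num_processes = 32       # total GPUs across all nodes
num_machines = 4         # number of nodes
processes_per_node = num_processes // num_machines  # -> 8
```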
Given 4 nodes and 8 GPUs per node, do you mean that --num_processes in the bash script should be 32, but in the Python code it will then be reduced to 32/4 = 8? Why should you reduce it?
I'm stating that the launcher will reduce it. --num_processes is the total number of GPUs, and it assumes each node has the same number of GPUs. So rather than --nproc_per_node=2 --nnodes=2, you just set --num_processes=4 plus the multi-node setup in this situation.
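For concreteness, a sketch of the corresponding launch command for the 4-node, 8-GPUs-per-node case discussed above (the IP address, port, and train.py are placeholders); the same command is run on every node, with only --machine_rank changing:

```bash
# Run on every node: --num_processes is the TOTAL number of GPUs (4 nodes x 8 GPUs = 32).
# Only --machine_rank differs per node (0 on the main node, 1..3 on the others).
accelerate launch \
  --multi_gpu \
  --num_machines 4 \
  --num_processes 32 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  train.py
```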
OK, I got it. Furthermore, in the above case with 4 nodes and 8 GPUs per node, how do I get the global rank and world size?
I think the expected values are: global_rank $\in$ [0, 31] and world_size = 32. But when I use os.environ['RANK'] and os.environ['WORLD_SIZE'], they are $\in$ [0, 31] and equal 8, respectively.
Besides, the code guarded by if accelerator.is_main_process: would still run 4 times (once on each node). Is this the expected behavior?
num_processes and process_index get their information from torch.distributed.get_world_size() and torch.distributed.get_rank().
if accelerator.is_main_process should only run on the main node and its first process. is_local_main_process would be run 4 times, once on each node.
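To make the expected behaviour concrete, a small sketch (assuming the script is started with accelerate launch on every node as described above; the prints are purely illustrative):

```python
import os
from accelerate import Accelerator

accelerator = Accelerator()

# With 4 nodes x 8 GPUs each, these should report 32, 0..31, and 0..7.
accelerator.print(f"world size:  {accelerator.num_processes}")       # 32, printed once
print(f"global rank: {accelerator.process_index}")                   # 0..31
print(f"local rank:  {accelerator.local_process_index}")             # 0..7 on each node

if accelerator.is_main_process:
    # Runs exactly once in the whole job (global rank 0),
    # e.g. for saving checkpoints or logging to a tracker.
    print("global main process")

if accelerator.is_local_main_process:
    # Runs once per node (local rank 0), i.e. 4 times in a 4-node job,
    # e.g. for downloading data to node-local storage.
    print("main process on this node")

# The same information as exposed by the launcher's environment variables.
print(os.environ.get("RANK"), os.environ.get("WORLD_SIZE"), os.environ.get("LOCAL_RANK"))
```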
@muellerzr sorry to piggyback on this thread. I'm running a setup with two nodes: one node has 4 GPUs and the other has 1. I'd like to utilize this mixed setup. Can I provide something like "--nproc_per_node" to override Accelerate's default setting, which assumes the number of GPUs is equal across nodes? It's currently causing the session to fail because it attempts to launch more than 1 process on the node with a single GPU.
@iantbutler01 Do you have any updates on that? I'd also like to specify a different number of processes per node.
I am also unable to use accelerate with a variable number of GPUs per node. I would love to know if there is a way to use accelerate in this setting.