accelerate
Cannot submit jobs with different num_processes with SLURM
Hi, my SLURM bash script looks like:
```bash
accelerate launch \
    --mixed_precision=bf16 \
    --machine_rank 0 \
    --num_machines 1 \
    --main_process_port 11135 \
    --num_processes $GPUS_PER_NODE \
    fastcomposer/train.py \
```
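(For context, a minimal sketch of the kind of sbatch script this line sits in, assuming GPUS_PER_NODE is derived from SLURM's GPU allocation; the #SBATCH values and paths are placeholders:)

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:4        # placeholder: 4 GPUs on one node

# One common way to derive the GPU count from the SLURM allocation.
GPUS_PER_NODE=${SLURM_GPUS_ON_NODE:-4}

accelerate launch \
    --mixed_precision=bf16 \
    --num_processes $GPUS_PER_NODE \
    fastcomposer/train.py
```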
I need to run accelerate config interactively first and also set distributed_type to multi_gpu so that it runs with 4 GPUs.
However, when I want to submit a new interactive job with only 1 process, I need to run accelerate config again to make it run on a single GPU. So I think there may be some bug when my 4-GPU task resumes from the checkpoint. Are there any suggestions?
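For reference, accelerate config writes a single cached file that every subsequent launch picks up by default, which seems to be where the conflict comes from. A rough sketch of what it holds after configuring for 4 GPUs (assumed default path, only a subset of fields shown):

```bash
# Default cache location used by accelerate (assumed; may differ per setup).
cat ~/.cache/huggingface/accelerate/default_config.yaml
# compute_environment: LOCAL_MACHINE
# distributed_type: MULTI_GPU
# mixed_precision: bf16
# num_machines: 1
# num_processes: 4
```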
You can just not use accelerate config in this instance. E.g.:
```bash
accelerate launch \
    --mixed_precision=bf16 \
    --machine_rank 0 \
    --num_machines 1 \
    --main_process_port 11135 \
    --num_processes $GPUS_PER_NODE \
    fastcomposer/train.py \
```
(This alone should be just fine and automatically work btw)
or:
```bash
accelerate launch \
    --multi_gpu \
    --mixed_precision=bf16 \
    --machine_rank 0 \
    --num_machines 1 \
    --main_process_port 11135 \
    --num_processes $GPUS_PER_NODE \
    fastcomposer/train.py \
```
Also, even though it's SLURM, if it's just one machine you don't need to add --machine_rank and --num_machines.
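For the single-GPU debug case the same pattern should apply; a minimal sketch (assuming one process and no --multi_gpu is what you want when debugging):

```bash
accelerate launch \
    --mixed_precision=bf16 \
    --num_processes 1 \
    fastcomposer/train.py
```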
Thanks for your reply!
My problem is that I sometimes debug on SLURM with a single GPU, and I find that distributed_type can only be modified by accelerate config, but running that may affect my multi-GPU jobs.
Running

```bash
accelerate launch \
    --mixed_precision=bf16 \
    --machine_rank 0 \
    --num_machines 1 \
    --main_process_port 11135 \
    --num_processes $GPUS_PER_NODE \
    fastcomposer/train.py \
```

does not change the default_config.yaml in my case, and accelerate.state still shows num_processes=1, but after I run accelerate config it is OK.
Maybe I can try export ACCELERATE_CONFIG_FILE=/path/to/my_accelerate_config.yaml when running on a single GPU? Does that make sense to you?
If you don't have a config file and just pass in --multi_gpu it will be just fine.
You can also pass in --num_processes {x}, which will help.
To point to a config file, you can do accelerate launch --config_file {ENV_VAR}, which would be the easiest solution here, as all of your configs can be stored there; you can keep these config files on a shared filesystem somewhere the server can reach (or that you can grab from, etc.). This is what I tend to do.
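A hedged sketch of that setup, with placeholder paths and an assumed minimal single-GPU config (the field names follow what accelerate config normally writes):

```bash
# Placeholder location on a shared filesystem the compute nodes can reach.
CFG_DIR=$HOME/accelerate_configs
mkdir -p $CFG_DIR

# Minimal single-GPU config for debug jobs ('NO' is quoted because plain NO is a YAML boolean).
cat > $CFG_DIR/single_gpu.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
mixed_precision: bf16
num_machines: 1
num_processes: 1
EOF

# Each job points at its own config, so nothing touches the cached default_config.yaml.
accelerate launch --config_file $CFG_DIR/single_gpu.yaml fastcomposer/train.py
```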
Thank you so much!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.