
Cannot submit jobs with different num_processes with SLURM

Open JoyHuYY1412 opened this issue 1 year ago • 5 comments

Hi, my SLURM bash script looks like:

```bash
accelerate launch \
  --mixed_precision=bf16 \
  --machine_rank 0 \
  --num_machines 1 \
  --main_process_port 11135 \
  --num_processes $GPUS_PER_NODE \
  fastcomposer/train.py \
```

I first need to run `accelerate config` interactively and set `distributed_type` to `multi_gpu` so that it runs with 4 GPUs.

However, when I submit a new interactive job with only 1 process, I have to run `accelerate config` again to make it run on the single GPU. I also suspect this causes a bug when my 4-GPU task resumes from its checkpoint. Are there any suggestions?

JoyHuYY1412 avatar Feb 26 '24 17:02 JoyHuYY1412

You can simply skip `accelerate config` in this instance. E.g.:

```bash
accelerate launch \
  --mixed_precision=bf16 \
  --machine_rank 0 \
  --num_machines 1 \
  --main_process_port 11135 \
  --num_processes $GPUS_PER_NODE \
  fastcomposer/train.py \
```

(This alone should be just fine and work automatically, btw.)

or:

```bash
accelerate launch \
  --multi_gpu \
  --mixed_precision=bf16 \
  --machine_rank 0 \
  --num_machines 1 \
  --main_process_port 11135 \
  --num_processes $GPUS_PER_NODE \
  fastcomposer/train.py \
```

muellerzr avatar Feb 26 '24 17:02 muellerzr

Also, even though it's SLURM, if it's just one machine you don't need to add `--machine_rank` and `--num_machines`.
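
For example, here's a minimal sketch of a SLURM script where the same launch line works for both a 4-GPU job and a 1-GPU debug job (assuming `SLURM_GPUS_ON_NODE` reflects your allocation; the fallback value is illustrative):

```bash
#!/bin/bash
#SBATCH --gres=gpu:4            # request 4 GPUs; change to gpu:1 for debugging

# Derive the process count from the allocation instead of a config file.
GPUS_PER_NODE=${SLURM_GPUS_ON_NODE:-1}

accelerate launch \
  --mixed_precision=bf16 \
  --main_process_port 11135 \
  --num_processes $GPUS_PER_NODE \
  fastcomposer/train.py
```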

muellerzr avatar Feb 26 '24 17:02 muellerzr


Thanks for your reply! My problem is that I sometimes debug on SLURM with a single GPU, and I find that `distributed_type` can only be modified by `accelerate config`, but running that may affect my multi-GPU jobs.

In my case, running

```bash
accelerate launch \
  --mixed_precision=bf16 \
  --machine_rank 0 \
  --num_machines 1 \
  --main_process_port 11135 \
  --num_processes $GPUS_PER_NODE \
  fastcomposer/train.py \
```

does not change the `default_config.yaml`: `accelerate.state` still shows `num_processes=1`. Only after I run `accelerate config` is it OK.

Maybe I can try `export ACCELERATE_CONFIG_FILE=/path/to/my_accelerate_config.yaml` when running on a single GPU? Does that make sense to you?
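
Something like this is what I have in mind (the file contents below are only my guess at a minimal single-GPU config):

```bash
# Write a minimal single-GPU config once, then point the env var at it.
cat > /path/to/my_accelerate_config.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: "NO"
mixed_precision: bf16
num_machines: 1
num_processes: 1
EOF

export ACCELERATE_CONFIG_FILE=/path/to/my_accelerate_config.yaml
```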

JoyHuYY1412 avatar Feb 26 '24 17:02 JoyHuYY1412

If you don't have a config file and just pass in `--multi_gpu`, it will work just fine.

You can also pass in `--num_processes {x}`, which will help.

To point to a config file, you can do `accelerate launch --config_file {ENV_VAR}`, which would be the easiest solution here: all of your configs can be stored there, and you can keep the config files on a shared filesystem that the server can reach (or that you can grab them from, etc.). This is what I tend to do.
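
For example (the shared path and filenames here are hypothetical):

```bash
# Keep one accelerate config per GPU count on a shared filesystem.
export ACCELERATE_CONFIG_FILE=/shared/accelerate_configs/multi_gpu_4.yaml
# export ACCELERATE_CONFIG_FILE=/shared/accelerate_configs/single_gpu.yaml  # for debugging

accelerate launch \
  --config_file $ACCELERATE_CONFIG_FILE \
  --mixed_precision=bf16 \
  fastcomposer/train.py
```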

muellerzr avatar Feb 26 '24 19:02 muellerzr


Thank you so much!

JoyHuYY1412 avatar Feb 26 '24 22:02 JoyHuYY1412

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 28 '24 15:03 github-actions[bot]