
for rlhf_accelerate branch, can't run with multiGPU

Open balcklive opened this issue 1 year ago • 2 comments

My ~/.cache/huggingface/accelerate/default_config.yaml is:

    compute_environment: LOCAL_MACHINE
    deepspeed_config: {}
    distributed_type: MULTI_GPU
    downcast_bf16: 'no'
    dynamo_config: {}
    fsdp_config: {}
    gpu_ids: all
    machine_rank: 0
    main_training_function: main
    megatron_lm_config: {}
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 4
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false

My command to start training is:

    accelerate launch --multi_gpu /home/ubuntu/ubuntu/artifacts/main.py /home/ubuntu/ubuntu/artifacts/config/config.yaml --type REWARD

And then I got this: [screenshot 1679563172948]

It seems that all the processes are using the same GPU. How did that happen? With the old version of the main branch, I did succeed in running with multiple GPUs using this command.
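One way to narrow down a symptom like this is to have each launched process report which GPU it believes it owns. The following is only a minimal sketch, not part of the repository: the helper names `describe_process` and `current_cuda_device` are hypothetical, and it assumes the launcher exports the torchrun-style variables `LOCAL_RANK`, `RANK` and `WORLD_SIZE` and that each local rank is meant to be pinned to its own GPU.

```python
import os


def describe_process(env=None):
    """Summarize the distributed environment variables seen by this process.

    Assumption: the launcher exports LOCAL_RANK, RANK and WORLD_SIZE.
    If they are missing, the script was probably started without a launcher.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", -1))
    return {
        "rank": int(env.get("RANK", -1)),
        "local_rank": local_rank,
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "visible_devices": env.get("CUDA_VISIBLE_DEVICES", "<unset>"),
        # Device this process would be expected to use if every local rank
        # is pinned to its own GPU (an assumption of this sketch).
        "expected_device": f"cuda:{local_rank}" if local_rank >= 0 else "cuda:0",
    }


def current_cuda_device():
    """Return the active CUDA device index, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.current_device()


if __name__ == "__main__":
    info = describe_process()
    info["actual_device_index"] = current_cuda_device()
    print(info)
```

If a snippet like this were called near the start of training, processes with different `local_rank` values but the same `actual_device_index` would suggest that the training code is not selecting a device per rank.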

balcklive avatar Mar 23 '23 09:03 balcklive

Hi @balcklive, thanks for reaching out! We are currently working on the matter and will get back to you as soon as we have a fix!

PierpaoloSorbellini avatar Mar 30 '23 07:03 PierpaoloSorbellini

Hi @balcklive, the new PR #306 should have fixed this problem! Remember to start the training using deepspeed or accelerate launch instead of python (more on this in the readme of the linked PR), and enable one of them in the config.yaml. Sorry for the delay.

PierpaoloSorbellini avatar Apr 03 '23 14:04 PierpaoloSorbellini
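
The advice in the last reply, to start training through a launcher rather than plain python, could be enforced with a small guard at the top of a training script. This is a hedged sketch only: `launched_world_size` and `check_launch` are hypothetical helpers, not functions from the repository or from accelerate, and the sketch assumes the launcher sets the `WORLD_SIZE` environment variable.

```python
import os


def launched_world_size(env=None):
    """Number of processes the launcher reports, defaulting to 1.

    Assumption: a distributed launcher exports WORLD_SIZE; a plain
    `python main.py` start usually leaves it unset.
    """
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", 1))


def check_launch(expected_processes, env=None):
    """Raise if fewer processes were launched than the config expects."""
    found = launched_world_size(env)
    if found < expected_processes:
        raise RuntimeError(
            f"expected {expected_processes} processes but found {found}; "
            "was the script started with a distributed launcher?"
        )
    return found
```

For the configuration shown in the issue, `expected_processes` would be 4, matching `num_processes: 4`.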