nebuly
for rlhf_accelerate branch, can't run with multi-GPU
my `~/.cache/huggingface/accelerate/default_config.yaml` is:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
my start training command is:

```shell
accelerate launch --multi_gpu /home/ubuntu/ubuntu/artifacts/main.py /home/ubuntu/ubuntu/artifacts/config/config.yaml --type REWARD
```
and then I got this:
It seems that all the processes are using the same GPU. How did that happen? With the old version of the main branch, I did succeed in running with multi-GPU using this command.
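One way to narrow this down (a minimal sketch, not part of the repo) is to log, at the top of the training script, which device each worker believes it owns. `accelerate launch` sets a `LOCAL_RANK` environment variable per process, and with `gpu_ids: all` each local rank is expected to map to the CUDA device of the same index:

```python
import os

def assigned_gpu() -> str:
    """Describe the GPU this process is expected to use under
    `accelerate launch`.

    `accelerate` sets LOCAL_RANK for each spawned process; with
    `gpu_ids: all` local rank N should map to cuda:N. If every
    process reports cuda:0, the devices are not being assigned
    per rank, which matches the bug described above.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "all")
    return f"local_rank={local_rank} -> cuda:{local_rank} (visible: {visible})"

print(assigned_gpu())
```

If all four workers print `cuda:0` here, the launcher is spawning processes without distinct ranks, rather than the script itself picking the wrong device.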
Hi @balcklive, thanks for reaching out! We are currently working on the matter and will get back to you as soon as we have a fix!
Hi @balcklive, the new PR #306 should have fixed this problem! Remember to start the training using `deepspeed` or `accelerate launch` instead of `python` (more on this in the readme of the linked PR), and to enable one of them in the config.yaml. Sorry for the delay!
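For reference, the two launch styles mentioned above would look roughly like this with the paths from the original report (a sketch of standard `accelerate`/`deepspeed` CLI usage, not copied from the PR's readme; check the linked PR for the exact invocation):

```shell
# Via accelerate: picks up ~/.cache/huggingface/accelerate/default_config.yaml
accelerate launch /home/ubuntu/ubuntu/artifacts/main.py \
    /home/ubuntu/ubuntu/artifacts/config/config.yaml --type REWARD

# Or via the deepspeed launcher
deepspeed /home/ubuntu/ubuntu/artifacts/main.py \
    /home/ubuntu/ubuntu/artifacts/config/config.yaml --type REWARD
```

Either way, the key point from the PR is that plain `python main.py` spawns a single process and therefore cannot distribute across GPUs.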