alpaca-lora
Multi-GPU bug?
Traceback (most recent call last):
File "/workspace/alpaca-lora/finetune.py", line 95, in <module>
trainer.train(resume_from_checkpoint=False)
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1628, in train
return inner_training_loop(
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1895, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2637, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2669, in compute_loss
outputs = model(**inputs)
File "/workspace/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
0%| | 0/1083 [00:00<?, ?it/s]
I'm trying to run this on a 4xA100 instance and getting this error. nvidia-smi shows that something is getting loaded onto both GPU 0 and GPU 1.
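For context, here is a minimal sketch (illustrative only, not code from finetune.py) of why this RuntimeError appears: nn.DataParallel requires every parameter and buffer to sit on device_ids[0], but loading the base model with device_map="auto" spreads its layers across cuda:0 and cuda:1, so the Trainer's DataParallel fallback trips over the weights on cuda:1.

import torch
import torch.nn as nn

# Two layers deliberately placed on different GPUs, mimicking device_map="auto"
model = nn.Sequential(
    nn.Linear(8, 8).to("cuda:0"),
    nn.Linear(8, 8).to("cuda:1"),
)
dp = nn.DataParallel(model, device_ids=[0, 1])
# Raises: "module must have its parameters and buffers on device cuda:0 ..."
out = dp(torch.randn(4, 8, device="cuda:0"))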
This code doesn't work on multi-GPU yet; I'm still running it on my single RTX 4090. Might adapt to multi-GPU in a bit to speed up training.
(Also, please git pull if you haven't recently; I fixed a bug in the dataset generation code.)
"Might adapt to multi-GPU in a bit to speed up training." Could you point me to the reason why this isn't working on multiple GPUs? I.e., which part is breaking it: the LoRA stuff, the 8-bit stuff, or something else? Thanks!
The PEFT code needs to be adapted to make better use of accelerate. I think there are some examples of how to do it in the huggingface/peft repo but I can't test them as I don't have a multi-GPU setup myself.
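A hedged sketch of what that adaptation could look like (the model name and flags are assumptions, not the repo's actual code): when the script is launched as a distributed job, each process loads its own full copy of the model onto its own GPU instead of sharding one copy with device_map="auto".

import os
from transformers import LlamaForCausalLM

# WORLD_SIZE is set by accelerate/torchrun; treat its presence as "running DDP"
world_size = int(os.environ.get("WORLD_SIZE", 1))
device_map = "auto"
if world_size != 1:
    # One full model replica per process, pinned to that process's local GPU
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # base model name is an assumption
    load_in_8bit=True,
    device_map=device_map,
)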
That would be neat. Kaggle already offers 2xT4 with 2x16 GB of VRAM, which would probably be quite slow but perhaps enough to train the 13B model.
this may work for you: https://discord.com/channels/1086739839761776660/1087706061022187641/1087944493564698674
This link seems to be a private link. What is it about?
I use the following command to run on multi-GPU:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py
But actually only GPUs 0 and 1 are used. I don't know why.
I use only one GPU; I add some code at the beginning of the script to assign a single GPU to torch:

import os
import torch

# Expose only physical GPU 7; it then shows up to torch as cuda:0.
# This must run before anything initializes CUDA, or it has no effect.
gpu_list = [7]
gpu_list_str = ','.join(map(str, gpu_list))
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Then just run accelerate launch finetune.py. It works.
"I use the following command to run on multi-GPU: CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py. But actually only GPUs 0 and 1 are used. I don't know why."
Try enabling all the GPUs by running accelerate config.
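Alternatively, you can pass the process count on the command line instead of relying on a saved config. A sketch (flags from the accelerate CLI; adjust the count to your setup):
accelerate launch --multi_gpu --num_processes 4 finetune.py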
FWIW, the solution for me was to not use torchrun to launch the script. I was having an issue where single-GPU training worked fine, but with multi-GPU training, after a single update step the model would freeze: GPU utilization sat at 100% but no more updates happened. Getting rid of torchrun and simply calling the Python script solved it and seems to use DDP fine.
It would be great to have more guidance on what kinds of setups work for launching multi-GPU jobs. I'd be happy to contribute information about my setup for that as well :)
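Related: if the freeze happens under DDP, one knob worth checking is the Trainer's unused-parameter search. A hedged sketch (not the repo's exact code; output_dir and batch size are placeholder values):

import os
from transformers import TrainingArguments

# Detect a distributed launch via WORLD_SIZE; disabling DDP's unused-parameter
# search is one thing people try when multi-GPU training stalls, since LoRA
# leaves most of the base model's parameters frozen.
ddp = int(os.environ.get("WORLD_SIZE", 1)) != 1
training_args = TrainingArguments(
    output_dir="lora-alpaca",
    per_device_train_batch_size=4,
    ddp_find_unused_parameters=False if ddp else None,
)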
Are you able to do multi-GPU inference?