alpaca-lora
Multi-GPU bug?
Traceback (most recent call last):
File "/workspace/alpaca-lora/finetune.py", line 95, in <module>
trainer.train(resume_from_checkpoint=False)
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1628, in train
return inner_training_loop(
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1895, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2637, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2669, in compute_loss
outputs = model(**inputs)
File "/workspace/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
0%| | 0/1083 [00:00<?, ?it/s]
I'm trying to run this on a 4xA100 instance and getting this error. nvidia-smi shows that something is getting loaded onto both GPU 0 and GPU 1.
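For context, here is a minimal sketch (illustrative only, not code from finetune.py) of why this RuntimeError appears: nn.DataParallel requires every parameter and buffer to sit on device_ids[0], but loading the base model with device_map="auto" spreads its layers across cuda:0 and cuda:1, so the Trainer's DataParallel fallback trips over the weights on cuda:1.

import torch
import torch.nn as nn

# Two layers deliberately placed on different GPUs, mimicking device_map="auto"
model = nn.Sequential(
    nn.Linear(8, 8).to("cuda:0"),
    nn.Linear(8, 8).to("cuda:1"),
)
dp = nn.DataParallel(model, device_ids=[0, 1])
# Raises: "module must have its parameters and buffers on device cuda:0 ..."
out = dp(torch.randn(4, 8, device="cuda:0"))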
This code doesn't work on multi-GPU yet; I'm still running it on my single RTX 4090. Might adapt to multi-GPU in a bit to speed up training.
(Also, please git pull if you haven't recently; I fixed a bug in the dataset generation code.)
"Might adapt to multi-GPU in a bit to speed up training." Could you point me to the reason why this isn't working on multiple GPUs? I.e., which part is breaking it: the LoRA stuff, the 8-bit stuff, or something else? Thanks!
The PEFT code needs to be adapted to make better use of accelerate. I think there are some examples of how to do it in the huggingface/peft repo but I can't test them as I don't have a multi-GPU setup myself.
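A hedged sketch of what that adaptation could look like (the model name and flags are assumptions, not the repo's actual code): when the script is launched as a distributed job, each process loads its own full copy of the model onto its own GPU instead of sharding one copy with device_map="auto".

import os
from transformers import LlamaForCausalLM

# WORLD_SIZE is set by accelerate/torchrun; treat its presence as "running DDP"
world_size = int(os.environ.get("WORLD_SIZE", 1))
device_map = "auto"
if world_size != 1:
    # One full model replica per process, pinned to that process's local GPU
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # base model name is an assumption
    load_in_8bit=True,
    device_map=device_map,
)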
That would be neat. Kaggle already offers 2xT4 with 2x16 GB of VRAM, which would probably be quite slow but perhaps enough to train the 13B model.
this may work for you: https://discord.com/channels/1086739839761776660/1087706061022187641/1087944493564698674
This link seems to be a private link. What is it about?
I use the following command to run on multi-GPU:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py
But actually only GPUs 0 and 1 are used. I don't know why.
I use only one GPU; I add some code at the beginning of the script to assign a single GPU to torch:

import os
import torch

# Expose only physical GPU 7; it then shows up to torch as cuda:0.
# This must run before anything initializes CUDA, or it has no effect.
gpu_list = [7]
gpu_list_str = ','.join(map(str, gpu_list))
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Then just run accelerate launch finetune.py. It works.
"I use the following command to run on multi-GPU: CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py. But actually only GPUs 0 and 1 are used. I don't know why."
Try enabling all the GPUs by running accelerate config.
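Alternatively, you can pass the process count on the command line instead of relying on a saved config. A sketch (flags from the accelerate CLI; adjust the count to your setup):
accelerate launch --multi_gpu --num_processes 4 finetune.py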
FWIW, the solution for me was to not use torchrun to launch the script. I was having an issue where single-GPU training worked fine, but with multi-GPU training, after a single update step the model would freeze: GPU utilization sat at 100% but no more updates happened. Getting rid of torchrun and simply calling the Python script solved it and seems to use DDP fine.
It would be great to have more guidance on what kinds of setups work for launching multi-GPU jobs. I'd be happy to contribute information about my setup for that as well :)
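Related: if the freeze happens under DDP, one knob worth checking is the Trainer's unused-parameter search. A hedged sketch (not the repo's exact code; output_dir and batch size are placeholder values):

import os
from transformers import TrainingArguments

# Detect a distributed launch via WORLD_SIZE; disabling DDP's unused-parameter
# search is one thing people try when multi-GPU training stalls, since LoRA
# leaves most of the base model's parameters frozen.
ddp = int(os.environ.get("WORLD_SIZE", 1)) != 1
training_args = TrainingArguments(
    output_dir="lora-alpaca",
    per_device_train_batch_size=4,
    ddp_find_unused_parameters=False if ddp else None,
)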
Are you able to do multi-GPU inference?