qlora
Multi-GPU Training
Directly running qlora.py on a machine with multiple GPUs will load the model across the GPUs, but training is conducted on only a single GPU at a time. The training batch size equals per_device_train_batch_size. Has anyone succeeded with multi-GPU training?
Hi, I have the same problem. In my case, during the training stage there is an error saying that it expected all tensors to be on the same device.
Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike the name suggests. Let me know if you still have questions.
One thing to note is that Accelerate does not use all GPUs optimally here. It slices the model across the GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU is in use at a time.
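A pure-Python sketch of what this kind of layer slicing conceptually does: layers are assigned greedily to GPUs until each memory budget fills up. This is an illustration only, not Accelerate's actual `device_map="auto"` algorithm, and the layer sizes and budgets are made-up numbers.

```python
def shard_layers(layer_sizes_gb, gpu_budget_gb):
    """Greedily assign consecutive layers to GPUs until each budget is full.

    Returns a device_map-like dict {layer_index: gpu_index}.
    Illustration only, not Accelerate's real algorithm.
    """
    device_map = {}
    gpu, used = 0, 0.0
    for i, size in enumerate(layer_sizes_gb):
        if used + size > gpu_budget_gb and used > 0:
            gpu, used = gpu + 1, 0.0  # spill over to the next GPU
        device_map[i] = gpu
        used += size
    return device_map

# Eight 3 GB layers with a 10 GB budget per GPU end up split across 3 GPUs
print(shard_layers([3.0] * 8, 10.0))
```

Because a forward pass walks the layers in order, a model sharded this way activates only one GPU at a time, which is exactly the behavior described above.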
Still got the error saying that it expected all tensors to be on the same device: cuda:0 and cuda:1.
> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike the name suggests. One thing to note is that Accelerate does not use all GPUs optimally: it slices the model across the GPUs, but at any given point during training only one GPU is in use.
Thanks for the reply. On the GPU memory side, naively multiplying per_device_train_batch_size by the number of GPUs works for me. It looks like Accelerate places the tensors generated during LoRA fine-tuning on all GPUs equally, the same as the pre-trained model. On the training speed side, Accelerate seems to run multi-GPU training sequentially, which makes training slow.
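For reference, under true data parallelism the effective (global) batch size scales with the GPU count. A hedged rule-of-thumb sketch (not code taken from qlora.py):

```python
def effective_batch_size(per_device_batch, n_gpus, grad_accum_steps=1):
    """Effective global batch size under data parallelism: each of the
    n_gpus processes its own per-device batch on every optimizer step,
    optionally accumulated over several micro-steps."""
    return per_device_batch * n_gpus * grad_accum_steps

print(effective_batch_size(4, 2, grad_accum_steps=4))  # -> 32
```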
Hi all,
I was trying to fit the model onto two RTX 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".
It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?
> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike the name suggests. One thing to note is that Accelerate does not use all GPUs optimally: it slices the model across the GPUs, but at any given point during training only one GPU is in use.
This is kind of puzzling to me, given how Accelerate is being used with qlora. If it distributes different layers to different GPUs, then I would expect the training batches to be pipelined so that all GPUs work at the same time on different stages of different batches.
Is this a design choice that needs to be revisited? I have the training process running on 4 Titan X cards, and indeed I can see that only one is doing something at a time. A batch has to go through the entire pipeline before the next batch enters.
It makes no sense to me; it runs four times slower than it could with respect to throughput.
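The throughput complaint can be made concrete with a back-of-the-envelope utilization model. This is an idealized sketch (real pipeline schedules such as GPipe differ in detail):

```python
def naive_utilization(n_gpus):
    """Naive layer sharding: only one of n_gpus is ever busy."""
    return 1.0 / n_gpus

def pipelined_utilization(n_gpus, n_microbatches):
    """Idealized pipeline: n_microbatches flow through n_gpus stages.
    The run takes n_microbatches + n_gpus - 1 time slots, of which each
    stage does useful work for n_microbatches slots."""
    return n_microbatches / (n_microbatches + n_gpus - 1)

print(naive_utilization(4))          # -> 0.25, the "4 times slower" above
print(pipelined_utilization(4, 16))  # -> ~0.84 with 16 micro-batches
```

So on 4 GPUs naive sharding caps each GPU at 25% duty cycle, while even a simple pipeline over 16 micro-batches would push utilization above 80%.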
> Still got the error saying that it expected all tensors to be on the same device: cuda:0 and cuda:1.

I am getting the same. Any idea how to resolve the issue?
> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Anyone met the same issue?

How do you solve this?

Set `ddp_find_unused_parameters=False`.
> I was trying to fit the model onto two RTX 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but got "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices." It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices.
same error
> I was trying to fit the model onto two RTX 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but got "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices." It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices.
Same error. @artidoro, can you help us solve this problem? Thanks.
I have solved the problem: set `ddp_find_unused_parameters=False`.
You can learn more from this code: https://github.com/yangjianxin1/Firefly/blob/master/train_qlora.py#L104
I have 4 GPUs; why are only one or two of them ever computing?
Repost from this thread: Multi-gpu training example?
Finally, it works. Now it utilizes all GPUs.
```shell
!pip install bitsandbytes==0.41.1
!pip install transformers==4.31.0
!pip install peft==0.4.0
!pip install accelerate==0.21.0 einops==0.6.1 evaluate==0.4.0 scikit-learn==1.2.2 sentencepiece==0.1.99
```
Change in qlora.py:

```python
# before
device_map = 'auto'
# after
device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
```
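That replacement line can be wrapped in a small helper for clarity. This is a sketch; `LOCAL_RANK` is the environment variable that `accelerate launch`/torchrun set per worker process, and the function name is mine, not qlora.py's:

```python
import os

def ddp_device_map():
    """Place the whole model (the empty-string key matches every module)
    on this process's own GPU, so each DDP worker holds a full replica
    instead of slicing one model across all GPUs."""
    local_rank = int(os.environ.get("LOCAL_RANK") or 0)
    return {"": f"cuda:{local_rank}"}

# The worker launched with LOCAL_RANK=1 loads its replica onto cuda:1
```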
```shell
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=True --ddp_find_unused_parameters=False
```
Tested in a Runpod environment with Python 3.10 and Torch 2.0.0+cu117.
With gradient_checkpointing=True, training is a little slower, but VRAM usage is spread across the GPUs. For example, if the run needs 20 GB of VRAM on one GPU, it needs 20/2 = 10 GB per GPU on two GPUs and 20/3 ≈ 6.67 GB per GPU on three GPUs.
Got 15 seconds/iteration.
Compared to:
```shell
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=False
```
With gradient_checkpointing=False, training is faster, but it consumes more GPU VRAM in total. For example, if one GPU needs 20 GB of VRAM, two GPUs need 20×2 = 40 GB in total and three GPUs need 20×3 = 60 GB in total.
Got 10 seconds/iteration, but GPU memory consumption is multiplied by the number of GPUs.
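The two memory behaviors reported above can be captured in one small sketch. The 20 GB figure is just the example number from this post, not a measured value:

```python
def vram_per_gpu_sharded(model_gb, n_gpus):
    """One model replica sliced across GPUs: per-GPU memory divides."""
    return model_gb / n_gpus

def vram_total_replicated(model_gb, n_gpus):
    """Data parallel, one full replica per GPU: total memory multiplies."""
    return model_gb * n_gpus

print(vram_per_gpu_sharded(20, 3))   # sharded: ~6.67 GB per GPU
print(vram_total_replicated(20, 3))  # replicated: 60 GB total
```

In other words, the gradient_checkpointing=True run behaves like the sharded case (memory per GPU shrinks), while the faster run pays for its speed with one full replica's worth of VRAM on every GPU.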
Compared to the vanilla (original) command:
```shell
!python3.10 qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True
```
Got 55 seconds/iteration, so it is very slow compared to the previous methods.
@ichsan2895, how does this parallelization work exactly? Does each GPU compute a different part of the process? And am I right that each GPU's memory is independent of the rest for that run (it is not the sum of all the memories)?
It seems to be data parallel, judging from this code: https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L341-L342
But I'm sorry, I don't understand the details of how the parallelization works. My focus was only on making the code work with multiple GPUs :+1:
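The check in the linked lines boils down to reading the environment variables that a distributed launcher sets. A hedged sketch of the idea (`WORLD_SIZE` and `LOCAL_RANK` are the standard torchrun/`accelerate launch` variables; the function name is mine):

```python
import os

def distributed_state():
    """Return (world_size, local_rank) from the env vars set by
    `accelerate launch`/torchrun; (1, 0) means a single-process run."""
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return world_size, local_rank
```

Under `accelerate launch` with 4 GPUs, each of the four worker processes sees WORLD_SIZE=4 and its own LOCAL_RANK in 0..3, which is how the script knows to pin its replica to one GPU.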
@ichsan2895, can we use only 3 of the 4 GPUs in our machine for data parallelism?