qlora
Multi-GPU Training
Directly running qlora.py on a machine with multiple GPUs will load the model across the GPUs, but training is conducted on only a single GPU at a time. The training batch size equals per_device_train_batch_size. Has anyone succeeded with multi-GPU training?
Hi, I have the same problem. In my case, during the training stage there is an error saying that it expected all tensors to be on the same device.
Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike the name suggests. Let me know if you still have questions.
One thing to note is that Accelerate does not use all GPUs optimally here. It slices the model across the GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU is in use at a time.
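A pure-Python sketch of what this kind of layer slicing conceptually does: layers are assigned greedily to GPUs until each memory budget fills up. This is an illustration only, not Accelerate's actual `device_map="auto"` algorithm, and the layer sizes and budgets are made-up numbers.

```python
def shard_layers(layer_sizes_gb, gpu_budget_gb):
    """Greedily assign consecutive layers to GPUs until each budget is full.

    Returns a device_map-like dict {layer_index: gpu_index}.
    Illustration only, not Accelerate's real algorithm.
    """
    device_map = {}
    gpu, used = 0, 0.0
    for i, size in enumerate(layer_sizes_gb):
        if used + size > gpu_budget_gb and used > 0:
            gpu, used = gpu + 1, 0.0  # spill over to the next GPU
        device_map[i] = gpu
        used += size
    return device_map

# Eight 3 GB layers with a 10 GB budget per GPU end up split across 3 GPUs
print(shard_layers([3.0] * 8, 10.0))
```

Because a forward pass walks the layers in order, a model sharded this way activates only one GPU at a time, which is exactly the behavior described above.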
Still got the error saying that it expected all tensors to be on the same device: cuda:0 and cuda:1.
> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike the name suggests. One thing to note is that Accelerate does not use all GPUs optimally: it slices the model across the GPUs, but at any given point during training only one GPU is in use.
Thanks for the reply. On the GPU memory side, naively multiplying per_device_train_batch_size by the number of GPUs works for me. It looks like Accelerate places the tensors generated during LoRA fine-tuning on all GPUs equally, the same as the pre-trained model. On the training speed side, Accelerate seems to run multi-GPU training sequentially, which makes training slow.
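For reference, under true data parallelism the effective (global) batch size scales with the GPU count. A hedged rule-of-thumb sketch (not code taken from qlora.py):

```python
def effective_batch_size(per_device_batch, n_gpus, grad_accum_steps=1):
    """Effective global batch size under data parallelism: each of the
    n_gpus processes its own per-device batch on every optimizer step,
    optionally accumulated over several micro-steps."""
    return per_device_batch * n_gpus * grad_accum_steps

print(effective_batch_size(4, 2, grad_accum_steps=4))  # -> 32
```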
Hi all,
I was trying to fit the model onto two RTX 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".
It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?
> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike the name suggests. One thing to note is that Accelerate does not use all GPUs optimally: it slices the model across the GPUs, but at any given point during training only one GPU is in use.
This is kind of puzzling to me, given how Accelerate is being used with qlora. If it distributes different layers to different GPUs, then I would expect the training batches to be pipelined so that all GPUs work at the same time on different stages of different batches.
Is this a design choice that needs to be revisited? I have the training process running on 4 Titan X cards, and indeed I can see that only one is doing something at a time. A batch has to go through the entire pipeline before the next batch enters.
It makes no sense to me; it runs four times slower than it could with respect to throughput.
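The throughput complaint can be made concrete with a back-of-the-envelope utilization model. This is an idealized sketch (real pipeline schedules such as GPipe differ in detail):

```python
def naive_utilization(n_gpus):
    """Naive layer sharding: only one of n_gpus is ever busy."""
    return 1.0 / n_gpus

def pipelined_utilization(n_gpus, n_microbatches):
    """Idealized pipeline: n_microbatches flow through n_gpus stages.
    The run takes n_microbatches + n_gpus - 1 time slots, of which each
    stage does useful work for n_microbatches slots."""
    return n_microbatches / (n_microbatches + n_gpus - 1)

print(naive_utilization(4))          # -> 0.25, the "4 times slower" above
print(pipelined_utilization(4, 16))  # -> ~0.84 with 16 micro-batches
```

So on 4 GPUs naive sharding caps each GPU at 25% duty cycle, while even a simple pipeline over 16 micro-batches would push utilization above 80%.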
> Still got the error saying that it expected all tensors to be on the same device: cuda:0 and cuda:1.

I am getting the same. Any idea how to resolve the issue?
> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Anyone met the same issue?

How do you solve this?

Set `ddp_find_unused_parameters=False`.
> I was trying to fit the model onto two RTX 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but got "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices." It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices.
same error
> I was trying to fit the model onto two RTX 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but got "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices." It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices.
Same error. @artidoro, can you help us solve this problem? Thanks.
I have solved the problem: set `ddp_find_unused_parameters=False`.
You can learn more from this code: https://github.com/yangjianxin1/Firefly/blob/master/train_qlora.py#L104
I have 4 GPUs; why are only one or two of them ever computing?
Repost from this thread: Multi-gpu training example?
Finally, it works. Now it utilizes all GPUs.
```shell
!pip install bitsandbytes==0.41.1
!pip install transformers==4.31.0
!pip install peft==0.4.0
!pip install accelerate==0.21.0 einops==0.6.1 evaluate==0.4.0 scikit-learn==1.2.2 sentencepiece==0.1.99
```
Change in qlora.py:

```python
# before
device_map = 'auto'
# after
device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
```
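That replacement line can be wrapped in a small helper for clarity. This is a sketch; `LOCAL_RANK` is the environment variable that `accelerate launch`/torchrun set per worker process, and the function name is mine, not qlora.py's:

```python
import os

def ddp_device_map():
    """Place the whole model (the empty-string key matches every module)
    on this process's own GPU, so each DDP worker holds a full replica
    instead of slicing one model across all GPUs."""
    local_rank = int(os.environ.get("LOCAL_RANK") or 0)
    return {"": f"cuda:{local_rank}"}

# The worker launched with LOCAL_RANK=1 loads its replica onto cuda:1
```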
```shell
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=True --ddp_find_unused_parameters=False
```
Tested in a Runpod environment with Python 3.10 and Torch 2.0.0+cu117.
With gradient_checkpointing=True, training is a little slower, but VRAM usage is spread across the GPUs. For example, if the run needs 20 GB of VRAM on one GPU, it needs 20/2 = 10 GB per GPU on two GPUs and 20/3 ≈ 6.67 GB per GPU on three GPUs.
Got 15 seconds/iteration.
Compared to:
```shell
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=False
```
With gradient_checkpointing=False, training is faster, but it consumes more GPU VRAM in total. For example, if one GPU needs 20 GB of VRAM, two GPUs need 20×2 = 40 GB in total and three GPUs need 20×3 = 60 GB in total.
Got 10 seconds/iteration, but GPU memory consumption is multiplied by the number of GPUs.
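The two memory behaviors reported above can be captured in one small sketch. The 20 GB figure is just the example number from this post, not a measured value:

```python
def vram_per_gpu_sharded(model_gb, n_gpus):
    """One model replica sliced across GPUs: per-GPU memory divides."""
    return model_gb / n_gpus

def vram_total_replicated(model_gb, n_gpus):
    """Data parallel, one full replica per GPU: total memory multiplies."""
    return model_gb * n_gpus

print(vram_per_gpu_sharded(20, 3))   # sharded: ~6.67 GB per GPU
print(vram_total_replicated(20, 3))  # replicated: 60 GB total
```

In other words, the gradient_checkpointing=True run behaves like the sharded case (memory per GPU shrinks), while the faster run pays for its speed with one full replica's worth of VRAM on every GPU.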
Compared to the vanilla (original) command:
```shell
!python3.10 qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True
```
Got 55 seconds/iteration, so it is very slow compared to the previous methods.
@ichsan2895, how does this parallelization work exactly? Does each GPU compute a different part of the process? And am I right that each GPU's memory is independent of the rest for that run (it is not the sum of all the memories)?
It seems to be data parallel, judging from this code: https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L341-L342
But I'm sorry, I don't understand the details of how the parallelization works. My focus was only on making the code work with multiple GPUs :+1:
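The check in the linked lines boils down to reading the environment variables that a distributed launcher sets. A hedged sketch of the idea (`WORLD_SIZE` and `LOCAL_RANK` are the standard torchrun/`accelerate launch` variables; the function name is mine):

```python
import os

def distributed_state():
    """Return (world_size, local_rank) from the env vars set by
    `accelerate launch`/torchrun; (1, 0) means a single-process run."""
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return world_size, local_rank
```

Under `accelerate launch` with 4 GPUs, each of the four worker processes sees WORLD_SIZE=4 and its own LOCAL_RANK in 0..3, which is how the script knows to pin its replica to one GPU.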
@ichsan2895, can we use only 3 of the 4 GPUs in our machine for data parallelism?