
question about training time

Open harborsarah opened this issue 1 year ago • 4 comments

System Info

Dear authors,

I have a question regarding the training time utilizing the peft package. I tried using LoRA with a swin transformer to reduce the parameter size.

from transformers import SwinModel
from peft import LoraConfig, get_peft_model

model = SwinModel.from_pretrained('./swin-large-patch4-window7-224-in22k').cuda()
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],  # apply LoRA to the attention query/value projections
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)
lora_model = get_peft_model(model, config)

And finally, I train lora_model. My question is: in my experiments, training 'model' and training 'lora_model' take almost the same amount of time, even though the number of trainable parameters is reduced from about 200M to 1M. Is that normal, or did I do something wrong?
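For reference, the trainable-parameter reduction can be verified with PEFT's built-in helper (a small sketch reusing the lora_model defined above; exact counts depend on the checkpoint):

# Sketch: confirm how many parameters are actually trainable after wrapping with LoRA.
# Prints "trainable params: ... || all params: ... || trainable%: ...",
# which should show roughly 1M trainable out of roughly 200M total here.
lora_model.print_trainable_parameters()

# Equivalent manual count:
trainable = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")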

Thanks a lot for your reply.

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import SwinModel
from peft import LoraConfig, get_peft_model

model = SwinModel.from_pretrained('./swin-large-patch4-window7-224-in22k').cuda()
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)
lora_model = get_peft_model(model, config)

Expected behavior

Please provide an explanation for this behavior.

harborsarah, Sep 12 '24 07:09

Indeed, using LoRA does not necessarily reduce the training time. On the one hand, there are fewer gradients to compute and fewer parameters to update, which should help; on the other hand, LoRA adds extra computation, which slows training down. Remember, the main goal of LoRA is to reduce memory requirements, not training time. However, given that you need less memory than for full fine-tuning, you should be able to increase the batch size and get some speed gains that way.
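For illustration, here is a rough, untested sketch of how one could time a training step for full fine-tuning versus LoRA, assuming the same checkpoint and a similar LoRA config as in the issue (note that get_peft_model modifies the base model in place, so each run should start from a fresh copy):

import time
import torch
from transformers import SwinModel
from peft import LoraConfig, get_peft_model

def seconds_per_step(m, batch_size=8, steps=20):
    # Crude timing of forward + backward + optimizer step on dummy images.
    m.train()
    opt = torch.optim.AdamW([p for p in m.parameters() if p.requires_grad], lr=1e-4)
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        loss = m(pixel_values=x).last_hidden_state.pow(2).mean()  # dummy loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

full = SwinModel.from_pretrained('./swin-large-patch4-window7-224-in22k').cuda()
print("full fine-tuning, s/step:", seconds_per_step(full))

base = SwinModel.from_pretrained('./swin-large-patch4-window7-224-in22k').cuda()
lora = get_peft_model(base, LoraConfig(r=16, lora_alpha=16, target_modules=["query", "value"],
                                       lora_dropout=0.1, bias="none"))
# Since LoRA needs less memory, the same budget often allows a larger batch size.
print("LoRA, s/step:", seconds_per_step(lora, batch_size=16))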

BenjaminBossan, Sep 12 '24 10:09

LoRA saves memory mainly by shrinking the optimizer states, which is particularly beneficial when using Adam and mixed-precision training.

Consider a model with X parameters trained with Adam in mixed precision:

  • the fp16 weights used in the forward and backward pass take 2X bytes,
  • the fp32 master weights used for the parameter update take 4X bytes,
  • the fp16 gradients take 2X bytes,
  • the Adam optimizer states (fp32 momentum and variance) take 8X bytes.

By disabling full-rank parameter updates, LoRA drastically reduces the master-weight and optimizer-state memory, since only a low-rank subset of parameters is updated. However, gradients still have to be backpropagated through the full-rank base weights.
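A back-of-the-envelope calculation (a sketch that simply follows the byte accounting above, using the roughly 200M total / 1M trainable figures from this issue) makes the difference concrete:

# Rough memory accounting in bytes, following the 2X/4X/2X/8X breakdown above.
GB = 1024 ** 3

def training_memory_gb(total_params, trainable_params):
    fp16_weights = 2 * total_params      # forward/backward copy of all weights
    fp32_master = 4 * trainable_params   # fp32 master copy, only for updated params
    fp16_grads = 2 * total_params        # gradient buffers (full model, per the accounting above)
    adam_states = 8 * trainable_params   # fp32 momentum + variance, only for updated params
    return (fp16_weights + fp32_master + fp16_grads + adam_states) / GB

print(f"full fine-tuning: {training_memory_gb(200e6, 200e6):.2f} GB")  # ~3.0 GB
print(f"LoRA:             {training_memory_gb(200e6, 1e6):.2f} GB")    # ~0.8 GB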

Therefore, while LoRA reduces memory consumption, it does not decrease training time.

hhnqqq, Sep 13 '24 03:09

Therefore, while LoRA reduces memory consumption, it does not decrease training time.

I agree, this is the main goal. However, I find that in practice, LoRA often reduces training time too, although not by a lot. For example, I just ran this LoRA example twice, once exactly as it is and once with full fine-tuning, but all the rest being the same. I found that with LoRA, I get ~10.5 it/s and with full fine-tuning, I get ~6.5 it/s.

I didn't run any profiling to see where the speed advantage comes from, but my guess would be fewer gradients and parameter updates. At the end of the day, whether there is a speed advantage or not will depend on hyper-parameters, model architecture, dataset, etc. so I would not go into LoRA training expecting a speed gain (except if I can increase the batch size), it's more of a nice bonus if it happens.
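If someone does want to profile it, a minimal sketch with torch.profiler (assuming a CUDA lora_model like the one from the issue) could look like this:

import torch
from torch.profiler import profile, ProfilerActivity

# Only the trainable (LoRA + modules_to_save) parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in lora_model.parameters() if p.requires_grad], lr=1e-4
)
pixel_values = torch.randn(8, 3, 224, 224, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = lora_model(pixel_values=pixel_values).last_hidden_state.pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))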

BenjaminBossan, Sep 13 '24 09:09

Actually, LoRA does provide real training-time benefits when distributed training is used. In particular, the most popular strategy, ZeRO, requires communication of optimizer states among ranks, and LoRA substantially reduces that traffic.
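As a rough, assumed-numbers illustration (not a measurement): the per-step gradient/optimizer traffic that ZeRO has to move scales with the number of optimized parameters, so going from roughly 200M trainable parameters to roughly 1M shrinks it accordingly:

# Back-of-the-envelope: communication volume scales with the number of
# parameters that ZeRO shards and synchronizes each step (details vary by stage).
def zero_traffic_gb(trainable_params, bytes_per_element=2):
    return trainable_params * bytes_per_element / 1024**3

print(f"full fine-tuning: ~{zero_traffic_gb(200e6):.2f} GB per step")       # ~0.37 GB
print(f"LoRA:             ~{zero_traffic_gb(1e6) * 1024:.2f} MB per step")  # ~1.9 MB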

hhnqqq, Sep 23 '24 08:09

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot], Oct 17 '24 15:10