
[WIP] full finetune / qlora + ac/offload/optm in bwd

Open · weifengpy opened this issue 1 year ago · 1 comment

Why compose FSDP with NF4Tensor

QLoRA: the number of trainable parameters is reduced from xxx to xxx, and parameter size is reduced by xx. Full finetuning of the original Llama with 4-bit quantized params:

7B + QLoRA on FFNs: memory usage is summarized below with bf16, AdamW, activation checkpointing (AC), and CPU offloading.
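To make the setup concrete, here is a minimal sketch (not the transformer_nuggets or torchtune implementation) of a QLoRA-style FFN: the base projection is frozen and only the low-rank adapters are trainable. In QLoRA the frozen weight would be stored as an NF4Tensor (the 4-bit quantized tensor subclass this issue is about); a plain frozen bf16 weight is used here to keep the example self-contained. The `LoRALinear` name and the rank/alpha values are placeholders; the dimensions match a Llama-7B FFN.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base projection plus trainable low-rank adapters (LoRA)."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen base projection: in QLoRA this weight is 4-bit quantized (NF4Tensor).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank adapters: the only parameters that receive gradients.
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapters start as a no-op update
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


# Only the FFN (gate/up/down) projections of a Llama-7B-style block get LoRA.
ffn = nn.ModuleDict({
    "gate_proj": LoRALinear(4096, 11008),
    "up_proj": LoRALinear(4096, 11008),
    "down_proj": LoRALinear(11008, 4096),
})
trainable = sum(p.numel() for p in ffn.parameters() if p.requires_grad)
total = sum(p.numel() for p in ffn.parameters())
print(f"trainable params: {trainable} / {total}")
```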

  • sharding NF4Tensor in FSDP: NF4Tensors are the 4-bit quantized weights from QLoRA
  • CPU offloading NF4Tensor in FSDP: the most profitable lever
  • optimizer step in the backward, 8-bit optimizer: matters little for QLoRA because the trainable parameters are tiny; should be prioritized for full finetuning (see the sketch after the screenshots below)
[Screenshots: memory-usage tables, 2024-02-26 and 2024-02-27]
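Below is a minimal sketch of the sharding/offloading and optimizer-in-backward pieces, again not the implementation that later landed in torchtune: it uses the FSDP1 wrapper API with `CPUOffload` and the public `register_post_accumulate_grad_hook`, and a toy model stands in for the QLoRA model above (in the real setup the sharded/offloaded frozen weights would be NF4Tensors, so the CPU↔GPU traffic is 4-bit). How the backward hooks interact with FSDP's reduce-scatter is exactly what this issue is exploring, so treat this as an illustration of the ingredients only.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

# Assumes a distributed launch, e.g. `torchrun --nproc_per_node=N script.py`.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in for the QLoRA model sketched above.
model = nn.Sequential(
    nn.Linear(4096, 11008, bias=False),
    nn.Linear(11008, 4096, bias=False),
).cuda()

# Shard parameters across ranks and keep them on CPU between uses.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True), use_orig_params=True)

# Optimizer-in-backward: step each trainable parameter as soon as its gradient
# has been accumulated, then free the gradient immediately. With QLoRA the
# trainable (LoRA) parameters are tiny, so this mainly pays off in full finetuning.
optim_per_param = {
    p: torch.optim.AdamW([p], lr=2e-5, foreach=False)
    for p in model.parameters()
    if p.requires_grad
}

def step_in_backward(param: torch.Tensor) -> None:
    optim_per_param[param].step()
    optim_per_param[param].zero_grad(set_to_none=True)

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(step_in_backward)
```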

weifengpy · Feb 26 '24 20:02

Should we call out that this table assumes that we are only applying QLoRA to the FFNs?

awgu · Feb 26 '24 22:02

Closing this since QLoRA + FSDP2 and CPU offloading have landed in torchtune: https://github.com/pytorch/torchtune/pull/909

weifengpy · Jul 15 '24 22:07