transformer_nuggets
[WIP] full finetune / qlora + ac/offload/optm in bwd
Why compose FSDP with NF4Tensor?
QLoRA: the number of trainable parameters is reduced from xxx to xxx, and parameter size is reduced by xx. Full finetuning of the original Llama with 4-bit quantized params:
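As a rough illustration of where that reduction comes from (the dimensions and rank below are illustrative stand-ins, not the elided numbers above), freezing the base FFN weight and training only small low-rank adapters shrinks the trainable-parameter count by orders of magnitude; in QLoRA the frozen base weight is additionally stored as a 4-bit NF4Tensor:

```python
# Sketch only, not this repo's code: freeze a base FFN projection and add LoRA
# adapters, then compare trainable-parameter counts. In the QLoRA setup the frozen
# base weight would additionally be quantized to 4-bit NF4.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Frozen base weight (stored as an NF4Tensor in QLoRA).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank adapters.
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))

def count_params(m: nn.Module):
    total = sum(p.numel() for p in m.parameters())
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    return total, trainable

# One Llama-7B-style FFN projection: 4096 -> 11008.
full = nn.Linear(4096, 11008, bias=False)   # full finetune: all ~45M params trainable
qlora = LoRALinear(4096, 11008, rank=8)     # QLoRA: only the adapters trainable

print(count_params(full))   # (~45.1M, ~45.1M)
print(count_params(qlora))  # (~45.2M, ~0.12M) -> rank * (4096 + 11008) trainable
```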
7B + QLoRA on FFNs: memory usage is summarized below with bf16, AdamW, activation checkpointing (AC), and CPU offloading; a configuration sketch follows the list.
- sharding NF4Tensor in FSDP: NF4Tensors are the 4-bit quantized weights from QLoRA
- CPU offloading NF4Tensor in FSDP: the most profitable memory saving
- optimizer in the backward, 8-bit optimizer: matters little for QLoRA because the trainable parameters (and hence their gradients and optimizer states) are tiny; should be prioritized for full finetuning instead
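Below is a configuration sketch (not the exact recipe in this PR) of how these pieces might compose with the FSDP1-style APIs: activation checkpointing on the transformer blocks, parameter sharding plus CPU offloading of the mostly frozen NF4-quantized parameters, and bf16 mixed precision. `build_qlora_model` and the `TransformerBlock` class name are hypothetical placeholders; adapt them to the real model.

```python
# Sketch: wrap a QLoRA model with FSDP so the frozen NF4 weights are sharded and
# CPU-offloaded, apply activation checkpointing, and pass only the trainable LoRA
# params to the optimizer. Assumes launch via torchrun so the process group env
# vars are set.
import os

import torch
import torch.distributed as dist
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
)

def shard_for_qlora(model: torch.nn.Module) -> FSDP:
    # Activation checkpointing on every transformer block. The check_fn is a
    # stand-in; match it to the actual block class in the model.
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: m.__class__.__name__ == "TransformerBlock",
    )
    return FSDP(
        model,
        # Shards both the frozen NF4-quantized weights and the LoRA params,
        # and keeps sharded params on CPU between uses (biggest win above).
        cpu_offload=CPUOffload(offload_params=True),
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
        # Lets FSDP handle a mix of frozen NF4 weights and trainable LoRA
        # params (non-uniform requires_grad within a flat parameter).
        use_orig_params=True,
        device_id=torch.cuda.current_device(),
    )

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    # `build_qlora_model` is a hypothetical helper that returns the model with
    # FFN linears swapped for NF4-backed LoRA layers.
    model = shard_for_qlora(build_qlora_model())
    # Only the LoRA params are trainable, so an 8-bit optimizer or fusing the
    # optimizer step into the backward buys little here; those matter more for
    # full finetuning, where gradients/optimizer states match the model size.
    optim = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=2e-4
    )
```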
Should we call out that this table assumes that we are only applying QLoRA to the FFNs?
Closing this since QLoRA + FSDP2 with CPU offloading has landed in torchtune: https://github.com/pytorch/torchtune/pull/909