transformer_nuggets
[WIP] full finetune / qlora + ac/offload/optm in bwd
Why compose FSDP with NF4Tensor?
QLoRA: the number of trainable parameters is reduced from xxx to xxx, and parameter size is reduced by xx. Full finetuning of the original Llama with 4-bit quantized params:
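As a rough illustration of where that reduction comes from (the dimensions and rank below are illustrative stand-ins, not the elided numbers above), freezing the base FFN weight and training only small low-rank adapters shrinks the trainable-parameter count by orders of magnitude; in QLoRA the frozen base weight is additionally stored as a 4-bit NF4Tensor:

```python
# Sketch only, not this repo's code: freeze a base FFN projection and add LoRA
# adapters, then compare trainable-parameter counts. In the QLoRA setup the frozen
# base weight would additionally be quantized to 4-bit NF4.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Frozen base weight (stored as an NF4Tensor in QLoRA).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank adapters.
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))

def count_params(m: nn.Module):
    total = sum(p.numel() for p in m.parameters())
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    return total, trainable

# One Llama-7B-style FFN projection: 4096 -> 11008.
full = nn.Linear(4096, 11008, bias=False)   # full finetune: all ~45M params trainable
qlora = LoRALinear(4096, 11008, rank=8)     # QLoRA: only the adapters trainable

print(count_params(full))   # (~45.1M, ~45.1M)
print(count_params(qlora))  # (~45.2M, ~0.12M) -> rank * (4096 + 11008) trainable
```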
7B + QLoRA on FFNs: memory usage is summarized below with bf16, AdamW, activation checkpointing (AC), and CPU offloading; a configuration sketch follows the list.
- sharding NF4Tensor in FSDP: NF4Tensors are the 4-bit quantized weights from QLoRA
- CPU offloading NF4Tensor in FSDP: the most profitable memory saving
- optimizer in the backward, 8-bit optimizer: matters little for QLoRA because the trainable parameters (and hence their gradients and optimizer states) are tiny; should be prioritized for full finetuning instead
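Below is a configuration sketch (not the exact recipe in this PR) of how these pieces might compose with the FSDP1-style APIs: activation checkpointing on the transformer blocks, parameter sharding plus CPU offloading of the mostly frozen NF4-quantized parameters, and bf16 mixed precision. `build_qlora_model` and the `TransformerBlock` class name are hypothetical placeholders; adapt them to the real model.

```python
# Sketch: wrap a QLoRA model with FSDP so the frozen NF4 weights are sharded and
# CPU-offloaded, apply activation checkpointing, and pass only the trainable LoRA
# params to the optimizer. Assumes launch via torchrun so the process group env
# vars are set.
import os

import torch
import torch.distributed as dist
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
)

def shard_for_qlora(model: torch.nn.Module) -> FSDP:
    # Activation checkpointing on every transformer block. The check_fn is a
    # stand-in; match it to the actual block class in the model.
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: m.__class__.__name__ == "TransformerBlock",
    )
    return FSDP(
        model,
        # Shards both the frozen NF4-quantized weights and the LoRA params,
        # and keeps sharded params on CPU between uses (biggest win above).
        cpu_offload=CPUOffload(offload_params=True),
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
        # Lets FSDP handle a mix of frozen NF4 weights and trainable LoRA
        # params (non-uniform requires_grad within a flat parameter).
        use_orig_params=True,
        device_id=torch.cuda.current_device(),
    )

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    # `build_qlora_model` is a hypothetical helper that returns the model with
    # FFN linears swapped for NF4-backed LoRA layers.
    model = shard_for_qlora(build_qlora_model())
    # Only the LoRA params are trainable, so an 8-bit optimizer or fusing the
    # optimizer step into the backward buys little here; those matter more for
    # full finetuning, where gradients/optimizer states match the model size.
    optim = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=2e-4
    )
```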
Should we call out that this table assumes that we are only applying QLoRA to the FFNs?
Closing this since QLoRA + FSDP2 with CPU offloading has landed in torchtune: https://github.com/pytorch/torchtune/pull/909