fsdp_qlora
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Hey, I'm loving the goal of lowering the resource requirements for training!
In this paper https://arxiv.org/abs/2403.06504 they claim that direct memory access between the GPU and NVMe storage is more efficient at swapping, which keeps the GPU at its maximum compute capacity: "Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS".
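For context, the ZeRO-Infinity baseline that Fuyou compares against is configured through DeepSpeed's ZeRO stage 3 with NVMe offload. Here's a minimal sketch of what such a config looks like — the `nvme_path` value and the specific tuning numbers are placeholder assumptions, not anything from the paper:

```python
# Sketch of a DeepSpeed ZeRO-Infinity config with NVMe offload
# (ZeRO stage 3, params and optimizer state offloaded to an NVMe SSD).
# "/local_nvme" is a hypothetical mount point; adjust for your machine.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
    },
    # Async I/O settings control how aggressively DeepSpeed reads/writes
    # the SSD; these values are illustrative defaults, not tuned numbers.
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
    },
    "train_micro_batch_size_per_gpu": 1,
}

print(ds_config["zero_optimization"]["stage"])
```

The dict would be passed to `deepspeed.initialize(..., config=ds_config)`; Fuyou's claimed advantage is in how it overlaps and schedules these SSD transfers, not in the config surface itself.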
Also, if we look at memory bandwidth, servers have many channels while high-end gaming machines are limited to two: "DDR4 3200MHz with eight channels has a theoretical bandwidth of 204.8 GB/s."
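The 204.8 GB/s figure checks out from first principles — each DDR4 channel is 64 bits (8 bytes) wide, so peak bandwidth is transfer rate × 8 bytes × channel count. A quick sketch of the arithmetic:

```python
# Theoretical peak DDR bandwidth: transfer rate in MT/s, times 8 bytes
# per 64-bit transfer, times the number of memory channels.
def ddr_bandwidth_gbs(mt_per_s: int, channels: int) -> float:
    return mt_per_s * 8 * channels / 1000  # MT/s * B -> GB/s

server = ddr_bandwidth_gbs(3200, channels=8)   # 8-channel server board
desktop = ddr_bandwidth_gbs(3200, channels=2)  # dual-channel desktop
print(server, desktop)  # 204.8 51.2
```

So a dual-channel desktop tops out around 51.2 GB/s with the same DIMMs — a quarter of the server figure — which is why CPU-RAM offloading hurts so much more on consumer hardware.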
What advice could you share, given your experience with offloading?