
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU

Open Iron-Bound opened this issue 11 months ago • 0 comments

Hey, I'm loving the goal of lowering the resource requirements for training!

In this paper (https://arxiv.org/abs/2403.06504), the authors claim that direct memory access between the GPU and NVMe storage makes swapping more efficient, keeping the GPU close to its maximum compute capacity: "Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS".
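For anyone curious what that GPU<->NVMe path looks like in practice, here's a minimal sketch using NVIDIA's kvikio bindings for GPUDirect Storage. The file path and array size are placeholders, and this isn't tied to Fuyou or to fsdp_qlora's code:

```python
# Minimal GPUDirect Storage sketch (assumes `pip install kvikio` and a
# GDS-capable NVMe mount; the path and size below are placeholders).
import cupy
import kvikio

params = cupy.random.random(100_000_000)  # ~800 MB of fp64 "weights" on the GPU

# Write GPU memory directly to NVMe, skipping the host-RAM bounce buffer
f = kvikio.CuFile("/nvme/offload.bin", "w")
f.write(params)
f.close()

# Later, read it straight back into GPU memory
restored = cupy.empty_like(params)
f = kvikio.CuFile("/nvme/offload.bin", "r")
f.read(restored)
f.close()
```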

Also, if we look at memory bandwidth, servers have many memory channels while high-end gaming machines are limited to two: "DDR4 3200MHz with eight channels has a theoretical bandwidth of 204.8 GB/s."
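A quick back-of-the-envelope check of that figure (each DDR channel is 64 bits wide, i.e. 8 bytes per transfer):

```python
# Peak DDR bandwidth = transfer rate (MT/s) x 8 bytes per 64-bit channel x channels
mts = 3200e6            # DDR4-3200 transfers per second
bytes_per_transfer = 8  # one 64-bit channel
print(f"{mts * bytes_per_transfer * 8 / 1e9:.1f} GB/s")  # 8-channel server -> 204.8 GB/s
print(f"{mts * bytes_per_transfer * 2 / 1e9:.1f} GB/s")  # dual-channel desktop -> 51.2 GB/s
```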

What advice could you share, given your experience with offloading?

Iron-Bound · Mar 13 '24