LOMO
Gradient accumulation
Is gradient accumulation still mathematically possible in this scheme? If so, we could train a 65B model on a single 3090 in a day and a half.
No, the gradient is no longer preserved in GPU memory. Offloading the gradient tensors to CPU memory or NVMe would incur a large data-transfer cost.
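For context, the memory saving comes from applying the update inside the backward pass and releasing each gradient right away, so no persistent `.grad` buffer survives for a second backward pass to accumulate into. Below is a minimal, illustrative sketch of that pattern with plain SGD (not the repository's actual `lomo.py` code); it assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`:

```python
import torch
import torch.nn as nn


def attach_fused_sgd(model: nn.Module, lr: float = 1e-3):
    """Apply the SGD step as soon as each parameter's gradient is ready,
    then drop the gradient so it never stays resident in GPU memory.
    Illustrative sketch only -- not the LOMO repository's implementation."""
    for p in model.parameters():
        if not p.requires_grad:
            continue

        def step(param):
            with torch.no_grad():
                param.add_(param.grad, alpha=-lr)  # fused update
            param.grad = None  # free the gradient immediately

        # Fires right after `param.grad` is populated during backward()
        # (requires PyTorch >= 2.1).
        p.register_post_accumulate_grad_hook(step)


if __name__ == "__main__":
    model = nn.Linear(16, 4)
    attach_fused_sgd(model, lr=0.1)

    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()  # parameters are updated here; no optimizer.step() needed

    # Every .grad is released inside the hook, so a second backward() has
    # nothing to accumulate into -- which is why standard gradient
    # accumulation conflicts with this scheme.
    assert all(p.grad is None for p in model.parameters())
```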
Based on the information provided, we consider this issue resolved. If you have any further questions or concerns, please reopen this issue and provide additional details.