LOMO
Gradient accumulation
Is gradient accumulation still mathematically possible in this scheme? If so, we could train a 65B model on a single 3090 in a day and a half.
No, the gradient is no longer preserved in GPU memory. Offloading the gradient tensors to CPU memory or NVMe would incur a large data-transfer cost.
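For context, the memory saving comes from applying the update inside the backward pass and releasing each gradient right away, so no persistent `.grad` buffer survives for a second backward pass to accumulate into. Below is a minimal, illustrative sketch of that pattern with plain SGD (not the repository's actual `lomo.py` code); it assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`:

```python
import torch
import torch.nn as nn


def attach_fused_sgd(model: nn.Module, lr: float = 1e-3):
    """Apply the SGD step as soon as each parameter's gradient is ready,
    then drop the gradient so it never stays resident in GPU memory.
    Illustrative sketch only -- not the LOMO repository's implementation."""
    for p in model.parameters():
        if not p.requires_grad:
            continue

        def step(param):
            with torch.no_grad():
                param.add_(param.grad, alpha=-lr)  # fused update
            param.grad = None  # free the gradient immediately

        # Fires right after `param.grad` is populated during backward()
        # (requires PyTorch >= 2.1).
        p.register_post_accumulate_grad_hook(step)


if __name__ == "__main__":
    model = nn.Linear(16, 4)
    attach_fused_sgd(model, lr=0.1)

    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()  # parameters are updated here; no optimizer.step() needed

    # Every .grad is released inside the hook, so a second backward() has
    # nothing to accumulate into -- which is why standard gradient
    # accumulation conflicts with this scheme.
    assert all(p.grad is None for p in model.parameters())
```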
Based on the information provided, we consider this issue resolved. If you have any further questions or concerns, please reopen this issue and provide additional details.