[Feature Request] fp16/bf16 gpu params with fp32 offloading in hivemind.Optimizer
It's something we played with a few times but did not end up merging to master. I'm creating this issue so we don't forget it. It would be great if hivemind.Optimizer correctly handled the use case where the user converts some or all model parameters to FP16/BF16 after the optimizer was created (with offloading enabled).
This mode enables the most GPU-memory-efficient AMP training scenario and can even accelerate CPU-GPU data transfer.
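For reference, a rough sketch of the intended workflow. The hivemind.Optimizer arguments follow the current API as I recall it, but the model and hyperparameters are made up, so treat this as illustrative:

```python
import torch
import hivemind

dht = hivemind.DHT(start=True)
model = torch.nn.Linear(512, 512).cuda()  # parameters start out in fp32

opt = hivemind.Optimizer(
    dht=dht,
    run_id="fp16_offload_demo",            # made-up experiment name
    target_batch_size=4096,
    batch_size_per_step=32,
    params=model.parameters(),
    optimizer=lambda params: torch.optim.Adam(params, lr=1e-3),
    offload_optimizer=True,                # master params / optimizer state live on CPU in fp32
)

# The feature request: this cast should "just work" after the optimizer was created,
# with the offloaded master copy staying in fp32.
model.bfloat16()
```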
Implementing this requires the following changes:
- GradientAverager:
  - if the local gradient accumulators are not fp32, make sure we (1) keep the averaged grads in float32 and (2) do not raise errors
  - when copying GPU grads => offloaded grads, first copy the low-precision data to CPU, then cast it to the offloaded dtype
  - when copying offloaded grads => GPU grads, first cast the data to the GPU tensor's dtype, then copy it to the GPU (see the data-movement sketch below)
- StateAverager:
  - make sure we keep the offloaded master params in fp32 and do not raise errors if the user casts some GPU params to fp16/bf16
  - optimize data movement: same as in GradientAverager, but for parameters instead of grads
- GradScaler:
  - hivemind.GradScaler must correctly unscale fp16/bf16 grads, especially when their offloaded counterparts are fp32 (see the second sketch below)
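For the data-movement points above, a minimal plain-PyTorch sketch of the copy-then-cast / cast-then-copy pattern; the helper names are hypothetical, not existing hivemind internals:

```python
import torch

def gpu_grads_to_offloaded(gpu_grad: torch.Tensor, offloaded_grad: torch.Tensor) -> None:
    """GPU (fp16/bf16) -> offloaded (fp32): move the low-precision bytes to the host first,
    then upcast on the CPU side, so the PCIe transfer stays half-size."""
    assert offloaded_grad.device.type == "cpu" and offloaded_grad.dtype == torch.float32
    low_precision_cpu = gpu_grad.detach().cpu()  # device-to-host copy in fp16/bf16
    offloaded_grad.copy_(low_precision_cpu)      # copy_ upcasts to fp32 on the CPU

def offloaded_grads_to_gpu(offloaded_grad: torch.Tensor, gpu_grad: torch.Tensor) -> None:
    """Offloaded (fp32) -> GPU (fp16/bf16): downcast on the CPU first, then send
    the smaller low-precision tensor to the GPU."""
    low_precision_cpu = offloaded_grad.to(gpu_grad.dtype)  # fp32 -> fp16/bf16 on CPU
    gpu_grad.copy_(low_precision_cpu)                      # host-to-device copy, no extra cast
```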
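And a toy illustration of the GradScaler point: unscaling has to be applied to (or at least stay consistent with) the fp32 offloaded copy that the optimizer actually steps on, not only the low-precision GPU grads. This is not how hivemind.GradScaler is implemented; the numbers and shapes are made up to show the invariant:

```python
import torch

scale = 2.0 ** 10                                             # example loss scale
gpu_grad = torch.full((1024,), 0.5 * scale, device="cuda", dtype=torch.float16)
offloaded_grad = torch.zeros(1024, dtype=torch.float32)       # fp32 master copy on CPU

# offload: copy the low-precision bytes to CPU, upcast there (same pattern as above)
offloaded_grad.copy_(gpu_grad.cpu())

# unscale on the fp32 copy that the optimizer will step on
offloaded_grad.mul_(1.0 / scale)
print(offloaded_grad[:3])  # tensor([0.5000, 0.5000, 0.5000]) -- the true gradient values
```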