
[BUG] Concern around mixed precision training where weights are in low precision

Open ethansmith2000 opened this issue 1 year ago • 2 comments

I noticed that in DeepSpeed, when training with fp16 or bf16, the weights are kept in the lower precision. I am wondering if there is any chance of making this optional. For both bf16 and fp16 there is a risk of having the weight update "disappear" due to the low precision.

This paper first brought the issue to my attention: https://arxiv.org/abs/2010.06192

Empirically, I have found a lot of diffusion model training to have small gradient norms, often around 0.02. In bf16, and possibly even in fp16, it appears that the optimization step may not even register. [screenshot]

In fp16, more bits are allocated to the mantissa, so it's less risky, but it still seems like a potential issue.
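As a rough illustration of the concern (a sketch, not from the original report): applying a 1e-3 update to a weight of 1.0 registers in fp32 and fp16, but rounds away entirely in bf16.

```python
import torch

# Apply a 1e-3 update (e.g. lr * grad for a small gradient) to a weight of 1.0
# in each dtype and see how much of it survives the rounding.
update = 1e-3
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    w = torch.tensor(1.0, dtype=dtype)
    stepped = w + update
    print(dtype, (stepped - w).item())
# torch.float32  ~1e-3     (update registers)
# torch.float16  ~9.77e-4  (registers, with rounding error)
# torch.bfloat16 0.0       (the step vanishes entirely)
```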

Fetching the dtype of the optimizer states or model weights does show they are in the reduced precision, but to make sure I also checked the GPU memory usage. The screenshot below is from ZeRO stage-1 training of SDXL. [screenshot]
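A sketch of one way to do this dtype check (assuming `engine` is the model engine returned by `deepspeed.initialize()`; `engine.module` and `engine.optimizer` are the usual engine handles):

```python
import torch

def report_dtypes(engine):
    """Print dtypes of the compute parameters and of every tensor in the
    optimizer's state_dict (master weights / moments are expected in fp32)."""
    for name, p in engine.module.named_parameters():
        print("param", name, p.dtype)

    def walk(obj, prefix=""):
        # Recursively yield (name, dtype) for all tensors in a nested state dict.
        if torch.is_tensor(obj):
            yield prefix, obj.dtype
        elif isinstance(obj, dict):
            for k, v in obj.items():
                yield from walk(v, f"{prefix}.{k}" if prefix else str(k))
        elif isinstance(obj, (list, tuple)):
            for i, v in enumerate(obj):
                yield from walk(v, f"{prefix}[{i}]")

    for name, dt in walk(engine.optimizer.state_dict()):
        print("opt state", name, dt)
```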

Additionally, I had previously mentioned here that the DeepSpeed BERT training example suffers a significant performance loss when running in bf16. [screenshot]

ethansmith2000 avatar Mar 24 '24 01:03 ethansmith2000

wanted to link this one here too https://github.com/Lightning-AI/pytorch-lightning/issues/18016

ethansmith2000 avatar Jun 12 '24 07:06 ethansmith2000

found the solution sire?

SonicCodes avatar Sep 24 '24 13:09 SonicCodes

Any solution yet?

Monohydroxides avatar Oct 18 '24 08:10 Monohydroxides

@ethansmith2000, apologies for delayed response. Assuming this remains relevant, please see my comments below.

I noticed that in deepspeed, when training with fp16 and bf16, weights are set to the lower precision. I am wondering if there is any chance of making this optional.

Can you clarify your concerns? Mixed-precision training with ZeRO involves forward/backward computation in the lower precision and optimizer computation in fp32. Is your expectation different?
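For reference, a minimal plain-PyTorch schematic of that pattern (not DeepSpeed internals): the master weight and Adam state stay in fp32 while the forward/backward pass runs on a bf16 copy.

```python
import torch

master_w = torch.nn.Parameter(torch.randn(16, 16))  # fp32 master weight
opt = torch.optim.Adam([master_w], lr=1e-3)          # fp32 optimizer state

for _ in range(3):
    w_bf16 = master_w.to(torch.bfloat16)             # low-precision copy for compute
    x = torch.randn(8, 16, dtype=torch.bfloat16)
    loss = (x @ w_bf16).pow(2).mean()                # bf16 forward
    opt.zero_grad()
    loss.backward()                                  # grad is cast back to fp32 on master_w
    opt.step()                                       # update applied in fp32, so a small
                                                     # step is not rounded away
```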

fetching the dtype of the optimizer states or model weights does show they are in the reduced precision

Can you describe how you observed that optimizer state is in reduced precision? The ZeRO design is to keep the master weights and optimizer states in fp32 precision.
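For completeness, an illustrative config for this setup (shown as a Python dict; the exact values are placeholders): bf16 compute with ZeRO stage 1, where the fp32 master weights and optimizer states are maintained internally rather than via a separate flag.

```python
# Illustrative DeepSpeed config: bf16 compute with ZeRO stage 1. The fp32
# master weights and optimizer states are handled internally by ZeRO.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
# Typically passed as: deepspeed.initialize(model=model, config=ds_config, ...)
```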

@SonicCodes and @Monohydroxides, FYI

tjruwase avatar Dec 28 '24 14:12 tjruwase

@tjruwase

Can you describe how you observed that optimizer state is in reduced precision? The ZeRO design is to keep the master weights and optimizer states in fp32 precision.

I observed that the optimizer states are in fp32, and the forward/backward are in fp16/bf16. At the time, I believed the forward and backward processes could be carried out with the precision we desired—for example, keeping certain blocks’ parameters in FP32 while others were in FP16/BF16. However, this is currently not feasible.
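For context, a plain-PyTorch sketch of the selective precision described above (module choices are illustrative; as noted, this is not something DeepSpeed currently supports):

```python
import torch
import torch.nn as nn

# Keep one block (here the LayerNorm) in fp32 while the rest runs in bf16.
model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 4))
model.to(torch.bfloat16)   # whole model to bf16 ...
model[1].float()           # ... except the LayerNorm, kept in fp32

x = torch.randn(2, 16, dtype=torch.bfloat16)
h = model[0](x)                             # bf16 linear
h = model[1](h.float()).to(torch.bfloat16)  # fp32 norm, cast back at the boundary
out = model[2](h)                           # bf16 linear
```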

Monohydroxides avatar Dec 28 '24 16:12 Monohydroxides