
Grad becomes NaN

tungdq212 opened this issue on Jul 19, 2023 · 3 comments

When training on my local machine (3090, 24 GB) with batch size 12, the grad value becomes NaN after a few steps (see attached screenshot). But I don't hit this when training on Google Cloud with an A100 40 GB and batch size 20. Why does this happen, and how can I fix it?

tungdq212 · Jul 19 '23

If you aren't seeing NaNs with larger batch sizes, I would recommend keeping the batch size high (2048 if you want to mimic our experiment) and setting device_train_microbatch_size to the largest value that doesn't OOM — in your case, that sounds like 12. device_train_microbatch_size controls gradient accumulation: the number of accumulation steps is batch_size // device_train_microbatch_size. Composer runs a forward and backward pass on each microbatch and takes a single optimizer step once batch_size samples have been processed.

This is mathematically equivalent to training on a large batch size at once as long as the network does not have batch norm layers. Let me know if this works!
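To see why the two are equivalent (absent batch norm), here is a small plain-PyTorch sanity check, not part of this repo: it compares the gradient from one full-batch backward pass against accumulated microbatch gradients, with each microbatch loss weighted by its share of the batch.

```python
# Illustrative sketch only: full-batch gradient vs. accumulated microbatch gradients.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
x, y = torch.randn(12, 8), torch.randn(12, 1)
loss_fn = torch.nn.MSELoss()

# Full batch of 12 samples in a single forward/backward pass.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Same 12 samples as 3 microbatches of 4, gradients accumulated across backward passes.
model.zero_grad()
for xb, yb in zip(x.chunk(3), y.chunk(3)):
    # Scale each microbatch loss by its fraction of the full batch so the sums match.
    (loss_fn(model(xb), yb) * len(xb) / len(x)).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True
```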

Extra: you can set device_train_microbatch_size to 'auto', and Composer will decrease the microbatch size until it fits in memory. This is an experimental feature, though, so it may not work out of the box for your use case.
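For reference, here is a minimal sketch of how device_train_microbatch_size is passed to Composer's Trainer. The model, dataset, and optimizer below are toy placeholders (the actual repo sets the batch size and microbatch size through its YAML config rather than in Python like this):

```python
# Minimal sketch, not the repo's actual training script.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Toy stand-ins for the diffusion model and dataset.
model = ComposerClassifier(torch.nn.Linear(16, 4), num_classes=4)
dataset = TensorDataset(torch.randn(2048, 16), torch.randint(0, 4, (2048,)))
train_dataloader = DataLoader(dataset, batch_size=2048)  # full per-device batch

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=torch.optim.SGD(model.parameters(), lr=0.1),
    max_duration='1ep',
    # Each 2048-sample batch is split into microbatches of 12; Composer runs a
    # forward/backward pass per microbatch and one optimizer step per full batch.
    device_train_microbatch_size=12,
    # device_train_microbatch_size='auto',  # experimental: shrink microbatch on OOM
)
trainer.fit()
```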

Landanjs · Jul 20 '23

Hello, did you run this on a 3090 (24 GB) device?

YMKiii · Aug 30 '23

No, we used A100s (40 GB / 80 GB) and H100s (80 GB).

Landanjs · Aug 30 '23