medicaldiffusion NaN issue

Hi, many thanks for your excellent work! I have a problem when training the VQ-GAN: the loss suddenly becomes NaN. Do you know why this happens? I used the LIDC dataset.

Jun 02 '23 17:06 CaiwenXu

I'm currently having the same problem. I used the exact same configs provided here and still had no luck; training is very unstable. The model also suffers from mode collapse once the discriminator starts training.

Jul 25 '23 11:07 benearnthof

I believe this problem may stem from the accumulate_grad_batches parameter. I successfully trained a run for more than 50000 steps, but trying to replicate that training with accumulate_grad_batches > 1 runs into the NaN problem. @CWX-student can you confirm this, or do you have any other info on your end?
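
For illustration only, here is a minimal plain-Python sketch of why gradient accumulation might be more exposed to this. It is not the repository's training loop, and the helpers accumulate and first_bad_index are hypothetical names. The assumption is that gradients from several micro-batches are averaged before the optimizer step, so a single non-finite micro-batch can poison the whole step.

```python
import math

def accumulate(micro_grads):
    """Average per-micro-batch gradients, as gradient accumulation does (hypothetical helper)."""
    total = 0.0
    for g in micro_grads:
        total += g / len(micro_grads)
    return total

def first_bad_index(micro_grads):
    """Return the index of the first non-finite gradient, or None if all are finite."""
    for i, g in enumerate(micro_grads):
        if not math.isfinite(g):
            return i
    return None

healthy = [0.2, -0.1, 0.4, 0.1]
poisoned = [0.2, float("inf"), 0.4, float("-inf")]  # two overflowed micro-batches

healthy_step = accumulate(healthy)    # finite
poisoned_step = accumulate(poisoned)  # inf + (-inf) -> nan
bad_at = first_bad_index(poisoned)    # 1
```

A check like first_bad_index, run before the optimizer step, could help confirm whether the NaN first appears in a specific micro-batch or only after accumulation.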

Jul 25 '23 11:07 benearnthof

Update: Setting the precision parameter in the config to at least 32 seems to alleviate this problem. See https://discuss.pytorch.org/t/distributed-training-gives-nan-loss-but-single-gpu-training-is-fine/63664/6
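
A rough sketch of the kind of failure that 16-bit precision can cause, using only the Python standard library. This is an assumption about the mechanism, not something verified against this codebase, and to_fp16 is a hypothetical helper. Half precision cannot represent values above 65504, so a large intermediate value overflows to inf, and a later inf - inf yields NaN.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value; overflow becomes +/-inf (hypothetical helper)."""
    try:
        # The "e" struct format is IEEE 754 binary16 (half precision).
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

small = to_fp16(1.0)              # exactly representable in half precision
big = to_fp16(300.0 * 300.0)      # 90000 exceeds the fp16 maximum of 65504 -> inf
diff = big - big                  # inf - inf -> nan

same_in_fp32_range = 300.0 * 300.0  # 90000.0 is fine at 32-bit (or higher) precision
```

The same product stays finite at 32-bit precision, which would be consistent with the observation that raising the precision setting makes the NaNs go away.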

Jul 25 '23 12:07 benearnthof