medicaldiffusion

NaN issue

Open CaiwenXu opened this issue 2 years ago • 3 comments

Hi, many thanks for your excellent work! I have a problem when training the VQ-GAN: the loss suddenly becomes NaN. Do you know why this happens? I used the LIDC dataset.

CaiwenXu avatar Jun 02 '23 17:06 CaiwenXu

I'm currently having the same problem. I used the exact same configs provided here and still had no luck; training is very unstable. The model also suffers from mode collapse after the discriminator starts training.

benearnthof avatar Jul 25 '23 11:07 benearnthof

I believe this problem may stem from the accumulate_grad_batches parameter. I successfully trained a run for more than 50000 steps, but trying to replicate that training with accumulate_grad_batches > 1 runs into the NaN problem. @CWX-student, can you confirm this, or do you have any other info on your end?

benearnthof avatar Jul 25 '23 11:07 benearnthof

Update: Setting the precision parameter in the config to at least 32 seems to alleviate this problem. See https://discuss.pytorch.org/t/distributed-training-gives-nan-loss-but-single-gpu-training-is-fine/63664/6

benearnthof avatar Jul 25 '23 12:07 benearnthof
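
For anyone wondering why raising the precision could help: one plausible mechanism (an assumption, not something confirmed in this thread) is that half precision has a very small range, so an intermediate value overflows to inf and a later operation turns it into NaN. The sketch below only uses the Python standard library to emulate half-precision rounding; `to_fp16` is a hypothetical helper written for this illustration and is not part of the medicaldiffusion code.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value (hypothetical helper).

    Values too large for half precision become +/-inf, which mimics what
    happens to an overflowing fp16 tensor element.
    """
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        # "<e" is the struct format code for a little-endian IEEE 754 half float.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


x = 300.0

# In double precision the square is harmless.
sq_full = x * x            # 90000.0
diff_full = sq_full - sq_full  # 0.0

# In emulated half precision the square exceeds FP16_MAX and overflows.
sq_half = to_fp16(to_fp16(x) * to_fp16(x))  # inf
diff_half = sq_half - sq_half               # inf - inf = nan

print("full precision:", sq_full, diff_full)
print("half precision:", sq_half, diff_half)
```

If the training script is built on PyTorch Lightning, `precision` and `accumulate_grad_batches` are both `Trainer` arguments, so it may be worth checking which values the config actually passes through before comparing runs.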