NeurIPS-CellSeg

NaN loss with baseline model

Open hasukmin12 opened this issue 2 years ago • 7 comments

Hi,

I keep getting a NaN loss at around 20 epochs. I haven't changed anything in the code yet; I just ran it. It keeps happening even when I use other networks such as SwinUNETR or UNETR.

image

hasukmin12 avatar Sep 29 '22 23:09 hasukmin12

Please make sure your MONAI version is 0.9.

JunMa11 avatar Sep 29 '22 23:09 JunMa11

If you are using an older MONAI version, please try removing the data augmentation:

https://github.com/JunMa11/NeurIPS-CellSeg/blob/2df1ee5dc26b4ff10202da73ef22d72651e8e5bd/baseline/model_training_3class.py#L128-L148

JunMa11 avatar Sep 29 '22 23:09 JunMa11

I'm now using MONAI 0.9.1, but it still happens.

hasukmin12 avatar Sep 30 '22 00:09 hasukmin12

When I check for the problem with the code below: image

this error occurs: image

hasukmin12 avatar Sep 30 '22 00:09 hasukmin12

I really don't know why, but the inputs become NaN at around 20 epochs. image

hasukmin12 avatar Sep 30 '22 00:09 hasukmin12
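To narrow down where the NaN first appears, it can help to stop training the moment a monitored value becomes non-finite, so the offending epoch and batch can be inspected. The following is a minimal sketch, not code from the repository: `assert_finite` and `NonFiniteValueError` are hypothetical names, and the sketch assumes the monitored values have already been converted to plain Python floats (for example with `.item()` on a PyTorch tensor).

```python
import math


class NonFiniteValueError(RuntimeError):
    """Raised when a monitored training value is NaN or infinite."""


def assert_finite(name, value, epoch, step):
    """Return `value` unchanged, or raise if it is NaN or +/-inf.

    `value` is expected to be a plain Python float.
    """
    if not math.isfinite(value):
        raise NonFiniteValueError(
            f"{name} is {value} at epoch {epoch}, step {step}"
        )
    return value


# Hypothetical usage inside a training loop (variable names are assumptions):
#     assert_finite("input sum", inputs.sum().item(), epoch, step)
#     assert_finite("loss", loss.item(), epoch, step)
```

Checking the inputs before the loss helps separate bad data (or a faulty augmentation) from numerical problems inside the network.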

We cannot reproduce your error.

  • Please delete the data and re-run the preprocessing.
  • Have you tried removing the data augmentation?

JunMa11 avatar Oct 08 '22 18:10 JunMa11

I ran into the same problem as @hasukmin12. My environments: torch 1.10/1.11/1.12+cu113, monai 1.0/0.9.1/0.9. If your environment matches any of these, you may get the NaN loss during training.

I have tried several solutions, and one of the following can work:

[1] Remove some of the data augmentations; this may let training run.

[2] Change the torch version to 1.8 and make sure your MONAI version is 0.9.

[3] However, I think the best option is to use Docker.

JintuZheng avatar Oct 17 '22 05:10 JintuZheng
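Since the workarounds above depend on specific package versions, a quick check of what is actually installed can save a debugging round trip. This is a minimal sketch using only the Python standard library (Python 3.8+); `installed_version` and `matches_series` are hypothetical helper names, and the target versions (torch 1.8, monai 0.9) are simply the ones mentioned in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_series(version, series):
    """True if `version` equals `series` or starts with `series` plus a dot.

    For example, "0.9.1" is in the "0.9" series, but "0.10.0" is not.
    """
    if version is None:
        return False
    # Drop a local build tag such as "+cu113" before comparing.
    base = version.split("+", 1)[0]
    return base == series or base.startswith(series + ".")


if __name__ == "__main__":
    # Versions taken from the workaround discussed in this thread.
    for package, series in (("torch", "1.8"), ("monai", "0.9")):
        found = installed_version(package)
        status = "ok" if matches_series(found, series) else "differs"
        print(f"{package}: installed={found}, wanted {series}.x -> {status}")
```

Running the script inside the training environment prints one line per package, which is easy to paste into a bug report.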