NeurIPS-CellSeg
NaN loss with baseline model
Hi,
I keep getting a NaN loss after about 20 epochs. I haven't changed anything in the code; I just ran it. It keeps happening even if I use other networks such as SwinUNETR or UNETR.
Please make sure your MONAI version is 0.9.
If you are using an older MONAI version, please try removing the data augmentation:
https://github.com/JunMa11/NeurIPS-CellSeg/blob/2df1ee5dc26b4ff10202da73ef22d72651e8e5bd/baseline/model_training_3class.py#L128-L148
Now I'm using MONAI 0.9.1, but it still happens.
When I checked this problem with the code below,
this error occurred.
I really don't know why, but the inputs become NaN at around 20 epochs.
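A minimal sketch of one way to confirm where the NaN first appears: check every tensor in a batch for non-finite values (NaN or Inf) before the forward pass, so you can tell whether the data/augmentation pipeline or the model itself produces them. The helper name `nonfinite_keys` and the batch keys `"img"`/`"label"` are hypothetical and may not match the baseline script. For NaNs that originate in the backward pass, PyTorch's `torch.autograd.set_detect_anomaly(True)` can additionally report which operation produced them.

```python
import numpy as np


def nonfinite_keys(batch: dict) -> list:
    """Return the keys whose values contain NaN or Inf (hypothetical helper)."""
    bad = []
    for key, value in batch.items():
        # torch tensors (CPU or GPU) are detached and copied to a NumPy array;
        # NumPy arrays and plain lists are converted directly
        if hasattr(value, "detach"):
            value = value.detach().cpu().numpy()
        arr = np.asarray(value, dtype=np.float64)
        if not np.isfinite(arr).all():
            bad.append(key)
    return bad


# Hypothetical usage inside the training loop, before calling the model:
#   bad = nonfinite_keys({"img": inputs, "label": labels})
#   if bad:
#       raise RuntimeError(f"non-finite values in {bad} at epoch {epoch}")
```

If the check fires on the inputs rather than on the loss, the NaN comes from the data side (e.g. an intensity transform dividing by zero on a constant patch), which fits the suggestion in this thread to try disabling the augmentation.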
We cannot reproduce your error.
- Please delete the data and re-run the preprocessing.
- Have you tried to remove the data augmentation?
I got the same problem as @hasukmin12. My environments: torch 1.10/1.11/1.12+cu113 and MONAI 1.0/0.9.1/0.9. If your environment is the same as above, you may get the NaN loss in training.
I have tried several solutions, and one of the following should work:
[1] Remove some of the data augmentations, which may let it run.
[2] Downgrade torch to 1.8 and make sure your MONAI version is 0.9.
[3] However, I think the best option is to use Docker.
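Before retraining, it may help to confirm which versions are actually installed in the active environment. A minimal sketch using only the standard library, assuming the pip distribution names are `torch` and `monai`; the helper names `major_minor` and `check_versions` are hypothetical, and the expected versions are the ones suggested in this thread (torch 1.8, MONAI 0.9):

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_minor(version_str: str) -> tuple:
    """Extract (major, minor) from strings like '0.9.1' or '1.8.1+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version_str)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_str!r}")
    return int(match.group(1)), int(match.group(2))


def check_versions(expected: dict) -> dict:
    """Map each package to (installed_version or None, matches_expected)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            report[package] = (None, False)
            continue
        report[package] = (installed, major_minor(installed) == wanted)
    return report


if __name__ == "__main__":
    # Versions suggested in this thread to avoid the NaN loss
    for pkg, (installed, ok) in check_versions({"torch": (1, 8), "monai": (0, 9)}).items():
        print(f"{pkg}: {installed} -> {'OK' if ok else 'MISMATCH'}")
```

Only major.minor is compared, so MONAI 0.9.1 counts as 0.9; as reported above, that alone did not fix the NaN loss for everyone, which is why the torch version is checked too.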