Validation files not being saved
After successfully training two anatomical models, I've moved on to a third model that I'm having some issues with. Firstly, after 110 epochs the pseudo Dice score for both cochlea structures is 0.0, while the other structures are progressing well (the cochleae are very small, so perhaps they are harder to train?). Secondly, my previous models saved interim predictions in a folder called "validation" within the fold_x directory. This time, however, no such folder is being created. Why might this be the case, and could it be related to the cochlea structures not being contoured?
EDIT: I included the following line in the dataset.json file:
"regions_class_order": [3,4],
where 3 and 4 are the indexes of the left and right cochlea respectively. Upon restarting training, the right cochlea started to segment, but the left did not. Can this also be explained?
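For context, my reading of the region-based training documentation is that "regions_class_order" is meant to be paired with labels that are themselves defined as regions, with one entry per foreground label. The sketch below is what I understand the expected layout to be; the structure names, the non-cochlea labels and the metadata values are placeholders, not my actual file:

```bash
# Illustrative only: writes an example region-style dataset.json layout to a scratch file.
# Structure names, non-cochlea labels and metadata values are placeholders.
cat > dataset_region_example.json <<'EOF'
{
    "channel_names": {"0": "CT"},
    "labels": {
        "background": 0,
        "structure_a": 1,
        "structure_b": 2,
        "cochlea_left": [3],
        "cochlea_right": [4]
    },
    "regions_class_order": [1, 2, 3, 4],
    "numTraining": 100,
    "file_ending": ".nii.gz"
}
EOF
```

Please correct me if I've misread the documentation.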
@FabianIsensee @seziegler
I would like to add that I have experienced this as well: the validation images at the end of training were not saved for me either. I have just been running a prediction after training to validate manually.
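For reference, my manual check is roughly the sketch below (paths and the dataset id are placeholders; adjust the configuration and fold to match your training). nnUNetv2_predict does the inference, and nnUNetv2_evaluate_folder can then compute Dice against the held-out labels:

```bash
# Placeholder paths and dataset id; adjust -c / -f to match the trained model.
nnUNetv2_predict -i /path/to/imagesVal -o /path/to/predictions -d XXX -c 3d_fullres -f 0

# Optional: compute Dice scores against the ground-truth labels for those cases.
nnUNetv2_evaluate_folder /path/to/labelsVal /path/to/predictions \
    -djfile /path/to/nnUNet_preprocessed/DatasetXXX_NAME/dataset.json \
    -pfile /path/to/nnUNet_preprocessed/DatasetXXX_NAME/nnUNetPlans.json
```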
Can anyone assist with this please?
Hello, @MattAWard. Check out this comment I made in a possibly-related issue: https://github.com/MIC-DKFZ/nnUNet/issues/2801#issuecomment-3079383406
Did you stop training early for your cochlea model, perhaps because the pseudo-Dice was not climbing (same question to you, @vmiller987)? If so, that is why the validation folder was not created: it is only created at the end of training, which is 1000 epochs by default.
On another note: yes, in my experience small structures can trip up the older default configuration of nnUNetv2. Your first option is to use the new residual encoder presets: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md Are you already using these? I've had success segmenting small structures (0.1% of total volume) with them. Your second option would be to swap in a focal loss instead of nnUNet's default Dice + cross-entropy loss. This is more involved to integrate into nnUNet, but I was able to train a custom UNet with it before I started using nnUNet.
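In case it helps, the preset workflow is just two commands (dataset id is a placeholder; swap M for L or XL as described in the linked document):

```bash
# Plan and preprocess with the ResEnc M preset, then train fold 0 with the resulting plans.
nnUNetv2_plan_and_preprocess -d XXX -pl nnUNetPlannerResEncM
nnUNetv2_train XXX 3d_fullres 0 -p nnUNetResEncUNetMPlans
```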
Contact me if you need more details.
@FWijanto My logs showed that the trainings completed, and I do remember an error at the point where validation was supposed to run. #2838 is closer to what mine was reporting.
I am currently having hardware issues, so I haven't been able to do any proper testing. Our machine does not like a mix of 4090s, 5090s, and 6000 Pros.
Thanks @FWijanto. No, I have only been using the standard planner so far. With only 16 GB of VRAM currently available, we are unable to run the recommended ResEnc L planner. The readme you referenced states that nnU-Net ResEnc M has a similar GPU budget to the standard UNet configuration, so can I expect similar performance? Or is this planner different in some way that may still improve small-structure delineation?
Alright, @vmiller987. That looks like an unrelated issue then. You should try the fix suggested in the issue you mentioned.
@MattAWard, I highly recommend trying the ResEncM plan. It is an architectural modification of the standard UNet, so there is a real difference in behaviour that may be at the core of the success I had compared to the standard plan. I trained with the L plan, but I expect the M plan to behave similarly with somewhat reduced performance (although with deep learning today, the wisest approach is still "try it out and see"). If I had to bet, the difference between the standard plan and ResEncM would be a Dice of 0 versus 0.6, while the difference between ResEncM and ResEncL might be 0.6 versus 0.7 (numbers drawn arbitrarily). I will look into this architecture; I can't yet offer a theoretical or intuitive reason why it gives such a drastic improvement for my use case. But keep us updated on the results if you give it a try!
Thanks @FWijanto, I gave the ResEncM planner a shot, and after 80 epochs the average epoch time was nearly 4 hours, meaning it would take about a year and a half to train a 5-fold model! The commands I used were:
```bash
nnUNetv2_plan_and_preprocess -d XXX -c 3d_fullres -pl nnUNetPlannerResEncM --verify_dataset_integrity
nnUNetv2_train XXX 3d_fullres 0 -tr nnUNetTrainerNoMirroring -p nnUNetResEncUNetMPlans --c
```
The --c flag was there to continue from a previous checkpoint.
My guess is that an RTX A4000 with 16 GB of VRAM just doesn't have enough compute power? But I'm amazed at the difference in speed, since I typically get an epoch time of around 2 minutes with the standard setup.
Hi @MattAWard, looking at this, I have a few ideas.
Are you storing your training data on an HDD or networked storage rather than a local SSD (probably not, but just checking; NVMe is best, followed by SATA)? Second, since you are resuming training from a checkpoint, was the original training done with a non-ResEnc planner (I would assume this would throw errors, but likewise, just checking)?
Then, my first port of call would be to increase the GPU memory budget (to 16 in your case) using the -gpu_memory_target flag; the steps are described at https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md. But I don't have high confidence that this alone would solve the problem. It is usually a mismatch between GPU throughput (training) and CPU throughput (data augmentation). If possible, track GPU and CPU usage (in %), especially around the point where the epoch time explodes, and see if that helps you diagnose the issue. Also see if you can assign more CPU cores to data augmentation via the nnUNet_n_proc_DA environment variable if CPU is the bottleneck.
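Roughly what I have in mind (dataset id, plans name and worker count are placeholders; the flags are the ones described in the resenc_presets document, so double-check the exact syntax there):

```bash
# Re-plan the ResEnc M preset with a 16 GB budget; the new plans name is arbitrary.
nnUNetv2_plan_experiment -d XXX -pl nnUNetPlannerResEncM \
    -gpu_memory_target 16 -overwrite_plans_name nnUNetResEncUNetMPlans_16G
# (Then train with -p nnUNetResEncUNetMPlans_16G; rerun preprocessing for the new plans if nnU-Net asks for it.)

# If CPU is the bottleneck, give data augmentation more worker processes before training.
export nnUNet_n_proc_DA=16

# In separate terminals, watch GPU versus CPU utilisation while an epoch is running.
watch -n 1 nvidia-smi   # GPU utilisation and memory
htop                    # per-core CPU utilisation
```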
You can also try running the nnUNet benchmarks (https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/benchmarking.md; there is good general performance-troubleshooting advice there too) and see whether you are underperforming relative to their reference numbers. That would signal not that your GPU is incapable, but that there is a bottleneck somewhere in your setup.
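If memory serves, the benchmark trainers from that document are used like this (placeholder dataset id):

```bash
# Times a few epochs end to end (GPU plus data pipeline).
nnUNetv2_train XXX 3d_fullres 0 -tr nnUNetTrainerBenchmark_5epochs
# Same, but skips data loading, which helps separate raw GPU speed from I/O and augmentation.
nnUNetv2_train XXX 3d_fullres 0 -tr nnUNetTrainerBenchmark_5epochs_noDataLoading
```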
As a low-risk (if low-probability) check, try running training with the standard trainer instead of nnUNetTrainerNoMirroring and see whether the issue persists.
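The standard trainer is simply what you get when you omit the -tr flag:

```bash
# Same run as before, but with the default trainer (dataset id is a placeholder).
nnUNetv2_train XXX 3d_fullres 0 -p nnUNetResEncUNetMPlans
```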
Good luck and report your findings!