nnUNetResEncUNetLPlans unable to find results for best configuration
After finishing my training, I ran
nnUNetv2_find_best_configuration 1 -p nnUNetResEncUNetLPlans -c 2d
but got this message
Traceback (most recent call last):
File "/scratch/gpfs/nc1514/specseg/.venv/bin/nnUNetv2_find_best_configuration", line 10, in <module>
sys.exit(find_best_configuration_entry_point())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/gpfs/nc1514/specseg/.venv/lib/python3.12/site-packages/nnunetv2/evaluation/find_best_configuration.py", line 296, in find_best_configuration_entry_point
find_best_configuration(dataset_name, model_dict, allow_ensembling=not args.disable_ensembling,
File "/scratch/gpfs/nc1514/specseg/.venv/lib/python3.12/site-packages/nnunetv2/evaluation/find_best_configuration.py", line 101, in find_best_configuration
accumulate_cv_results(output_folder, merged_output_folder, folds, num_processes, overwrite)
File "/scratch/gpfs/nc1514/specseg/.venv/lib/python3.12/site-packages/nnunetv2/evaluation/accumulate_cv_results.py", line 36, in accumulate_cv_results
raise RuntimeError(f"fold {f} of model {trained_model_folder} is missing. Please train it!")
RuntimeError: fold 0 of model /scratch/gpfs/nc1514/specseg/data/nnUNet/nnUNet_results/Dataset001_CO2/nnUNetTrainer__nnUNetResEncUNetLPlans__2d is missing. Please train it!
However, I already trained them as shown in the directory below.
Hi everyone. I had the same issue, investigated the root cause, and found a workaround. @nathanchenseanwalter did you stop your training prematurely (i.e. before 1000 epochs) for your folds? In my case, this was the cause of the problem.
What happens
- The documentation tells you to run `nnUNetv2_train` with the `--npz` flag in order to use `nnUNetv2_find_best_configuration`.
- I interrupted the training before epoch 1000, so there is no `checkpoint_final.pth`.
- Similarly, due to the early stopping, the validation procedure was not performed, so there is no "validation" folder and no .npz file. Also, since there is no `checkpoint_final.pth`, you cannot simply rerun `nnUNetv2_train` with the `--val` flag.
- When running `nnUNetv2_find_best_configuration`, the function `accumulate_cv_results` does not find a "validation" folder in the "fold_n" folders, so an error is raised, asking whether the training has been performed. A quick way to see which folds are affected is sketched after this list.
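A minimal check script, assuming the standard 5-fold setup and the model folder from the traceback above (adjust the path to your own setup):

```python
import os

# Assumption: model folder taken from the traceback above -- adjust to your setup.
model_dir = ("/scratch/gpfs/nc1514/specseg/data/nnUNet/nnUNet_results/"
             "Dataset001_CO2/nnUNetTrainer__nnUNetResEncUNetLPlans__2d")

for fold in range(5):  # standard nnU-Net 5-fold cross-validation
    fold_dir = os.path.join(model_dir, f"fold_{fold}")
    has_final = os.path.isfile(os.path.join(fold_dir, "checkpoint_final.pth"))
    has_latest = os.path.isfile(os.path.join(fold_dir, "checkpoint_latest.pth"))
    has_val = os.path.isdir(os.path.join(fold_dir, "validation"))
    print(f"fold_{fold}: final={has_final}, latest={has_latest}, validation={has_val}")
```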
The workaround
- The cleanest workaround in this case is, in my opinion, to make sure `checkpoint_final.pth` is present and to rerun `nnUNetv2_train` with the `--val` flag.
- What makes sense to me is to copy `checkpoint_latest.pth`, rename it to `checkpoint_final.pth`, and run the validation (see the sketch after this list).
- My folds were run for an inconsistent number of epochs. I understand the implications of this for the correctness of the validation procedure and I accept them.
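To make the copy-and-rename step concrete, here is a minimal sketch; the model folder is an assumption taken from the traceback above, and you may want to back up the fold folders first:

```python
import os
import shutil

# Assumption: model folder taken from the traceback above -- adjust to your setup.
model_dir = ("/scratch/gpfs/nc1514/specseg/data/nnUNet/nnUNet_results/"
             "Dataset001_CO2/nnUNetTrainer__nnUNetResEncUNetLPlans__2d")

for fold in range(5):
    fold_dir = os.path.join(model_dir, f"fold_{fold}")
    latest = os.path.join(fold_dir, "checkpoint_latest.pth")
    final = os.path.join(fold_dir, "checkpoint_final.pth")
    # Only fill in checkpoint_final.pth where it is missing and a latest checkpoint exists.
    if os.path.isfile(latest) and not os.path.isfile(final):
        shutil.copy2(latest, final)
        print(f"fold_{fold}: copied checkpoint_latest.pth -> checkpoint_final.pth")
```

Afterwards, run the validation for each fold with something like `nnUNetv2_train 1 2d FOLD -p nnUNetResEncUNetLPlans --val --npz`, so that the "validation" folder and the .npz files are produced and `nnUNetv2_find_best_configuration` has what it needs.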
The true fix
- A clean approach would of course be to subclass `nnUNetTrainer` with the correct number of epochs (or an early stopping criterion) and run the full training procedure with this custom trainer; a minimal sketch follows this list.
- The reason I use the workaround above is that sometimes I need to start training without knowing the proper number of epochs, and 1000 is just too long. But I still want to run `nnUNetv2_find_best_configuration` on the work I performed.
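For completeness, a minimal sketch of such a trainer. The class name and epoch count are placeholders, and I am assuming the usual pattern of overriding `num_epochs` in the constructor (check the trainer variants shipped with your installed nnU-Net version for the exact signature):

```python
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainer_100epochs(nnUNetTrainer):
    """Hypothetical trainer variant that stops after 100 instead of 1000 epochs."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_epochs = 100  # assumption: pick whatever budget you actually want
```

The file needs to live somewhere nnU-Net's trainer lookup can find it (by default it searches the `nnunetv2.training.nnUNetTrainer` package); then you can train with something like `nnUNetv2_train 1 2d FOLD -p nnUNetResEncUNetLPlans -tr nnUNetTrainer_100epochs`, and `checkpoint_final.pth` plus the "validation" folder are produced as usual.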
For @GregorKoehler, I think there is no bug here; the tool is working as intended. BUT since multiple issues like this have been opened, I think something needs to be done to correct the false user expectation that, because there are checkpoints every 50 epochs AND the `--npz` flag has been set, interrupting the training early is fine and `nnUNetv2_find_best_configuration` can still be used. Maybe something about this could go into how_to_use_nnunet (I can do a PR if you want)? Alternatively, a fix would be for the function `maybe_load_checkpoint` in `run_training.py` to set `expected_checkpoint_file` to `checkpoint_latest.pth` in the `validation_only == True` branch as well, if `checkpoint_final.pth` fails the `isfile` check. But I'm not sure whether that is acceptable design-wise...
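To illustrate that second suggestion, the fallback I have in mind would look roughly like the helper below. This is a paraphrase of the behaviour described above, not code from `run_training.py`, and the function name is hypothetical:

```python
from os.path import isfile, join


def pick_validation_checkpoint(fold_output_folder: str) -> str:
    """Hypothetical helper: prefer checkpoint_final.pth, fall back to checkpoint_latest.pth."""
    expected_checkpoint_file = join(fold_output_folder, "checkpoint_final.pth")
    if not isfile(expected_checkpoint_file):
        # Training was interrupted before the final epoch; warn loudly and fall back.
        print("WARNING: checkpoint_final.pth not found, falling back to checkpoint_latest.pth. "
              "Training did not run to completion, so the validation results may not be valid.")
        expected_checkpoint_file = join(fold_output_folder, "checkpoint_latest.pth")
    return expected_checkpoint_file
```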
TL;DR
@nathanchenseanwalter for some reason your "validation" folder and `checkpoint_final.pth` are missing. You can copy `checkpoint_latest.pth`, rename it to `checkpoint_final.pth`, and run `nnUNetv2_train` with the `--val` flag on your folds afterwards; that should allow you to use `nnUNetv2_find_best_configuration`.
Thank you, @FWijanto! Maybe it could print a warning in that case, something like "You haven't completed 1000 epochs, so the results may not be valid".