
nnUNetResEncUNetLPlans unable to find results for best configuration

Open nathanchenseanwalter opened this issue 7 months ago • 2 comments

After finishing my train, I ran

nnUNetv2_find_best_configuration 1 -p nnUNetResEncUNetLPlans -c 2d

but got this message

Traceback (most recent call last):
  File "/scratch/gpfs/nc1514/specseg/.venv/bin/nnUNetv2_find_best_configuration", line 10, in <module>
    sys.exit(find_best_configuration_entry_point())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/gpfs/nc1514/specseg/.venv/lib/python3.12/site-packages/nnunetv2/evaluation/find_best_configuration.py", line 296, in find_best_configuration_entry_point
    find_best_configuration(dataset_name, model_dict, allow_ensembling=not args.disable_ensembling,
  File "/scratch/gpfs/nc1514/specseg/.venv/lib/python3.12/site-packages/nnunetv2/evaluation/find_best_configuration.py", line 101, in find_best_configuration
    accumulate_cv_results(output_folder, merged_output_folder, folds, num_processes, overwrite)
  File "/scratch/gpfs/nc1514/specseg/.venv/lib/python3.12/site-packages/nnunetv2/evaluation/accumulate_cv_results.py", line 36, in accumulate_cv_results
    raise RuntimeError(f"fold {f} of model {trained_model_folder} is missing. Please train it!")
RuntimeError: fold 0 of model /scratch/gpfs/nc1514/specseg/data/nnUNet/nnUNet_results/Dataset001_CO2/nnUNetTrainer__nnUNetResEncUNetLPlans__2d is missing. Please train it!

However, I already trained them as shown in the directory below.

[screenshot: directory listing of the trained fold folders under nnUNet_results]

nathanchenseanwalter avatar May 08 '25 22:05 nathanchenseanwalter

Hi everyone. I had the same issue, investigated the root cause, and found a workaround. @nathanchenseanwalter did you stop your training prematurely (i.e. before 1000 epochs) for your folds? In my case, this was the cause of the problem.

What happens

  • The documentation tells you to run nnUNetv2_train with the --npz flag in order to use nnUNetv2_find_best_configuration
  • I interrupted the training before epoch 1000, so there is no checkpoint_final.pth.
  • Similarly, because training was interrupted early, the final validation was never run, so there is no "validation" folder and no .npz files. And since checkpoint_final.pth does not exist, you cannot simply rerun nnUNetv2_train with the --val flag.
  • When nnUNetv2_find_best_configuration calls accumulate_cv_results, it does not find a "validation" folder inside the "fold_n" folders, so it raises the error above asking you to train the missing folds.
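The steps above boil down to a simple existence check. Here is a simplified sketch of that check (not the actual nnU-Net code, which also verifies summary files and checkpoints) to make the failure mode concrete:

```python
from pathlib import Path

def check_folds(trained_model_folder: str, folds=(0, 1, 2, 3, 4)) -> None:
    """Simplified sketch of the check performed by accumulate_cv_results:
    a fold only counts as trained if its completed-validation folder exists,
    which nnU-Net only creates after the final epoch writes checkpoint_final.pth."""
    for f in folds:
        fold_dir = Path(trained_model_folder) / f"fold_{f}"
        if not (fold_dir / "validation").is_dir():
            raise RuntimeError(
                f"fold {f} of model {trained_model_folder} is missing. Please train it!"
            )
```

So even though the fold_n folders themselves exist (as in the screenshot above), the error fires because the "validation" subfolders were never created.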

The workaround

  • The cleanest workaround in this case, in my opinion, is to make sure checkpoint_final.pth exists and then rerun nnUNetv2_train with the --val flag.
  • Concretely: copy checkpoint_latest.pth, rename the copy to checkpoint_final.pth, and run the validation.
  • My folds were run for inconsistent numbers of epochs. I understand the implications this has for the correctness of the validation procedure, and I accept them.
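The copy/rename step above can be scripted across all folds. A minimal sketch (the helper name and the default 5-fold layout are my assumptions; it deliberately copies rather than renames, so checkpoint_latest.pth is preserved):

```python
import shutil
from pathlib import Path

def promote_latest_checkpoints(model_folder: str, folds=(0, 1, 2, 3, 4)) -> list:
    """For each fold_n folder, copy checkpoint_latest.pth to checkpoint_final.pth
    so that a later `nnUNetv2_train ... --val` run finds a "final" checkpoint.
    Skips folds that already have checkpoint_final.pth or no latest checkpoint."""
    promoted = []
    for f in folds:
        fold_dir = Path(model_folder) / f"fold_{f}"
        latest = fold_dir / "checkpoint_latest.pth"
        final = fold_dir / "checkpoint_final.pth"
        if latest.is_file() and not final.is_file():
            shutil.copy2(latest, final)  # copy, don't rename: keeps the original
            promoted.append(str(final))
    return promoted
```

Afterward, rerun validation per fold, e.g. `nnUNetv2_train 1 2d FOLD -p nnUNetResEncUNetLPlans --val --npz`.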

The true fix

  • A clean approach would of course be to subclass nnUNetTrainer with the correct number of epochs (or an early-stopping criterion) and rerun the full training with this custom trainer.
  • The reason I use the workaround instead is that I sometimes need to start training without knowing the proper number of epochs, and 1000 is just too long. But I still want to run nnUNetv2_find_best_configuration on the work I have already done.
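The subclassing pattern is essentially just overriding self.num_epochs in the trainer's __init__. The sketch below uses a toy stand-in base class so it runs without nnunetv2 installed; in a real setup you would subclass nnunetv2.training.nnUNetTrainer.nnUNetTrainer (and, if I remember correctly, recent nnU-Net versions already ship training-length variants such as nnUNetTrainer_250epochs):

```python
class nnUNetTrainer:
    """Toy stand-in for the real nnUNetTrainer, for illustration only."""
    def __init__(self):
        self.num_epochs = 1000  # nnU-Net's default training length

class nnUNetTrainer_250epochs(nnUNetTrainer):
    """Shorter schedule; training still finishes properly, so
    checkpoint_final.pth is written and validation runs as usual."""
    def __init__(self):
        super().__init__()
        self.num_epochs = 250
```

You would then select the custom trainer on the command line with the -tr flag, e.g. `nnUNetv2_train 1 2d FOLD -tr nnUNetTrainer_250epochs --npz`.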

For @GregorKoehler: I don't think there is a bug here, the tool works as intended. BUT since multiple issues like this have been opened, something should probably be done to correct the false user expectation that, because checkpoints are written every 50 epochs AND the --npz flag was set, interrupting training early is fine and nnUNetv2_find_best_configuration will still work. Maybe a note about this in how_to_use_nnunet (I can open a PR if you want)? Alternatively, maybe_load_checkpoint in run_training.py could set expected_checkpoint_file to checkpoint_latest.pth in the validation_only == True branch when checkpoint_final.pth fails the isfile check. But I'm not sure whether that is acceptable design-wise...

TL;DR

@nathanchenseanwalter for some reason your "validation" folders and checkpoint_final.pth files are missing. Copy checkpoint_latest.pth, rename the copy to checkpoint_final.pth, then run nnUNetv2_train with the --val flag on each of your folds. After that, you should be able to use nnUNetv2_find_best_configuration.

FWijanto avatar Jul 16 '25 16:07 FWijanto

Thank you @FWijanto! Maybe it could emit a warning along the lines of "You haven't completed 1000 epochs, so results may not be valid".

nathanchenseanwalter avatar Oct 24 '25 13:10 nathanchenseanwalter