
Training RuntimeError: One or more background workers are no longer alive

Open nrepina opened this issue 8 months ago • 8 comments

Hello, I am trying to run training with the command:

nnUNetv2_train 3 2d 0 --npz

in Python 3.10, on an Nvidia Tesla A100 GPU, on our computing cluster, requesting 48 GB of memory. I've tried decreasing the batch size of the dataset to 10. This should certainly be enough memory for a 2D training (I see max memory usage for the job is 2.2 GB). However, the job keeps erroring out within a minute of starting the training run. I see this is a common issue, e.g. #2297, #2595, #2749. Any advice would be much appreciated.

Please see logs below:


2025-04-02 15:31:28.731068: The split file contains 5 splits.
2025-04-02 15:31:28.743055: Desired fold for training: 0
2025-04-02 15:31:28.752051: This split has 13 training and 4 validation cases.
using pin_memory on device 0
Max memory usage in bytes: 2230145024

[INFO] [2025-04-02T15:31:43+02:00] [1549827] Workflow finished with code
[INFO] [2025-04-02T15:31:43+02:00] [1549827] Workflow execution time (seconds) : 71


Error logs:

Exception in thread Thread-3 (results_loop):
Traceback (most recent call last):
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Traceback (most recent call last):
  File "/path/Python/mamba-envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 267, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 207, in run_training
    nnunet_trainer.run_training()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1363, in run_training
    self.run()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    self.on_train_start()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 900, in on_train_start
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
    self.dataloader_train, self.dataloader_val = self.get_dataloaders()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 690, in get_dataloaders
    _ = next(mt_gen_train)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message


Environment (mamba) info:

Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge acvl-utils 0.2.5 pypi_0 pypi anyio 4.9.0 pypi_0 pypi aom 3.9.1 hac33072_0 conda-forge argparse 1.4.0 pypi_0 pypi attr 2.5.1 h166bdaf_1 conda-forge batchgenerators 0.25.1 pypi_0 pypi batchgeneratorsv2 0.2.3 pypi_0 pypi blosc2 3.0.0b4 pypi_0 pypi bzip2 1.0.8 h4bc722e_7 conda-forge ca-certificates 2025.1.31 hbcca054_0 conda-forge certifi 2025.1.31 pypi_0 pypi charset-normalizer 3.4.1 pypi_0 pypi connected-components-3d 3.23.0 pypi_0 pypi contourpy 1.3.1 pypi_0 pypi cuda-crt-tools 12.8.93 ha770c72_1 conda-forge cuda-cudart 12.8.90 h5888daf_1 conda-forge cuda-cudart_linux-64 12.8.90 h3f2d84a_1 conda-forge cuda-cuobjdump 12.8.90 hbd13f7d_1 conda-forge cuda-cupti 12.8.90 hbd13f7d_0 conda-forge cuda-nvcc-tools 12.8.93 he02047a_1 conda-forge cuda-nvdisasm 12.8.90 hbd13f7d_1 conda-forge cuda-nvrtc 12.8.93 h5888daf_1 conda-forge cuda-nvtx 12.8.90 hbd13f7d_0 conda-forge cuda-nvvm-tools 12.8.93 he02047a_1 conda-forge cuda-version 12.8 h5d125a7_3 conda-forge cudnn 9.8.0.87 h81d5506_0 conda-forge cusparselt 0.7.0.0 hcd2ec93_0 conda-forge cycler 0.12.1 pypi_0 pypi dav1d 1.2.1 hd590300_0 conda-forge dicom2nifti 2.6.0 pypi_0 pypi dynamic-network-architectures 0.3.1 pypi_0 pypi einops 0.8.1 pypi_0 pypi exceptiongroup 1.2.2 pypi_0 pypi fft-conv-pytorch 1.2.0 pypi_0 pypi filelock 3.18.0 pyhd8ed1ab_0 conda-forge fonttools 4.56.0 pypi_0 pypi freetype 2.13.3 h48d6fc4_0 conda-forge fsspec 2025.3.2 pyhd8ed1ab_0 conda-forge future 1.0.0 pypi_0 pypi giflib 5.2.2 hd590300_0 conda-forge gmp 6.3.0 hac33072_2 conda-forge gmpy2 2.1.5 py310he8512ff_3 conda-forge h11 0.14.0 pypi_0 pypi httpcore 1.0.7 pypi_0 pypi httpx 0.28.1 pypi_0 pypi idna 3.10 pypi_0 pypi imagecodecs 2025.3.30 pypi_0 pypi imageio 2.37.0 pypi_0 pypi importlib-resources 6.5.2 pypi_0 pypi jinja2 3.1.6 pyhd8ed1ab_0 conda-forge joblib 1.4.2 pypi_0 pypi kiwisolver 1.4.8 pypi_0 pypi lazy-loader 0.4 pypi_0 pypi lcms2 2.17 h717163a_0 conda-forge ld_impl_linux-64 2.43 h712a8e2_4 conda-forge lerc 4.0.0 h27087fc_0 conda-forge libabseil 20240722.0 cxx17_hbbce691_4 conda-forge libavif16 1.2.1 hbb36593_2 conda-forge libblas 3.9.0 31_h59b9bed_openblas conda-forge libcap 2.75 h39aace5_0 conda-forge libcblas 3.9.0 31_he106b2a_openblas conda-forge libcublas 12.8.4.1 h9ab20c4_1 conda-forge libcudss0 0.4.0.2 he55f5cd_2 conda-forge libcufft 11.3.3.83 h5888daf_1 conda-forge libcufile 1.13.1.3 h12f29b5_0 conda-forge libcurand 10.3.9.90 h9ab20c4_1 conda-forge libcusolver 11.7.3.90 h9ab20c4_1 conda-forge libcusparse 12.5.8.93 hbd13f7d_0 conda-forge libde265 1.0.15 h00ab1b0_0 conda-forge libdeflate 1.23 h4ddbbb0_0 conda-forge libffi 3.4.6 h2dba641_1 conda-forge libgcc 14.2.0 h767d61c_2 conda-forge libgcc-ng 14.2.0 h69a702a_2 conda-forge libgcrypt-lib 1.11.0 hb9d3cd8_2 conda-forge libgfortran 14.2.0 h69a702a_2 conda-forge libgfortran5 14.2.0 hf1ad2bd_2 conda-forge libgomp 14.2.0 h767d61c_2 conda-forge libgpg-error 1.51 hbd13f7d_1 conda-forge libheif 1.19.7 gpl_hc18d805_100 conda-forge libiconv 1.18 h4ce23a2_1 conda-forge libjpeg-turbo 3.0.0 hd590300_1 conda-forge liblapack 3.9.0 31_h7ac8fdf_openblas conda-forge libllvm20 20.1.1 ha7bfdaf_0 conda-forge liblzma 5.6.4 hb9d3cd8_0 conda-forge libmagma 2.8.0 h566cb83_2 conda-forge libnl 3.11.0 hb9d3cd8_0 conda-forge libnsl 2.0.1 hd590300_0 conda-forge libnvjitlink 12.8.93 h5888daf_1 conda-forge libnvjpeg 12.3.5.92 h97fd463_0 conda-forge libopenblas 0.3.29 pthreads_h94d23a6_0 conda-forge libpng 1.6.47 h943b412_0 conda-forge libprotobuf 
5.28.3 h6128344_1 conda-forge libsqlite 3.49.1 hee588c1_2 conda-forge libstdcxx 14.2.0 h8f9b012_2 conda-forge libstdcxx-ng 14.2.0 h4852527_2 conda-forge libsystemd0 257.4 h4e0b6ca_1 conda-forge libtiff 4.7.0 hd9ff511_3 conda-forge libtorch 2.6.0 cuda126_generic_h4a15719_200 conda-forge libudev1 257.4 hbe16f8c_1 conda-forge libuuid 2.38.1 h0b41bf4_0 conda-forge libuv 1.50.0 hb9d3cd8_0 conda-forge libwebp-base 1.5.0 h851e524_0 conda-forge libxcb 1.17.0 h8a09558_0 conda-forge libxcrypt 4.4.36 hd590300_1 conda-forge libxml2 2.13.7 h0d44e9d_0 conda-forge libzlib 1.3.1 hb9d3cd8_2 conda-forge linecache2 1.0.0 pypi_0 pypi lz4-c 1.10.0 h5888daf_1 conda-forge markupsafe 3.0.2 py310h89163eb_1 conda-forge matplotlib 3.10.1 pypi_0 pypi mpc 1.3.1 h24ddda3_1 conda-forge mpfr 4.2.1 h90cbb55_3 conda-forge mpmath 1.3.0 pyhd8ed1ab_1 conda-forge msgpack 1.1.0 pypi_0 pypi nccl 2.26.2.1 ha44e49d_0 conda-forge ncurses 6.5 h2d0b736_3 conda-forge ndindex 1.9.2 pypi_0 pypi networkx 3.4.2 pyh267e887_2 conda-forge nibabel 5.3.2 pypi_0 pypi nnunetv2 2.6.0 pypi_0 pypi nomkl 1.0 h5ca1d4c_0 conda-forge numexpr 2.10.2 pypi_0 pypi numpy 2.2.4 py310hefbff90_0 conda-forge openjpeg 2.5.3 h5fbd93e_0 conda-forge openssl 3.4.1 h7b32b05_0 conda-forge optree 0.14.1 py310h3788b33_1 conda-forge packaging 24.2 pypi_0 pypi pandas 2.2.3 pypi_0 pypi pillow 11.1.0 py310h7e6dc6c_0 conda-forge pip 25.0.1 pyh8b19718_0 conda-forge pthread-stubs 0.4 hb9d3cd8_1002 conda-forge py-cpuinfo 9.0.0 pypi_0 pypi pybind11 2.13.6 pyh1ec8472_2 conda-forge pybind11-global 2.13.6 pyh415d2e4_2 conda-forge pydicom 3.0.1 pypi_0 pypi pyparsing 3.2.3 pypi_0 pypi python 3.10.16 he725a3c_1_cpython conda-forge python-dateutil 2.9.0.post0 pypi_0 pypi python-gdcm 3.0.24.1 pypi_0 pypi python-graphviz 0.20.3 pypi_0 pypi python_abi 3.10 6_cp310 conda-forge pytorch 2.6.0 cuda126_generic_py310_h9bb2754_200 conda-forge pytz 2025.2 pypi_0 pypi pyyaml 6.0.2 pypi_0 pypi rav1e 0.6.6 he8a937b_2 conda-forge rdma-core 56.0 h5888daf_0 conda-forge readline 8.2 h8c095d6_2 conda-forge requests 2.32.3 pypi_0 pypi scikit-image 0.25.2 pypi_0 pypi scikit-learn 1.6.1 pypi_0 pypi scipy 1.15.2 pypi_0 pypi seaborn 0.13.2 pypi_0 pypi setuptools 75.8.2 pyhff2d567_0 conda-forge simpleitk 2.4.1 pypi_0 pypi six 1.17.0 pypi_0 pypi sleef 3.8 h1b44611_0 conda-forge sniffio 1.3.1 pypi_0 pypi svt-av1 3.0.2 h5888daf_0 conda-forge sympy 1.13.3 pypyh2585a3b_103 conda-forge threadpoolctl 3.6.0 pypi_0 pypi tifffile 2025.3.30 pypi_0 pypi tk 8.6.13 noxft_h4845f30_101 conda-forge torchvision 0.21.0 cuda126_py310_h4459643_1 conda-forge torchvision-extra-decoders 0.0.2 py310h9a3ef1b_2 conda-forge tqdm 4.67.1 pypi_0 pypi traceback2 1.4.0 pypi_0 pypi triton 3.2.0 cuda126py310h50ec074_1 conda-forge typing-extensions 4.13.0 h9fa5a19_1 conda-forge typing_extensions 4.13.0 pyh29332c3_1 conda-forge tzdata 2025.2 pypi_0 pypi unittest2 1.1.0 pypi_0 pypi urllib3 2.3.0 pypi_0 pypi wheel 0.45.1 pyhd8ed1ab_1 conda-forge x265 3.5 h924138e_3 conda-forge xorg-libxau 1.0.12 hb9d3cd8_0 conda-forge xorg-libxdmcp 1.1.5 hb9d3cd8_0 conda-forge yacs 0.1.8 pypi_0 pypi zstd 1.5.7 hb8e6e7a_2 conda-forge

nrepina • Apr 02 '25 14:04

Update: As per #2749 and #2659, setting the environment variable export nnUNet_n_proc_DA=0 allowed training to proceed to epoch 0, but soon after I get this dynamo error:

/tmp/tmprc_koqgt/main.c:1:10: fatal error: cuda.h: No such file or directory
 #include "cuda.h"
          ^~~~~~~~
compilation terminated.
Traceback (most recent call last):
  File "/path/Python/mamba-envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 267, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/run/run_training.py", line 207, in run_training
    nnunet_trainer.run_training()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1371, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 989, in train_step
    output = self.network(data)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 574, in _fn
    return fn(*args, **kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1380, in __call__
    return self._torchdynamo_orig_callable(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1164, in __call__
    result = self._inner_convert(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 547, in __call__
    return _compile(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 986, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 715, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_utils_internal.py", line 95, in wrapper_function
    return function(*args, **kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 750, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
    transformations(instructions, code_options)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 231, in _fn
    return fn(*args, **kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 662, in transform
    tracer.run()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2868, in run
    super().run()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1052, in run
    while self.step():
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 962, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 3048, in RETURN_VALUE
    self._return(inst)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 3033, in _return
    self.output.compile_subgraph(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1136, in compile_subgraph
    self.compile_and_call_fx_graph(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1382, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1432, in call_user_compiler
    return self._call_user_compiler(gm)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1483, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1462, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/__init__.py", line 2340, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1863, in compile_fx
    return aot_autograd(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1155, in aot_module_simplified
    compiled_fn = dispatch_and_compile()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 678, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1741, in fw_compiler_base
    return inner_compile(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 569, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 685, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2027, in compile_to_module
    return self._compile_to_module()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2033, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1968, in codegen
    self.scheduler.codegen()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 3477, in codegen
    return self._codegen()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 3554, in _codegen
    self.get_backend(device).codegen_node(node)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 80, in codegen_node
    return self._triton_scheduling.codegen_node(node)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1219, in codegen_node
    return self.codegen_node_schedule(
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1263, in codegen_node_schedule
    src_code = kernel.codegen_kernel()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/codegen/triton.py", line 3154, in codegen_kernel
    **self.inductor_meta_common(),
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/_inductor/codegen/triton.py", line 3013, in inductor_meta_common
    "backend_hash": torch.utils._triton.triton_hash_with_backend(),
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/utils/_triton.py", line 111, in triton_hash_with_backend
    backend = triton_backend()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/torch/utils/_triton.py", line 103, in triton_backend
    target = driver.active.get_current_target()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 432, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/runtime/build.py", line 71, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/path/Python/mamba-envs/nnunet/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmprc_koqgt/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmprc_koqgt/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-I/path/Python/mamba-envs/nnunet/lib/python3.10/site-packages/triton/backends/nvidia/include', '-I/tmp/tmprc_koqgt', '-I/path/Python/mamba-envs/nnunet/include/python3.10', '-I/path/Python/mamba-envs/nnunet/targets/x86_64-linux/include']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Following the suggestions from #2712, #2607, and #2595, setting nnUNet_compile=False removes the dynamo error and training is running for now - fingers crossed!
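For reference, a minimal sketch of how both workarounds can be combined in one shell session (same dataset/fold/flags as in my original command; the variables only take effect if they are set in the same terminal before nnUNetv2_train is launched):

    # fall back to single-process data augmentation (works around the dead background workers)
    export nnUNet_n_proc_DA=0
    # disable torch.compile so inductor/triton does not need to build against cuda.h
    export nnUNet_compile=False
    nnUNetv2_train 3 2d 0 --npz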

nrepina • Apr 02 '25 21:04

Hello, I am facing the same issue when running the nnUNetv2_train command.

Error logs:

2025-04-03 11:19:53.982917: unpacking dataset...
2025-04-03 11:19:54.315469: unpacking done...
2025-04-03 11:19:54.317708: do_dummy_2d_data_aug: False
2025-04-03 11:19:54.329118: Using splits from existing split file: /home/luna.kuleuven.be/u0121257/nnUNet_preprocessed/Dataset525_H&ENSCLC/splits_final.json
2025-04-03 11:19:54.331421: The split file contains 5 splits.
2025-04-03 11:19:54.331486: Desired fold for training: 0
2025-04-03 11:19:54.331519: This split has 2144 training and 536 validation cases.
2025-04-03 11:19:57.674011: Unable to plot network architecture:
2025-04-03 11:19:57.674453: module 'torch.onnx' has no attribute '_optimize_trace'
2025-04-03 11:19:57.705627:
2025-04-03 11:19:57.705705: Epoch 0
2025-04-03 11:19:57.705839: Current learning rate: 0.01
using pin_memory on device 0
Exception in background worker 7: Cannot load file containing pickled data when allow_pickle=False
Traceback (most recent call last):
  File "/path/miniconda3/envs/nnUNet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
    item = next(data_loader)
  File "/path/miniconda3/envs/nnUNet/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__
    return self.generate_train_batch()
  File "/path/nnUNet/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py", line 18, in generate_train_batch
    data, seg, properties = self._data.load_case(current_key)
  File "/path/nnUNet/nnUNet/nnunetv2/training/dataloading/nnunet_dataset.py", line 97, in load_case
    seg = np.load(entry['data_file'][:-4] + "_seg.npy", 'r')
  File "/path/miniconda3/envs/nnUNet/lib/python3.10/site-packages/numpy/lib/npyio.py", line 438, in load
    raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False
Traceback (most recent call last):
  File "/path/miniconda3/envs/nnUNet/bin/nnUNetv2_train", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "/path/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 247, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/path/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 190, in run_training
    nnunet_trainer.run_training()
  File "/path/nnUNet/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1210, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/path/miniconda3/envs/nnUNet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/path/miniconda3/envs/nnUNet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

mrahimpour • Apr 03 '25 09:04

I know this issue is closed, but just wanted to say that @seziegler's solution from issue #2712 fixed my issue when I got the following error message (copy-pasted below) using the 2d config.

It is the first time I have tried the 2d config (new project); I have been using the 3d_fullres config for a while and never faced this issue. Maybe it is an issue particular to the 2d data loading process only?

Error message (ignore the first few lines, where a self-made script to automate all folds is printing stuff):

""" Executing command: nnUNetv2_train 016 2d 0 --npz -p UNet_ResEncL_3d-interp -tr nnUNetTrainer_500epochs_adamW --val_best | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Starting training for dataset 016, fold 0, plan UNet_ResEncL_3d-interp ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "
Traceback (most recent call last):
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
    item = next(data_loader)
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__
    return self.generate_train_batch()
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py", line 21, in generate_train_batch
    data, seg, properties = self._data.load_case(current_key)
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case
    data = np.load(entry['data_file'][:-4] + ".npy", 'r')
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/npyio.py", line 453, in load
    return format.open_memmap(file, mode=mmap_mode,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/format.py", line 945, in open_memmap
    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/core/memmap.py", line 268, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: mmap length is greater than file size
Traceback (most recent call last):
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
    item = next(data_loader)
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__
    return self.generate_train_batch()
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py", line 21, in generate_train_batch
    data, seg, properties = self._data.load_case(current_key)
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case
    data = np.load(entry['data_file'][:-4] + ".npy", 'r')
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/npyio.py", line 453, in load
    return format.open_memmap(file, mode=mmap_mode,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/format.py", line 945, in open_memmap
    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/core/memmap.py", line 268, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: mmap length is greater than file size
Traceback (most recent call last):
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
    item = next(data_loader)
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__
    return self.generate_train_batch()
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py", line 21, in generate_train_batch
    data, seg, properties = self._data.load_case(current_key)
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case
    data = np.load(entry['data_file'][:-4] + ".npy", 'r')
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/npyio.py", line 453, in load
    return format.open_memmap(file, mode=mmap_mode,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/format.py", line 945, in open_memmap
    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/core/memmap.py", line 268, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: mmap length is greater than file size
Traceback (most recent call last):
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
    item = next(data_loader)
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in __next__
    return self.generate_train_batch()
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py", line 21, in generate_train_batch
    data, seg, properties = self._data.load_case(current_key)
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/dataloading/nnunet_dataset.py", line 86, in load_case
    data = np.load(entry['data_file'][:-4] + ".npy", 'r')
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/npyio.py", line 453, in load
    return format.open_memmap(file, mode=mmap_mode,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/lib/format.py", line 945, in open_memmap
    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/numpy/core/memmap.py", line 268, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: mmap length is greater than file size
Traceback (most recent call last):
  File "/home/marcantf/miniconda3/envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/run/run_training.py", line 274, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/run/run_training.py", line 210, in run_training
    nnunet_trainer.run_training()
  File "/home/marcantf/Code/PhD-python/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/home/marcantf/miniconda3/envs/nnunet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

""" 

mafortin • Apr 03 '25 10:04

I've been experiencing the same issue while training on a cluster using specific GPU models such as the A100 and A40. I'm wondering if there's a solution to ensure compatibility.

Thank you!

RRouhi • Apr 24 '25 18:04

Which one of you still has this issue? If you do, please specify your operating system, Linux distro, and whether you are running nnU-Net in a Docker container.

FabianIsensee • May 28 '25 20:05

As mentioned above, I fixed it by setting nnUNet_compile=False before running the same command in the same terminal window. However, I quickly moved away from the 2d config and focused on 3d_fullres, which never prompts such issues on my side.
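For example, a sketch of what that looks like with the command from my earlier comment (prefixing the variable scopes it to that single invocation):

    nnUNet_compile=False nnUNetv2_train 016 2d 0 --npz -p UNet_ResEncL_3d-interp -tr nnUNetTrainer_500epochs_adamW --val_best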

mafortin • May 29 '25 12:05

ValueError: mmap length is greater than file size is a standard error that happens if the preprocessed data is corrupted. You should fix it by rerunning preprocessing. This should not be related to nnUNet_compile. Since compile gives you 25% speed for free, I strongly recommend you leave it enabled and try to fix any bug arising from it.
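As a rough sketch, rerunning preprocessing for the affected dataset could look like the following (dataset ID 16 is just the one from the 2d run above, and this assumes the default plans; a custom plans setup would need its corresponding planning/preprocessing call):

    nnUNetv2_plan_and_preprocess -d 16 --verify_dataset_integrity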

FabianIsensee • May 30 '25 11:05

Which one of you still has this issue? If you do, please specify your operating system, Linux distro, and whether you are running nnU-Net in a Docker container.

I am still experiencing this issue; OS: Linux, not using Docker!

mrahimpour • May 30 '25 13:05