nnUNet
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: RuntimeError: Cannot find a working triton installation. More information on installing Triton can be found at https://github.com/openai/triton
I don't know what is causing this error. Training ran normally until this problem suddenly occurred. Could you help me look into it?
2024-04-20 08:27:16.276530: Epoch 600
2024-04-20 08:27:16.276754: Current learning rate: 0.00438
Traceback (most recent call last):
File "/opt/conda/bin/nnUNetv2_train", line 8, in
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
Exception in thread Thread-3 (results_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
I tried many approaches but did not find a good solution. Could you please take a look? Thank you @mrokuss
With nnunetv2 2.3.1 I can train on the data normally, so for now I am using the old version to work around the problem. If you have time, could you look into why the new version does not work?
I have the same question. How can this be solved?
Hey @zhaoawen
As a first guess, this looks more like a torch import problem than an issue with nnUNet. Could you try updating torch to the newest version, or start with a completely new and up-to-date environment, and then run again with the newest nnUNet version?
Best,
Max
I'm getting the same error using the most up-to-date nnUNet. I did a clean environment install of the most up-to-date PyTorch with CUDA 12.1 and also followed the suggestions in https://github.com/MIC-DKFZ/batchgenerators/issues/23. I get the exact same error reported above.
I could solve this issue by installing triton separately (i.e. by doing pip install triton)
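Before reinstalling anything, it can help to confirm whether triton is visible to the Python interpreter that runs the training. The following is a minimal sketch using only the standard library; the helper name `triton_available` is made up for illustration. It only checks that the package can be located, not that it works with your GPU or driver.

```python
import importlib.util


def triton_available() -> bool:
    """Return True if a top-level `triton` package can be located.

    Hypothetical helper for illustration. It does not import triton,
    so it cannot tell whether the installation actually works.
    """
    return importlib.util.find_spec("triton") is not None


if triton_available():
    print("triton found in this environment")
else:
    print("triton not found: install it (Linux) or disable torch.compile")
```

Run it with the same interpreter as the training command (for example the one in the active conda environment); a different interpreter may see a different set of packages.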
I ran into the same problem. Older versions worked fine, but the latest version reports that triton is not installed, and I couldn't install triton on Windows.
Hey! I encountered a very similar issue. My Triton error message said: "torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: RuntimeError: Triton Error [CUDA]: device kernel image is invalid". I verified that I could connect to and use CUDA but still received this message. I followed zhaoawen's advice, uninstalled the latest version of nnUNet, and installed nnUNetv2-2.3.1, and it worked! So far my model is training (fingers crossed it stays that way).
git clone https://github.com/MIC-DKFZ/nnUNet.git
cd nnUNet
git checkout tags/v2.3.1 -b version-2.3.1-branch
Installing nnUNetv2-2.3.1 solved it for me.
As with the issues above, the problem occurs with my nnUNet version 2.5 installation. My earlier nnUNet projects, which run separately from the new installation, use version 2.3.1. They run on identical servers and still work fine.
hey @naga-karthik
I could solve this issue by installing triton separately (i.e. by doing pip install triton)
Sadly, this does not work on Windows, as triton is only supported on Linux.
Please install pytorch as specified in the installation instructions. Please use the most recent version with the highest available version of CUDA. I recommend using a conda environment. Triton will be automatically installed if you do it this way.
The reason you are encountering issues with triton is that as of v2.4 we enable torch.compile by default. Depending on the GPU, this results in training speed-ups of 10-30%, so it's definitely worth it.
If you want to disable torch.compile in nnU-Net, just export nnUNet_compile=f
or do nnUNet_compile=f nnUNetv2_train [...]
Best,
Fabian
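For anyone who launches training from a script rather than a shell, the same setting can be passed through the process environment. This is a minimal sketch, not part of nnU-Net: the helper names `env_without_compile` and `run_training` are made up for illustration, and it assumes `nnUNetv2_train` is on the PATH of the active environment.

```python
import os
import subprocess


def env_without_compile(base_env=None):
    """Return a copy of the environment with nnUNet_compile set to 'f'.

    Hypothetical helper. The caller's environment is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["nnUNet_compile"] = "f"
    return env


def run_training(train_args):
    """Run nnUNetv2_train with torch.compile disabled.

    `train_args` is the list of arguments you would normally type after
    the command; they are passed through unchanged.
    """
    cmd = ["nnUNetv2_train", *train_args]
    return subprocess.run(cmd, env=env_without_compile(), check=True)


# Usage (arguments are whatever you normally pass on the command line):
# run_training(your_usual_arguments)
```

Because the variable is set only for the child process, this works the same way on Linux and Windows and leaves the surrounding shell untouched.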
export nnUNet_compile=f
it solved it. Thanks
This does not work on Windows 10 with Anaconda PowerShell:
export nnUNet_compile=f
export : The term 'export' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ export nnUNet_compile=f
+ ~~~~~~
+ CategoryInfo : ObjectNotFound: (export:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException
Update:
I updated CUDA to 12.1, recreated the conda environment with Python 3.11.9, and installed nnunetv2 with pip install nnunetv2, but when I run the training it says RuntimeError: Cannot find a working triton installation. When I try to run export nnUNet_compile=f, I get the same "export" error message as mentioned above.
Update 2: Okay, now I understand. It is an environment variable that we have to set. In Anaconda PowerShell on Windows 10 this would be
conda env config vars set nnUNet_compile=f
Then it works
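The confusion in this thread comes from each shell having its own syntax for setting an environment variable. As a rough reference, here is a small lookup sketch; the helper name `disable_compile_command` and the shell keys are made up for illustration. The bash and conda commands are the ones quoted in this thread, and the PowerShell and cmd lines are the standard syntax of those shells.

```python
# Hypothetical lookup table: shell name -> command that sets nnUNet_compile=f.
SET_COMMANDS = {
    "bash": "export nnUNet_compile=f",                   # current session only
    "powershell": '$env:nnUNet_compile = "f"',           # current session only
    "cmd": "set nnUNet_compile=f",                       # current session only
    "conda": "conda env config vars set nnUNet_compile=f",  # stored in the env
}


def disable_compile_command(shell: str) -> str:
    """Return the command that sets nnUNet_compile=f for the given shell."""
    try:
        return SET_COMMANDS[shell.lower()]
    except KeyError:
        known = ", ".join(sorted(SET_COMMANDS))
        raise ValueError(f"unknown shell {shell!r}; known: {known}") from None


print(disable_compile_command("powershell"))
```

Note that the conda variant is saved with the environment and only takes effect after the environment is reactivated, whereas the other three apply only to the current shell session.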