Running Topaz on HPC fails with RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Dear Topaz Community,
I hope this message finds you well.
I am running Topaz from within Relion 5 on our HPC server. Relion 5 is loaded with the module command, but Topaz is not preinstalled alongside it. As advised, I installed Topaz under my home directory, following the installation instructions closely.
However, when I run a Topaz picking or training job with the Topaz location set to ~/.conda/envs/topaz/bin/topaz, I get the following error:
+ Will use topaz for training a model
+ Written out list of input training coordinates: AutoPick/job016/input_training_coords.star
+ Setting topaz downscale factor to 15 (assuming resnet8 model and 2*particle_diameter receptive box)
+ Setting topaz radius to 5 downscaled pixels (based on 25% of particle_diameter/2)
+ Using GPU device 0
+ Training with 79 picks in test set; and 315 picks in work set
+ By setting aside 4 micrographs for the test set
# Loading model: resnet8
# Model parameters: units=32, dropout=0.0, bn=on
# Loading pretrained model: resnet8_u32
# Receptive field: 71
# Using device=0 with cuda=True
# Loaded 20 training micrographs with 315 labeled particles
# Loaded 4 test micrographs with 79 labeled particles
# source split p_observed num_positive_regions total_regions
# 0 train 0.0203 25515 1254760
# 0 test 0.0255 6399 250952
# Specified expected number of particle per micrograph = 40.0
# With radius = 5
# Setting pi = 0.05164334215308107
# minibatch_size=256, epoch_size=1000, num_epochs=10
Traceback (most recent call last):
  File "/home/ch1225/.conda/envs/topaz/bin/topaz", line 33, in <module>
    sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/main.py", line 148, in main
    args.func(args)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 695, in main
    , save_prefix=save_prefix, use_cuda=use_cuda, output=output)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 577, in fit_epochs
    , use_cuda=use_cuda, output=output)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 557, in fit_epoch
    metrics = step_method.step(X, Y)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/methods.py", line 103, in step
    score = self.model(X).view(-1)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/model/classifier.py", line 28, in forward
    z = self.features(x)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/model/features/resnet.py", line 54, in forward
    z = self.features(x)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/model/features/resnet.py", line 270, in forward
    y = self.conv(x)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Could this instead be a problem with how the HPC environment communicates with the GPU nodes?
Any input is much appreciated!
Best regards,
David
This looks like a possible GPU RAM issue. Sometimes CUDA gives weird errors like this when it runs out of GPU RAM. Is anything else running on the GPU at the same time?
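One way to check this, as a minimal sketch rather than anything Topaz provides: run the script below on the same GPU node just before starting the job. It assumes nvidia-smi is on the PATH there, and the helper names parse_gpu_memory and query_gpu_memory are made up for this example. If device 0 already shows a lot of memory in use before training starts, another process is probably sharing the GPU.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        try:
            index, used, total = (int(field) for field in fields)
        except ValueError:
            continue  # skip rows where nvidia-smi reports a non-numeric value
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def query_gpu_memory():
    """Return per-GPU memory usage, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, like the env in the traceback
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    gpus = query_gpu_memory()
    if not gpus:
        print("No GPU information available (is this a GPU node?)")
    for gpu in gpus:
        free = gpu["total_mib"] - gpu["used_mib"]
        print("GPU {index}: {used_mib} MiB used of {total_mib} MiB".format(**gpu),
              "({} MiB free)".format(free))
```

If the GPU turns out to be idle and the error still appears, memory pressure from other jobs is less likely to be the cause.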
Closing this issue since it hasn't had any more discussion. @CFDavidHou feel free to reopen it if there is more to discuss.