FairMOT
RuntimeError: Caught RuntimeError in replica 0 on device 0
I get the following error when I run the training command in the terminal: sh experiments/crowdhuman_dla34.sh
Traceback (most recent call last):
  File "train.py", line 98, in <module>
    main(opt)
  File "train.py", line 69, in main
    log_dict_train, _ = trainer.train(epoch, train_loader)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/trains/base_trainer.py", line 119, in train
    return self.run_epoch('train', epoch, data_loader)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/trains/base_trainer.py", line 71, in run_epoch
    output, loss, loss_stats = model_with_loss(batch)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/trains/base_trainer.py", line 19, in forward
    outputs = self.model(batch['input'])
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/models/networks/pose_dla_dcn.py", line 472, in forward
    x = self.dla_up(x)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/models/networks/pose_dla_dcn.py", line 411, in forward
    ida(layers, len(layers) -i - 2, len(layers))
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/models/networks/pose_dla_dcn.py", line 384, in forward
    layers[i] = upsample(project(layers[i]))
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 927, in forward
    return F.conv_transpose2d(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
Does anyone have a solution for this issue? Thanks!
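In case it helps narrow this down: the traceback ends in F.conv_transpose2d, so one way to check whether the failure is specific to the FairMOT model or to the GPU/cuDNN setup is to run a single transposed convolution on its own. The sketch below is only an illustration, and it is not the repro snippet that the error message refers to. The function name probe_conv_transpose2d, the channel count, the input size, and the layer parameters (kernel 4, stride 2, padding 1, depthwise groups) are assumptions chosen for the example and may not match what pose_dla_dcn.py actually builds.

```python
def probe_conv_transpose2d(channels=64, size=32):
    """Run one small ConvTranspose2d forward pass on the GPU and report the outcome.

    Hypothetical helper: all layer parameters below are assumed for illustration
    and are not taken from the FairMOT source.
    """
    try:
        import torch
    except ImportError:
        return "torch-missing"

    if not torch.cuda.is_available():
        return "no-cuda"

    try:
        # Assumed layer shape: depthwise transposed convolution that doubles H and W.
        layer = torch.nn.ConvTranspose2d(
            channels, channels,
            kernel_size=4, stride=2, padding=1,
            groups=channels, bias=False,
        ).cuda()
        x = torch.randn(1, channels, size, size, device="cuda")
        y = layer(x)
        # CUDA kernels run asynchronously; synchronize so an error surfaces here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return f"failed: {err}"
    return f"ok: {tuple(y.shape)}"


if __name__ == "__main__":
    print(probe_conv_transpose2d())
```

If this probe also fails with a cuDNN error, the problem is probably in the environment (driver, CUDA, cuDNN, or PyTorch build) rather than in the FairMOT code. If it succeeds, the failure more likely depends on the actual shapes, batch size, or memory use during training.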