RuntimeError: Caught RuntimeError in replica 0 on device 0

Open tdchua opened this issue 3 years ago • 0 comments

I get the following error when I run the training command in a terminal: sh experiments/crowdhuman_dla34.sh

Traceback (most recent call last):
  File "train.py", line 98, in <module>
    main(opt)
  File "train.py", line 69, in main
    log_dict_train, _ = trainer.train(epoch, train_loader)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/trains/base_trainer.py", line 119, in train
    return self.run_epoch('train', epoch, data_loader)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/trains/base_trainer.py", line 71, in run_epoch
    output, loss, loss_stats = model_with_loss(batch)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/trains/base_trainer.py", line 19, in forward
    outputs = self.model(batch['input'])
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/models/networks/pose_dla_dcn.py", line 472, in forward
    x = self.dla_up(x)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/models/networks/pose_dla_dcn.py", line 411, in forward
    ida(layers, len(layers) -i - 2, len(layers))
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/Desktop/research/repositories/FairMOT/src/lib/models/networks/pose_dla_dcn.py", line 384, in forward
    layers[i] = upsample(project(layers[i]))
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/canzone/anaconda3/envs/FairMOT/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 927, in forward
    return F.conv_transpose2d(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

Does anyone have a solution for this issue? Thanks!

tdchua, Oct 06 '21 07:10