Detectron.pytorch

Could not train on v100

Open qinhaifangpku opened this issue 6 years ago • 4 comments

System information

  • Operating system: Linux
  • CUDA version: 9.0
  • cuDNN version: ?
  • GPU models (for all devices if they are not all the same): ?
  • python version: 3.5
  • pytorch version: ?
  • Anything else that seems relevant: ?

Hi, thanks for sharing this great work! I want to train Mask R-CNN on a V100, but I get the following error:

    cudaCheckError() failed : no kernel image is available for execution on the device
    THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=29 : driver shutting down
    Segmentation fault
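A "no kernel image is available" message usually means the compiled CUDA extensions contain no code for the architecture of the GPU in use. As a rough diagnostic sketch, the snippet below prints the architecture name the first device reports (a V100 reports compute capability 7.0, i.e. sm_70). The helper names `arch_name` and `current_device_arch` are made up for illustration, and the snippet assumes a PyTorch build that provides `torch.cuda.get_device_capability`.

```python
def arch_name(major, minor):
    """Format a compute capability such as (7, 0) as an nvcc-style name, 'sm_70'."""
    return "sm_{}{}".format(major, minor)


def current_device_arch():
    """Return the arch name of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper also loads without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return arch_name(major, minor)


if __name__ == "__main__":
    print("GPU 0 architecture:", current_device_arch())
```

If the printed architecture is not among the ones the extensions were compiled for, the extensions need to be rebuilt with that architecture added.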

Can anyone help me out of this problem?

Thank you in advance!

qinhaifangpku avatar Sep 20 '18 02:09 qinhaifangpku

Please follow these instructions in the compilation section:

If you are using Volta GPUs, uncomment this line in lib/make.sh and remember to append a backslash to the line above. CUDA_PATH defaults to /usr/local/cuda. If you want to use a CUDA library on a different path, change this line accordingly.
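To illustrate why the extra Volta line matters, here is a small sketch that checks whether a set of nvcc `-gencode` flags can serve a device with a given compute capability. The two flag strings are example values chosen for illustration, not copied from the project's build script, and `parse_gencode` and `can_run` are hypothetical helpers. The rule encoded is the usual CUDA one: a binary built for sm_XY runs on devices with the same major version and an equal or higher minor version, while embedded PTX (`code=compute_XY`) can be JIT-compiled for any equal or newer device.

```python
import re

# Example flag lists (assumed values, for illustration only).
PASCAL_ONLY = ("-gencode arch=compute_60,code=sm_60 "
               "-gencode arch=compute_61,code=sm_61")
WITH_VOLTA = PASCAL_ONLY + " -gencode arch=compute_70,code=sm_70"


def parse_gencode(flags):
    """Return (kind, version) pairs, e.g. ('sm', 61), for every code= target."""
    targets = []
    for code in re.findall(r"code=([\w,\[\]]+)", flags):
        for item in code.strip("[]").split(","):
            kind, _, version = item.partition("_")
            targets.append((kind, int(version)))
    return targets


def can_run(flags, major, minor):
    """True if code built with `flags` has a usable kernel image for the device."""
    device = major * 10 + minor
    for kind, version in parse_gencode(flags):
        # Binary code: same major version, built for an equal or older minor.
        if kind == "sm" and version // 10 == major and version <= device:
            return True
        # PTX: the driver can JIT-compile it for any equal or newer device.
        if kind == "compute" and version <= device:
            return True
    return False


if __name__ == "__main__":
    print("Pascal-only build on V100:", can_run(PASCAL_ONLY, 7, 0))
    print("Build with sm_70 on V100: ", can_run(WITH_VOLTA, 7, 0))
```

Under these assumptions, a build that stops at sm_61 has nothing a V100 can execute, which matches the error reported above.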

santoshmo avatar Sep 29 '18 18:09 santoshmo

Yes, I did as you suggested, but training still fails after some iterations. The GPU memory is not released, but GPU utilization stays at 0%.

qinhaifangpku avatar Sep 30 '18 05:09 qinhaifangpku

@qinhaifangpku Could you update your NVIDIA driver?

shinya7y avatar Oct 15 '18 17:10 shinya7y

@qinhaifangpku I'm facing the same problem when training Mask R-CNN on a V100. It has never appeared on GPUs of other architectures, such as the 1080 Ti. Judging from the messages shown after Ctrl-C, I think there is a deadlock when using Volta GPUs. The project does seem to have run into deadlock problems before; there are traces of this in the code, such as https://github.com/roytseng-tw/Detectron.pytorch/blob/8315af319cd29b8884a7c0382c4700a96bf35bbc/tools/train_net_step.py#L18. I have not figured out how to solve the problem fundamentally. Switching from a Volta GPU to a Pascal GPU is the only workaround I know. Hope this helps.
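For anyone trying to narrow down where such a hang occurs, one possible approach is a watchdog built on Python's standard `faulthandler` module, which can dump the Python stack of every thread without waiting for Ctrl-C. This is only a sketch: the helper names and the timeout are assumptions, it is not part of Detectron.pytorch, and it shows Python-level stacks only, not what the GPU or CUDA runtime is doing.

```python
import faulthandler
import sys


def dump_all_stacks(stream=None):
    """Write the Python stack of every thread to `stream` (a real file object)."""
    faulthandler.dump_traceback(file=stream or sys.stderr, all_threads=True)


def arm_watchdog(timeout_s, stream=None):
    """Dump all stacks unless re-armed or disarmed within `timeout_s` seconds.

    Calling this again replaces the previous timer, so calling it once per
    training iteration makes it fire only when an iteration stalls.
    """
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=stream or sys.stderr)


def disarm_watchdog():
    """Cancel a pending watchdog dump, e.g. after training finishes."""
    faulthandler.cancel_dump_traceback_later()
```

In a training loop this could be used by calling `arm_watchdog(600)` at the start of each iteration and `disarm_watchdog()` after the loop; if an iteration stalls for ten minutes, the stacks printed to stderr show which thread is blocked and where.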

Gasoonjia avatar Dec 08 '18 14:12 Gasoonjia