Detectron.pytorch
Could not train on V100
System information
- Operating system: Linux
- CUDA version: 9.0
- cuDNN version: ?
- GPU models (for all devices if they are not all the same): V100
- Python version: 3.5
- PyTorch version: ?
- Anything else that seems relevant: ?
Hi, thanks for sharing this great work! I want to train Mask R-CNN on a V100, but training fails with the following error:

cudaCheckError() failed : no kernel image is available for execution on the device
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=29 : driver shutting down
Segmentation fault

Can anyone help me with this problem?
Thank you in advance!
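For reference, "no kernel image is available for execution on the device" usually means the CUDA code was compiled without the running GPU's architecture; a V100 has compute capability 7.0 and needs sm_70 kernels. A quick environment check, assuming a standard PyTorch install (this is not taken from the thread, and it inspects PyTorch itself rather than the custom extensions built by lib/make.sh):

```python
import torch

# Print the toolchain and device info relevant to the "no kernel image" error.
print("PyTorch:", torch.__version__)
print("CUDA used to build PyTorch:", torch.version.cuda)
print("Device:", torch.cuda.get_device_name(0))
# A V100 should report (7, 0); the compiled extensions must include sm_70 kernels.
print("Compute capability:", torch.cuda.get_device_capability(0))
```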
Please follow these instructions in the compilation section:
If you are using Volta GPUs, uncomment this line in lib/make.sh and remember to append a backslash to the line above. CUDA_PATH defaults to /usr/local/cuda. If you want to use a CUDA library on a different path, change this line accordingly.
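Concretely, the change adds the Volta architecture (compute capability 7.0) to the nvcc -gencode flags that lib/make.sh passes when compiling the CUDA extensions. A rough sketch of the relevant CUDA_ARCH lines after the edit (the exact list of architectures in your copy of make.sh may differ):

```bash
# lib/make.sh (sketch): build kernels for sm_70 so they can run on a V100.
# The previously-last line now ends with a backslash instead of the closing quote.
CUDA_ARCH="-gencode arch=compute_52,code=sm_52 \
           -gencode arch=compute_60,code=sm_60 \
           -gencode arch=compute_61,code=sm_61 \
           -gencode arch=compute_70,code=sm_70"
```

After editing, re-run the compilation step from the README so the extensions are rebuilt with the new flags; stale .so files from an earlier build can hide the change.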
Yeah, I did as you suggested, but training still fails after some iterations: the GPU memory is not released and GPU utilization drops to 0%.
@qinhaifangpku Could you update your NVIDIA driver?
@qinhaifangpku I'm facing the same problem as you when training Mask R-CNN on a V100. The problem has never appeared with other frameworks or with other GPUs such as the 1080 Ti. From the messages printed after Ctrl-C, I think this is a deadlock that only shows up on Volta GPUs, and the project does already work around some deadlock issues; you can find traces of this in the code, e.g. https://github.com/roytseng-tw/Detectron.pytorch/blob/8315af319cd29b8884a7c0382c4700a96bf35bbc/tools/train_net_step.py#L18. I haven't figured out how to solve the problem fundamentally; switching from a Volta GPU to a Pascal GPU is the only workaround I know. Hope this helps.
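For completeness, the kinds of mitigations people commonly try for DataLoader-related hangs look like the sketch below. These are general community workarounds, not a confirmed fix for the V100 hang in this project, and the dummy dataset is only there to make the snippet self-contained:

```python
import cv2
import torch
from torch.utils.data import DataLoader, TensorDataset

# OpenCV's internal thread pool can deadlock inside forked dataloader workers,
# so it is often disabled before any workers are spawned.
cv2.setNumThreads(0)

# Loading data in the main process (num_workers=0) removes the worker processes
# from the picture entirely, at the cost of a slower input pipeline.
dataset = TensorDataset(torch.zeros(8, 3, 32, 32))  # placeholder data, illustration only
loader = DataLoader(dataset, batch_size=2, num_workers=0)
for (batch,) in loader:
    pass  # the training step would go here
```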