cudnn.torch
OS level crashes when using V3 or V4 and Titan X
When cudnn.benchmark == true and I FPROP a large network, the cudnn benchmark routine crashes my OS and Ubuntu just restarts... No warning, no BSOD, it just restarts.
This does not seem to happen on a Titan Z, just the newer architecture.
Has anyone else seen this? I thought upgrading to cudnn V4 would fix it. At the moment this is blocking development and I can't seem to find a solution.
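For context, here is a minimal sketch of the kind of setup that triggers it (the network below is a made-up stand-in, not the actual model):

```lua
require 'nn'
require 'cunn'
require 'cudnn'

cudnn.benchmark = true   -- exhaustive auto-tuning runs on the first forward pass
cudnn.fastest = true

-- stand-in "large" network; the real model is much bigger
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 512, 7, 7, 1, 1, 3, 3))
net:add(nn.ReLU(true))
net:add(nn.SpatialConvolution(512, 512, 3, 3, 1, 1, 1, 1))
net = cudnn.convert(net:cuda(), cudnn)

local input = torch.CudaTensor(32, 3, 224, 224):normal()
local out = net:forward(input)  -- first FPROP is where the per-layer benchmarking happens
```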
This seems like a driver / thermal (or power supply) issue. Can you try to limit the wattage of the card via "nvidia-smi -pl <watts>" and see if it still happens?
Also, check the kernel logs in /var/log and see what happened right before the restart; that probably provides hints.
Nothing in kern.log... Any other logs I should look in?
nvidia-smi -i 0 -pl 275 is the max value I can do... I really don't think it's a hardware issue. As soon as I disable benchmark I no longer get this crash. It really seems like a bug in the cudnn library...
I think you should do min-value rather than max-value. Also, what's the hardware and can you provide a small test case? I'll try to repro on some hardware on my side and/or send it to NVIDIA
OK, I'll put together a stand-alone script tomorrow... I can't share the code directly for obvious reasons :-)
FYI: min-value didn't work either.
Are you saying that you can't share Google's entire repository with the world to reproduce a bug? :p
Yeah, that might get me into just a tiny bit of trouble :-)
Alternatively, if you set cudnn.benchmark = false, what's the expected performance penalty? I'd profile myself except benchmark = true crashes... If benchmark == false, can you still use cudnn.fastest == true or does that perform profiling as well?
cudnn.fastest uses inbuilt heuristics to determine the fastest algorithm. These heuristics aren't very good if you deviate from imagenet-style networks.
OK, but then what does it do when fastest == false and benchmark == false? Choose the slowest implementation? ;-) I'm just confused: I would have thought one was a fallback for the other, so I don't see why there are 2 booleans, which gives 4 modes, whereas the cudnn docs say there are "two globally available modes". Hence my confusion.
fastest = false has a limit on the workspace size, I think. Let me check.
Yeah, so the default is to choose an algo without too much workspace: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L170-L172
I can make fastest = true by default tbh, because now we use workspaces shared across all layers -- so this doesn't add a lot to memory consumption.
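Roughly, as I understand it, the four combinations come down to this (a sketch, not verbatim library code; the workspace-limit default is the one in the SpatialConvolution.lua lines linked above):

```lua
require 'cudnn'

-- benchmark = false, fastest = false (the default):
--   heuristic algorithm choice, constrained to a modest per-layer workspace
cudnn.benchmark = false; cudnn.fastest = false

-- benchmark = false, fastest = true:
--   heuristic choice, but free to prefer the fastest algorithm
--   regardless of how much workspace it needs
cudnn.benchmark = false; cudnn.fastest = true

-- benchmark = true (fastest shouldn't matter here):
--   exhaustively times every applicable algorithm on the first pass
--   through each layer and caches the winner
cudnn.benchmark = true
```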
I'm having this same problem, using a Docker image (here) with various large networks distributed in series across 2 or 3 Pascal Titan Xs.
My observations:
- Without cudnn: Works fine, takes a while to train (~500s/epoch).
- With cudnn:
  - cudnn.benchmark = true (with both cudnn.fastest = {true,false}): Trains fast (~120s/epoch), but after zero to a few iterations, hard OS reboot. No warnings.
  - cudnn.benchmark = true, nvidia-smi --power-limit=200 (down from 250W): Hard OS reboot. No warnings.
  - cudnn.benchmark = false: Trains medium fast (~350s/epoch). Seems to work fine.
  - cudnn.benchmark = false, cudnn.fastest = true: Hard OS reboot. No warnings.
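So for now I'm sticking with the default flags, which in the training script looks roughly like this (a sketch; the model here is just a placeholder, not the actual networks):

```lua
require 'nn'
require 'cunn'
require 'cudnn'

cudnn.benchmark = false   -- avoid the auto-tuner that precedes the reboot
cudnn.fastest = false     -- stay with the workspace-limited heuristic

-- placeholder model; the real setup is several large networks split across GPUs
local model = nn.Sequential()
   :add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
   :add(nn.ReLU(true))
model = cudnn.convert(model:cuda(), cudnn)
```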
Hard OS reboots are likely because of thermal or other overheating issues. You probably only see this happen with cudnn because it pushes the GPU harder than without cudnn.
Attached is a screenshot of "watch nvidia-smi" at the time of a crash (https://cloud.githubusercontent.com/assets/17319655/22993148/ccbb48be-f376-11e6-81e6-847f1aac115d.png). The temps are all within normal range.
Temp is one factor. You might also be getting dips on your 12V rail due to a low quality PSU, or if you've connected all your PCIe power connectors to the same 12V rail. You can check it with an oscilloscope.
With that said, I was seeing this with a high quality PSU, but only on a Titan Z card. I think there's something fundamentally broken with Titan Z + Nvidia drivers + CUDA.