cudnn.torch
OS level crashes when using V3 or V4 and Titan X
When cudnn.benchmark == true and I FPROP a large network, the cudnn benchmark routine crashes my OS and Ubuntu just restarts... No warning, no BSOD, it just restarts.
This does not seem to happen on a Titan Z, just the newer architecture.
Has anyone else seen this? I thought upgrading to cudnn V4 would fix it. At the moment this is blocking development and I can't seem to find a solution.
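For context, here is a minimal sketch of the kind of setup that triggers it (the network below is a made-up stand-in, not the actual model):

```lua
require 'nn'
require 'cunn'
require 'cudnn'

cudnn.benchmark = true   -- exhaustive auto-tuning runs on the first forward pass
cudnn.fastest = true

-- stand-in "large" network; the real model is much bigger
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 512, 7, 7, 1, 1, 3, 3))
net:add(nn.ReLU(true))
net:add(nn.SpatialConvolution(512, 512, 3, 3, 1, 1, 1, 1))
net = cudnn.convert(net:cuda(), cudnn)

local input = torch.CudaTensor(32, 3, 224, 224):normal()
local out = net:forward(input)  -- first FPROP is where the per-layer benchmarking happens
```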
This seems like a driver / thermal (or power supply) issue. Can you try to limit the wattage of the card via "nvidia-smi -pl <watts>" and see if it still happens?
Also, check the kernel logs in /var/log and see what happened right before the restart; that probably provides hints.
Nothing in kern.log... Any other logs I should look in?
nvidia-smi -i 0 -pl 275 is the max value I can do... I really don't think it's a hardware issue. As soon as I disable benchmark I no longer get this crash. It really seems like a bug in the cudnn library...
I think you should do min-value rather than max-value. Also, what's the hardware and can you provide a small test case? I'll try to repro on some hardware on my side and/or send it to NVIDIA
OK, I'll put together a stand-alone script tomorrow... I can't share the code directly for obvious reasons :-)
FYI: min-value didn't work either.
Are you saying that you can't share Google's entire repository with the world to reproduce a bug? :p
Yeah, that might get me into just a tiny bit of trouble :-)
Alternatively, if you set cudnn.benchmark = false, what's the expected performance penalty? I'd profile myself except benchmark = true crashes... If benchmark == false, can you still use cudnn.fastest == true or does that perform profiling as well?
cudnn.fastest uses inbuilt heuristics to determine the fastest algorithm. These heuristics aren't very good if you deviate from imagenet-style networks.
OK, but then what does it do when fastest == false and benchmark == false? Choose the slowest implementation? ;-) I'm just confused: I would have thought one was a fallback for the other, so I don't see why there are 2 booleans, which gives 4 modes, whereas the cudnn docs say there are "two globally available modes". Hence my confusion.
fastest = false has a limit on the workspace size, I think. Let me check.
Yeah, so the default is to choose an algo without too much workspace: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L170-L172
I can make fastest = true by default tbh, because now we use workspaces shared across all layers -- so this doesn't add a lot to memory consumption.
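Roughly, as I understand it, the four combinations come down to this (a sketch, not verbatim library code; the workspace-limit default is the one in the SpatialConvolution.lua lines linked above):

```lua
require 'cudnn'

-- benchmark = false, fastest = false (the default):
--   heuristic algorithm choice, constrained to a modest per-layer workspace
cudnn.benchmark = false; cudnn.fastest = false

-- benchmark = false, fastest = true:
--   heuristic choice, but free to prefer the fastest algorithm
--   regardless of how much workspace it needs
cudnn.benchmark = false; cudnn.fastest = true

-- benchmark = true (fastest shouldn't matter here):
--   exhaustively times every applicable algorithm on the first pass
--   through each layer and caches the winner
cudnn.benchmark = true
```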
I'm having this same problem, using a Docker image (here) with various large networks distributed in series across 2 or 3 Pascal Titan Xs.
My observations:
- Without cudnn: Works fine, takes a while to train (~500s/epoch).
- With cudnn:
  - cudnn.benchmark = true (with both cudnn.fastest = {true,false}): Trains fast (~120s/epoch), but after zero to a few iterations, hard OS reboot. No warnings.
  - cudnn.benchmark = true, nvidia-smi --power-limit=200 (down from 250W): Hard OS reboot. No warnings.
  - cudnn.benchmark = false: Trains medium fast (~350s/epoch). Seems to work fine.
  - cudnn.benchmark = false, cudnn.fastest = true: Hard OS reboot. No warnings.
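So for now I'm sticking with the default flags, which in the training script looks roughly like this (a sketch; the model here is just a placeholder, not the actual networks):

```lua
require 'nn'
require 'cunn'
require 'cudnn'

cudnn.benchmark = false   -- avoid the auto-tuner that precedes the reboot
cudnn.fastest = false     -- stay with the workspace-limited heuristic

-- placeholder model; the real setup is several large networks split across GPUs
local model = nn.Sequential()
   :add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
   :add(nn.ReLU(true))
model = cudnn.convert(model:cuda(), cudnn)
```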
Hard OS reboots are likely because of thermal or other overheating issues. You probably only see this happen with cudnn because it pushes the GPU harder than without cudnn.
Attached is a screenshot of "watch nvidia-smi" at the time of a crash (https://cloud.githubusercontent.com/assets/17319655/22993148/ccbb48be-f376-11e6-81e6-847f1aac115d.png). The temps are all within normal range.
Temp is one factor. You might also be getting dips on your 12V rail due to a low quality PSU, or if you've connected all your PCIe power connectors to the same 12V rail. You can check it with an oscilloscope.
With that said, I was seeing this with a high quality PSU, but only on a Titan Z card. I think there's something fundamentally broken with Titan Z + Nvidia drivers + CUDA.