OS level crashes when using V3 or V4 and Titan X

Open jonathantompson opened this issue 8 years ago • 15 comments

When cudnn.benchmark == true and I FPROP a large network, the cudnn benchmark routine crashes my OS and Ubuntu just restarts... No warning, no BSOD, it just restarts.

This does not seem to happen on a Titan Z, just the newer architecture.

Has anyone else seen this? I thought upgrading to cudnn V4 would fix it. At the moment this is blocking development and I can't seem to find a solution.
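
For what it's worth, here's a minimal sketch of the kind of call pattern that triggers it (the layer sizes are made up for illustration, not my actual network):

```lua
-- Hypothetical repro sketch: conv net FPROP with cudnn benchmarking enabled.
require 'cunn'
require 'cudnn'

cudnn.benchmark = true  -- the crash only happens with this set

-- Made-up network; the real one is much larger.
local net = nn.Sequential()
net:add(cudnn.SpatialConvolution(3, 256, 7, 7, 1, 1, 3, 3))
net:add(cudnn.ReLU(true))
net:add(cudnn.SpatialConvolution(256, 512, 5, 5, 1, 1, 2, 2))
net:add(cudnn.ReLU(true))
net:cuda()

local input = torch.CudaTensor(16, 3, 224, 224):normal()
local output = net:forward(input)  -- the machine hard-reboots during this first benchmarked FPROP
```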

jonathantompson avatar Mar 29 '16 01:03 jonathantompson

This seems like a driver-level, thermal, or power-supply issue. Can you try limiting the wattage of the card via "nvidia-smi -pl " and see if it still happens?

soumith avatar Mar 29 '16 01:03 soumith

Also, check the kernel logs in /var/log and see what happened right before the restart; that prob provides hints.

soumith avatar Mar 29 '16 01:03 soumith

Nothing in kern.log... Any other logs I should look in?

nvidia-smi -i 0 -pl 275 is the max value I can do... Really don't think it's a hardware issue. As soon as I disable benchmark I no longer get this crash. It really seems like a bug in the cudnn library...

jonathantompson avatar Mar 29 '16 01:03 jonathantompson

I think you should try the min value rather than the max value. Also, what's the hardware, and can you provide a small test case? I'll try to repro on some hardware on my side and/or send it to NVIDIA.

soumith avatar Mar 29 '16 01:03 soumith

OK, I'll put together a stand-alone script tomorrow... I can't share the code directly for obvious reasons :-)

FYI: min-value didn't work either.

jonathantompson avatar Mar 29 '16 01:03 jonathantompson

Are you saying that you can't share Google's entire repository with the world to reproduce a bug? :p

soumith avatar Mar 29 '16 01:03 soumith

Yeah, that might get me into just a tiny bit of trouble :-)

Alternatively, if you set cudnn.benchmark = false, what's the expected performance penalty? I'd profile myself except benchmark = true crashes... If benchmark == false, can you still use cudnn.fastest == true or does that perform profiling as well?

jonathantompson avatar Mar 29 '16 01:03 jonathantompson

cudnn.fastest uses inbuilt heuristics to determine the fastest algorithm. These heuristics aren't very good if you deviate from imagenet-style networks.

soumith avatar Mar 29 '16 02:03 soumith

OK, but then what does it do when fastest == false and benchmark == false? Does it just choose the slowest implementation? ;-) I would have thought one was a fallback for the other, so I don't know why there are 2 booleans, which results in 4 modes, whereas the cudnn docs say there are "two globally available modes"; hence my confusion.
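
To spell out the four combinations I mean (question marks where I'm guessing):

```lua
-- The 2 booleans give 4 global modes; my understanding so far (? = guessing):
cudnn.benchmark = true;  cudnn.fastest = false  -- exhaustive profiling; this is what crashes my box
cudnn.benchmark = true;  cudnn.fastest = true   -- presumably also profiles? (haven't tried)
cudnn.benchmark = false; cudnn.fastest = true   -- heuristic pick of the "fastest" algo, no profiling?
cudnn.benchmark = false; cudnn.fastest = false  -- the default; picks... what, exactly?
```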

jonathantompson avatar Mar 29 '16 02:03 jonathantompson

fastest = false has a limit on the workspace size, I think. Let me check.

soumith avatar Mar 29 '16 02:03 soumith

Yeah, so the default is to choose an algo without too much workspace: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L170-L172

I can make fastest = true by default tbh, because now we use workspaces shared across all layers -- so this doesn't add a lot to memory consumption.
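
Roughly, those lines boil down to something like this (approximate paraphrase, not the exact code; go by the linked file):

```lua
require 'cudnn'

-- Approximate sketch of the forward-algorithm selection in SpatialConvolution.lua
-- (variable names and the exact workspace cap may differ from the real file):
local function chooseForwardAlgoSearchMode(conv)
   -- default: ask cuDNN for an algorithm whose workspace stays under a modest cap
   local searchMode = 'CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT'
   local workspaceLimit = conv.nInputPlane * conv.kH * conv.kW * 4  -- bytes; small by design

   if cudnn.fastest then
      -- no cap: take whatever cuDNN's heuristics consider fastest, regardless of workspace size
      searchMode = 'CUDNN_CONVOLUTION_FWD_PREFER_FASTEST'
   end

   -- cudnn.benchmark is handled separately: it bypasses the heuristics entirely and
   -- times every algorithm on the real layer sizes (the step that crashes above).
   return searchMode, workspaceLimit
end
```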

soumith avatar Mar 29 '16 02:03 soumith

I'm having this same problem, using a Docker image (here) with various large networks distributed in series across 2 or 3 Pascal Titan Xs.

My observations:

• Without cudnn: works fine, takes a while to train (~500s/epoch).
• With cudnn, cudnn.benchmark = true (with both cudnn.fastest = {true,false}): trains fast (~120s/epoch), but after 0 to a few iterations, hard OS reboot. No warnings.
• cudnn.benchmark = true, nvidia-smi --power-limit=200 (down from 250W): hard OS reboot. No warnings.
• cudnn.benchmark = false: trains medium-fast (~350s/epoch). Seems to work fine.
• cudnn.benchmark = false, cudnn.fastest = true: hard OS reboot. No warnings.

gregjohnso avatar Feb 15 '17 19:02 gregjohnso

Hard OS reboots are likely due to thermal or other overheating issues. You prob only see this happen with cudnn because it pushes the GPU harder than without cudnn.

soumith avatar Feb 15 '17 19:02 soumith

Attached is a screenshot of "watch nvidia-smi" at the time of a crash. The temps are all within normal range.

[Screenshot: "watch nvidia-smi" output at the time of the crash (screen shot 2017-02-15 at 12:03 PM)]

gregjohnso avatar Feb 15 '17 20:02 gregjohnso

Temp is one factor. You might also be getting dips on your 12V rail due to a low-quality PSU, or if you've connected all your PCIe power connectors to the same 12V rail. You can check it with an oscilloscope.

With that said, I was seeing this with a high-quality PSU, but only on a Titan Z card. I think there's something fundamentally broken with Titan Z + Nvidia drivers + CUDA.

jonathantompson avatar Feb 15 '17 20:02 jonathantompson