convnet-benchmarks
Titan X thermal behavior might cause performance fluctuation
TL;DR: when you get hot, you run slowly. And it's a bit hard to predict when you get hot enough...
Now the long version: in the last few days we were confused by fluctuations we observed when running exactly the same benchmark code, so I think it'll probably be worth sharing here.
Basically, when you run a benchmark long enough to make the GPU really hot, the Titan X seems to run at full speed for a while and then throttles down significantly to keep the temperature stable at 84 degrees. You can observe this behavior by running nvidia-smi continuously and watching the number of watts it draws. (If you are interested: on two of our Titan Xs, we did not observe the fan going to 100% before the power throttled down. We did not force the GPU to run in P0.)
The implication is that a long enough burn-in period is necessary to get a stable speed number. If the burn-in is not long enough, a lot of factors may affect the final reported speed: (1) how many iterations you run (later iterations may run increasingly slowly); (2) whether you ran something immediately before the benchmark (so the GPU has not cooled down yet); (3) whether you are in Reykjavik (we did not test this).
Empirically, it seems that to get a stable number, a burn-in period as long as a few hundred iterations and/or tens of seconds is necessary, which is often longer than one would expect. For example, in Caffe I only did a one-iteration burn-in, so that the framework could make all its memory allocations.
We have observed a fluctuation of about 10% between a cold GPU and a hot GPU - maybe non-trivial enough to be careful about. Just my little observation that you might find useful.
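To make the burn-in point concrete, here is a minimal sketch of how one might structure the timing loop (Python; `run_iteration` is a hypothetical callable that performs one forward/backward pass and synchronizes the GPU before returning, not any particular framework's API):

```python
import time

def benchmark(run_iteration, burnin_iters=200, burnin_secs=30, timed_iters=100):
    """Time a kernel only after the GPU has reached its hot, throttled steady state."""
    # Burn-in: run for at least `burnin_iters` iterations AND `burnin_secs` seconds,
    # whichever takes longer, so the timed section reflects steady-state clocks.
    start, it = time.time(), 0
    while it < burnin_iters or time.time() - start < burnin_secs:
        run_iteration()
        it += 1

    # Timed section.
    start = time.time()
    for _ in range(timed_iters):
        run_iteration()
    return (time.time() - start) / timed_iters  # average seconds per iteration
```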
If you're looking to benchmark the performance of a long-running kernel, turning off the boost clock will give you the most accurate results:
sudo nvidia-smi -i 1 --auto-boost-default=0
You can also adjust clocks more directly with:
sudo nvidia-smi -i 1 -ac 3505,1392
sudo nvidia-smi -i 1 -ac 3505,1000
These let you run CUDA at the full memory clock, with or without the boosted graphics clock, but I'm not sure if that's wise without ECC.
But benchmarking with autoboost enabled is still useful, as you can see exactly where the kernel becomes power limited. The factor that matters most for the power limit is the amount of DDR access, so the more you can keep data in L2 or below, the less power you'll draw (and the easier the chip will be to cool).
I like to set the power limit at 275 with full clocks and benchmark while running this:
sudo nvidia-smi -i 1 -pl 275
nvidia-smi -i 1 --loop-ms=333 --format=csv,noheader --query-gpu=power.draw,clocks.gr,temperature.gpu,fan.speed,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown
This gives me a really good sense of the power profile of the kernel.
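If you'd rather not eyeball that loop output, here is a rough sketch of automating the "wait until the clock settles" step by parsing the same kind of nvidia-smi query (this assumes nvidia-smi is on the PATH and your driver exposes these fields; the stability window and tolerance are just guesses):

```python
import subprocess
import time

def query_gpu(index=1):
    """Return (power_draw_W, graphics_clock_MHz, temperature_C) for one GPU."""
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(index),
        "--format=csv,noheader,nounits",
        "--query-gpu=power.draw,clocks.gr,temperature.gpu",
    ]).decode()
    power, clock, temp = (float(x) for x in out.strip().split(", "))
    return power, clock, temp

def wait_for_stable_clock(index=1, window=10, tol_mhz=15, poll_s=1.0):
    """Block until the graphics clock has moved by no more than `tol_mhz`
    over the last `window` samples -- a crude steady-state heuristic."""
    clocks = []
    while True:
        _, clock, _ = query_gpu(index)
        clocks.append(clock)
        recent = clocks[-window:]
        if len(recent) == window and max(recent) - min(recent) <= tol_mhz:
            return clock
        time.sleep(poll_s)
```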
I like to set the power limit at 275
But Titan-X doesn't allow setting power limits (275W is the default). The only way I found to meaningfully benchmark Titan-X is to watch the temperature and, if it reaches 84 degrees, crank the fan up manually (through a GUI; the command line has a bug and doesn't work), since the stock BIOS inexplicably won't do it. A hacked BIOS is probably the way to go for multiple Titan-X cards in servers.
Overclocking my Titan-X, I was able to get 10-14% faster timings than those reported by Soumith. Without overclocking, I reproduce his timings to within 1%.
BTW, see commit https://github.com/soumith/convnet-benchmarks/commit/d6177f97e61da0d98a528f355086eb2fc05fe7b8 for how Soumith does warmup for both Nervana and cuDNN, and its effect.
The thermal behavior on these GPUs is very very interesting.
While benchmarking, I constantly monitor the card to make sure it is at a stable clock rate (and the same clock throughout).
Some more interesting things that one would want to know:
- Nervana's Neon kernels can't sustain the boost clock over training time. They actually clock down if you run them for long enough. These kernels push the GPU to an absolute extreme.
- cuDNN kernels don't push the GPUs this hard overall. They went for lower power draw + FFT instead, to get the same performance.
- There seem to be power and speed optimizations for special cases of zero. If you send in a uniformly distributed input, it will run slightly slower than an all-zero input. @scott-gray observed the same. This is quite interesting, especially in the context of ReLU nets.
Also, a quote from @scott-gray while we were discussing the benchmarks in an email thread (I wanted to make sure I was doing things right).
The GPU has a very active power sensor and dynamically changes the clock depending on power draw (independent of temperature). This happens on the millisecond time scale. My fprop and bprop kernels run at 7.2 TFlops when the input is all ones (or any other low-entropy data). Switching to random data they top out at 6.6 TFlops or so. One of the reasons that fp16 is faster (aside from reduced bandwidth) is that after converting to fp32 for compute, only 10 bits of the mantissa are populated. You can compare the difference if you truncate the inputs to fp16 and then convert back to fp32 prior to sending the data to the fp32 kernels.
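The truncation comparison Scott mentions at the end is easy to set up on the input side; a minimal numpy sketch (the commented-out `run_kernel` is a hypothetical stand-in for timing your actual fp32 conv kernel):

```python
import numpy as np

# Random fp32 input in NCHW layout (sizes here are just an example).
x = np.random.randn(128, 3, 224, 224).astype(np.float32)

# Round-trip through fp16: still an fp32 array, but only the top 10 mantissa
# bits are populated, which is the condition Scott describes above.
x_trunc = x.astype(np.float16).astype(np.float32)

# run_kernel(x); run_kernel(x_trunc)  # hypothetical: time the same fp32 kernel on both
```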
There seem to be power and speed optimizations depending on special cases of zero.
Very interesting. This must be in hardware. I wonder if the hardware automatically uses less power when many bits are zero, or whether it's an explicit hardware optimization. Maybe all mantissa bits must be zero for that?
My fprop and bprop kernels run at 7.2 TFlops when the input is all 1 ones (or any other low entropy data).
Really, any other low-entropy data? Or data with lots of (or even all) zero bits? What if all bits are 1? Maybe it'll be even slower than random?
Switching to random data they top out at 6.6 Tflops or so.
FWIW, the following shows a data-dependent (all-zero vs all-one) difference of about 7% in power consumption for integer matrix multiplication on AVR. Titan-X is likely to show the same effect, IMO, even without explicit hardware optimizations, if those exist at all.
"Data dependent energy modelling for worst case energy consumption analysis" (2015) Pallister et al http://arxiv.org/abs/1505.03374
My TitanX defaults to a power limit of 250. For power draw it's the toggling of bits that matters (particularly over long wires). So all ones will be almost as fast as all zeros. It's more random data patterns that draw the most power.
One thing I want to point out is that I'll have a completely new set of kernels out soonish, and these do a much better job of keeping data in L2 and using larger tiles when possible. This keeps the power levels significantly lower allowing the clock to run at full boost. I'll also have everything working at small minibatches across the board. This should make them much easier to scale with multiple gpus.
My TitanX defaults to a power limit of 250. For power draw it's the toggling of bits that matters (particularly over long wires). So all ones will be almost as fast as all zeros. It's more random data patterns that draw the most power.
You mean, because all 1s, or all 0s, will basically just run DC along the cables, but alternating 1s and 0s will start radiating EM radiation?
I'm not an expert on these matters, but it's clear that the more things are toggling on the chip, the more power it draws. It's also possible that additional power is saved when an all-zero condition is met. Portions of the logic might be dynamically disabled. No idea if the GPU does this.
One thing I want to point out is that I'll have a completely new set of kernels out soonish, and these do a much better job of keeping data in L2 and using larger tiles when possible. This keeps the power levels significantly lower allowing the clock to run at full boost. I'll also have everything working at small minibatches across the board. This should make them much easier to scale with multiple gpus.
Super awesome.
My TitanX defaults to a power limit of 250.
Oops, sorry, my mistake. 250W is the default, and it can be raised to a max of 275W. Here is how I overclock my Titan-X:
# increase "application clock" and power
nvidia-smi -i 0 -pm 1
nvidia-smi -i 0 -ac 3505,1392 --power-limit=275

# enable coolbits
user@dnn1:~$ cat /etc/X11/xorg.conf
[…]
Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    BusID "PCI:1@0:0:0"
    Option "ConstrainCursor" "off"
    Option "Coolbits" "31"
EndSection
[…]

# set PowerMizer mode to "Prefer Maximum Performance"
DISPLAY=:0 nvidia-settings -a [gpu:0]/GPUPowerMizerMode=1

# overvolt
DISPLAY=:0 nvidia-settings -a [gpu:0]/GPUOverVoltageOffset=112399

# from the nvidia-settings GUI (can't do it from the command line),
# set "Graphics Clock Offset" to 300 MHz and "Memory Transfer Rate Offset" to 800
Thanks Scott Gray and Soumith for sharing the info. It is very interesting H/W behavior.
Junli Gu (谷俊丽), Coordinated Science Lab, University of Illinois at Urbana-Champaign
It's clear that the more things are toggling on the chip, the more power it draws. It's also possible that additional power is saved when an all-zero condition is met. Portions of the logic might be dynamically disabled. No idea if the GPU does this.
I don't know what NVIDIA does, but a chip can indeed detect the 0.0, 1.0, etc. cases and turn portions off, as described here: https://en.wikipedia.org/wiki/Clock_gating
Presumably we need to water-cool the GPUs so they can dissipate 250-300 watts under continuous operation? What is common practice for doing this?
In a workstation, it's a standard thing you can find in the gamers' forums and magazines, for example, see http://www.maximumpc.com/a-beginners-guide-to-liquid-cooling/. In a standard rack-mounted server, there is no room for it. NVIDIA DIGITS workstation is air-cooled.
Nice link and pictures. Thanks! :-)
Out of curiosity, I had a play with looking at the effect of load on the Nimbix instances. I ran https://gist.github.com/hughperkins/6194efd67ad7fcbf5678b1285cc45327 with no arguments (except -gpu 1 for one of the GPUs) on a dual Titan X instance, with one process on one GPU and the other process on the other. It just runs VGG model 'a' forward, with a batch size of 128, on cudnnv4, with cudnn.fastest = true set.
When cold, the forward time was ~0.524s.
After running for ~10 minutes or so:
- GPU 1 was stable at 67C, forward time 0.539s
- GPU 2 was stable at 75C, forward time 0.535s
In other words:
- the difference in perf, on these GPUs, between cold and hot was ~2.8% for GPU 1 and ~2% for GPU 2, which seems fairly small compared to the differences in benchmark results that we're mostly concerned with
- these GPUs are running pretty cool, nowhere near 85 Celsius
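For reference, those percentages fall straight out of the reported timings:

```python
cold = 0.524                           # forward time when cold, seconds
hot = {"GPU 1": 0.539, "GPU 2": 0.535}
for gpu, t in hot.items():
    # roughly 2.8-2.9% for GPU 1 and ~2.1% for GPU 2, depending on which
    # time is used as the denominator and how you round
    print(f"{gpu}: {(t / cold - 1) * 100:.1f}% slower hot than cold")
```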
