
3 tests fail ... CUDA toolkit 7.0, nvidia driver 346.72

Open wolfchimneyrock opened this issue 9 years ago • 10 comments

no test: 1
convolution iterations: 10
Using GPU device 0
Running test convolve_maxdnn_alexnet_conv1
Running test convolve_cudnn_alexnet_conv1
Running test convolve_maxdnn_alexnet_conv2
Running test convolve_cudnn_alexnet_conv2
Running test convolve_maxdnn_alexnet_conv3
Running test convolve_cudnn_alexnet_conv3
Running test convolve_maxdnn_alexnet_conv4
Running test convolve_cudnn_alexnet_conv4
Running test convolve_maxdnn_alexnet_conv5
Running test convolve_cudnn_alexnet_conv5
Running test convolve_maxdnn_convnet_benchmarks_L1
Running test convolve_cudnn_convnet_benchmarks_L1
Running test convolve_maxdnn_convnet_benchmarks_L2
Running test convolve_cudnn_convnet_benchmarks_L2
conv_test.cpp:416:1: error: Failure in convolve_cudnn_convnet_benchmarks_L2: Unhandled exception: Failed to allocate 18446744073443213312 bytes from GPU
Running test convolve_maxdnn_convnet_benchmarks_L3
Running test convolve_cudnn_convnet_benchmarks_L3
conv_test.cpp:426:1: error: Failure in convolve_cudnn_convnet_benchmarks_L3: Unhandled exception: Failed to allocate 18446744072472231936 bytes from GPU
Running test convolve_maxdnn_convnet_benchmarks_L4
Running test convolve_cudnn_convnet_benchmarks_L4
conv_test.cpp:436:1: error: Failure in convolve_cudnn_convnet_benchmarks_L4: Unhandled exception: CUDNN_STATUS_EXECUTION_FAILED
Running test convolve_maxdnn_convnet_benchmarks_L5
Running test convolve_cudnn_convnet_benchmarks_L5
Running test convolve_maxdnn_overfeat_L1
Running test convolve_cudnn_overfeat_L1
Running test convolve_maxdnn_overfeat_L2
Running test convolve_cudnn_overfeat_L2
Running test convolve_maxdnn_overfeat_L3
Running test convolve_cudnn_overfeat_L3
Running test convolve_maxdnn_overfeat_L4
Running test convolve_cudnn_overfeat_L4
Running test convolve_maxdnn_overfeat_L5
Running test convolve_cudnn_overfeat_L5
FAILURE: 3 out of 53 tests failed (3 failures).
Test time: 34.62 seconds.

wolfchimneyrock avatar May 26 '15 04:05 wolfchimneyrock

I am still using CUDA 6.5 and Driver version 343.19. Interesting that only the cuDNN tests fail for you.

What graphics card are you using?

I will upgrade to CUDA 7.0 and try to reproduce tomorrow.

Thanks for the report.

andravin avatar May 26 '15 09:05 andravin

I'd try it with CUDNN_CONVOLUTION_FWD_NO_WORKSPACE instead of CUDNN_CONVOLUTION_FWD_PREFER_FASTEST. I have some vague recollection of cuDNN sometimes requesting an unreasonable amount of memory.

-Scott

scott-gray avatar May 26 '15 10:05 scott-gray

Hm, works for me with CUDA 7.0, driver 346.46.

andrew@clive:~/develop/maxDNN/maxdnn$ ldd maxdnn_test.bin
    linux-vdso.so.1 => (0x00007fffc83dd000)
    libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007fa04fda7000)
    libcudart.so.7.0 => /usr/local/cuda/lib64/libcudart.so.7.0 (0x00007fa04fb49000)
    libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa04f940000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa04f73c000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa04f51d000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa04f219000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa04ef13000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa04ecfc000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa04e937000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa04e72f000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fa050c54000)
andrew@clive:~/develop/maxDNN/maxdnn$ nvidia-smi
Tue May 26 02:59:48 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:01:00.0     N/A |                  N/A |

andravin avatar May 26 '15 10:05 andravin

The unreasonable amount of memory in this case being size_t(-1) :-)
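(For scale: 18446744073443213312 is 2^64 minus 266338304, i.e. exactly 254 MiB short of 2^64, so the failed request looks like a small negative byte count that wrapped around to an unsigned 64-bit value rather than a literal size_t(-1).)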

Let's follow Scott's advice and try CUDNN_CONVOLUTION_FWD_NO_WORKSPACE by commenting out line convolution_cudnn.cpp:90 and uncommenting the next line.
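For reference, the swap being suggested amounts to something like the following. This is a sketch against the v1/v2-era cuDNN API with placeholder handle and descriptor names, not the literal lines from convolution_cudnn.cpp:

```cpp
// Sketch only: ask cuDNN for a forward algorithm that needs no workspace,
// instead of letting it pick the fastest (and most memory-hungry) one.
cudnnConvolutionFwdAlgo_t algo;
cudnnStatus_t status = cudnnGetConvolutionForwardAlgorithm(
    handle,                                   // cudnnHandle_t (placeholder)
    srcDesc, filterDesc, convDesc, dstDesc,   // placeholder descriptors
    CUDNN_CONVOLUTION_FWD_NO_WORKSPACE,       // was CUDNN_CONVOLUTION_FWD_PREFER_FASTEST
    0,                                        // workspace memory limit (ignored here)
    &algo);
```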

andravin avatar May 26 '15 10:05 andravin

OK, will try that today. I have a GTX 860M with 4 GB; it makes sense that size_t(-1) would fail with just 4 GB, so I assume you have more than 4 GB?

wolfchimneyrock avatar May 26 '15 11:05 wolfchimneyrock

success:
no test: 1
convolution iterations: 10
Using GPU device 0
Running test convolve_maxdnn_alexnet_conv1
Running test convolve_cudnn_alexnet_conv1
Running test convolve_maxdnn_alexnet_conv2
Running test convolve_cudnn_alexnet_conv2
Running test convolve_maxdnn_alexnet_conv3
Running test convolve_cudnn_alexnet_conv3
Running test convolve_maxdnn_alexnet_conv4
Running test convolve_cudnn_alexnet_conv4
Running test convolve_maxdnn_alexnet_conv5
Running test convolve_cudnn_alexnet_conv5
Running test convolve_maxdnn_convnet_benchmarks_L1
Running test convolve_cudnn_convnet_benchmarks_L1
Running test convolve_maxdnn_convnet_benchmarks_L2
Running test convolve_cudnn_convnet_benchmarks_L2
Running test convolve_maxdnn_convnet_benchmarks_L3
Running test convolve_cudnn_convnet_benchmarks_L3
Running test convolve_maxdnn_convnet_benchmarks_L4
Running test convolve_cudnn_convnet_benchmarks_L4
Running test convolve_maxdnn_convnet_benchmarks_L5
Running test convolve_cudnn_convnet_benchmarks_L5
Running test convolve_maxdnn_overfeat_L1
Running test convolve_cudnn_overfeat_L1
Running test convolve_maxdnn_overfeat_L2
Running test convolve_cudnn_overfeat_L2
Running test convolve_maxdnn_overfeat_L3
Running test convolve_cudnn_overfeat_L3
Running test convolve_maxdnn_overfeat_L4
Running test convolve_cudnn_overfeat_L4
Running test convolve_maxdnn_overfeat_L5
Running test convolve_cudnn_overfeat_L5
Success: 53 tests passed.
Test time: 46.82 seconds.

wolfchimneyrock avatar May 26 '15 12:05 wolfchimneyrock

My GTX 980 is 4 GB, but we have different Maxwell chips: yours is GM107, I believe, while mine is GM204. Thanks for giving me test results for this GPU.

So it looks like we found a bug in cuDNN's error handling. The workaround would be to check the workspace size, and if it equals (size_t)-1, fall back to the no-workspace algorithm.

I will put a fix together today. In the meantime you can just leave the fast algorithm disabled.
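Roughly, the fallback described above could look like this. It is a sketch with placeholder descriptor names and a placeholder checkCUDNN error macro, not the actual maxDNN patch:

```cpp
// Sketch of the workaround: if cuDNN reports a nonsensical workspace size
// for the fastest algorithm, fall back to the no-workspace algorithm
// instead of trying (and failing) to allocate it.
cudnnConvolutionFwdAlgo_t algo;
checkCUDNN(cudnnGetConvolutionForwardAlgorithm(
    handle, srcDesc, filterDesc, convDesc, dstDesc,
    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, 0, &algo));

size_t workspaceSize = 0;
checkCUDNN(cudnnGetConvolutionForwardWorkspaceSize(
    handle, srcDesc, filterDesc, convDesc, dstDesc, algo, &workspaceSize));

if (workspaceSize == size_t(-1)) {
    // cuDNN's answer cannot be trusted; use the algorithm that needs no workspace.
    checkCUDNN(cudnnGetConvolutionForwardAlgorithm(
        handle, srcDesc, filterDesc, convDesc, dstDesc,
        CUDNN_CONVOLUTION_FWD_NO_WORKSPACE, 0, &algo));
    workspaceSize = 0;
}
```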

andravin avatar May 26 '15 16:05 andravin

I created a fix in my development branch and issued the above pull request. Can somebody with a GM107 verify?

By the way, the insane workspace size is not exactly (size_t)-1, so I just set a default max workspace size which you can override by setting environment variable maxdnn_max_workspace_size.
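As a rough illustration of that kind of cap (the environment variable name comes from the comment above; the default value and helper function are hypothetical, not the exact code in the pull request):

```cpp
#include <cstdlib>

// Hypothetical sketch: cap the workspace cuDNN may request, with an
// override via the maxdnn_max_workspace_size environment variable.
static size_t getMaxWorkspaceSize()
{
    const size_t defaultMaxWorkspace = size_t(1) << 30;  // assumed 1 GiB default
    if (const char* env = std::getenv("maxdnn_max_workspace_size")) {
        return std::strtoull(env, nullptr, 10);
    }
    return defaultMaxWorkspace;
}

// Any workspace request above this cap would then trigger the
// no-workspace fallback sketched earlier in the thread.
```

Usage would presumably be something like `maxdnn_max_workspace_size=268435456 ./maxdnn_test.bin` to cap the workspace at 256 MB, assuming the value is interpreted in bytes.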

andravin avatar May 26 '15 22:05 andravin

I can verify tomorrow on the GM107

wolfchimneyrock avatar May 27 '15 16:05 wolfchimneyrock

I will leave this issue open until my pull request is accepted. In the meantime, anybody who is blocked by this bug can use maxDNN from my fork, here: https://github.com/andravin/maxDNN

andravin avatar Jun 01 '15 18:06 andravin