maxDNN
3 tests fail with CUDA toolkit 7.0, NVIDIA driver 346.72
no test: 1
convolution iterations: 10
Using GPU device 0
Running test convolve_maxdnn_alexnet_conv1
Running test convolve_cudnn_alexnet_conv1
Running test convolve_maxdnn_alexnet_conv2
Running test convolve_cudnn_alexnet_conv2
Running test convolve_maxdnn_alexnet_conv3
Running test convolve_cudnn_alexnet_conv3
Running test convolve_maxdnn_alexnet_conv4
Running test convolve_cudnn_alexnet_conv4
Running test convolve_maxdnn_alexnet_conv5
Running test convolve_cudnn_alexnet_conv5
Running test convolve_maxdnn_convnet_benchmarks_L1
Running test convolve_cudnn_convnet_benchmarks_L1
Running test convolve_maxdnn_convnet_benchmarks_L2
Running test convolve_cudnn_convnet_benchmarks_L2
conv_test.cpp:416:1: error: Failure in convolve_cudnn_convnet_benchmarks_L2: Unhandled exception: Failed to allocate 18446744073443213312 bytes from GPU
Running test convolve_maxdnn_convnet_benchmarks_L3
Running test convolve_cudnn_convnet_benchmarks_L3
conv_test.cpp:426:1: error: Failure in convolve_cudnn_convnet_benchmarks_L3: Unhandled exception: Failed to allocate 18446744072472231936 bytes from GPU
Running test convolve_maxdnn_convnet_benchmarks_L4
Running test convolve_cudnn_convnet_benchmarks_L4
conv_test.cpp:436:1: error: Failure in convolve_cudnn_convnet_benchmarks_L4: Unhandled exception: CUDNN_STATUS_EXECUTION_FAILED
Running test convolve_maxdnn_convnet_benchmarks_L5
Running test convolve_cudnn_convnet_benchmarks_L5
Running test convolve_maxdnn_overfeat_L1
Running test convolve_cudnn_overfeat_L1
Running test convolve_maxdnn_overfeat_L2
Running test convolve_cudnn_overfeat_L2
Running test convolve_maxdnn_overfeat_L3
Running test convolve_cudnn_overfeat_L3
Running test convolve_maxdnn_overfeat_L4
Running test convolve_cudnn_overfeat_L4
Running test convolve_maxdnn_overfeat_L5
Running test convolve_cudnn_overfeat_L5
FAILURE: 3 out of 53 tests failed (3 failures). Test time: 34.62 seconds.
I am still using CUDA 6.5 and Driver version 343.19. Interesting that only the cuDNN tests fail for you.
What graphics card are you using?
I will upgrade to CUDA 7.0 and try to reproduce tomorrow.
Thanks for the report.
I'd try it with CUDNN_CONVOLUTION_FWD_NO_WORKSPACE instead of CUDNN_CONVOLUTION_FWD_PREFER_FASTEST. I have some vague recollection of cuDNN sometimes requesting an unreasonable amount of memory.
-Scott
Hm, works for me with CUDA 7.0, driver 346.46.
andrew@clive:~/develop/maxDNN/maxdnn$ ldd maxdnn_test.bin
linux-vdso.so.1 => (0x00007fffc83dd000)
libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007fa04fda7000)
libcudart.so.7.0 => /usr/local/cuda/lib64/libcudart.so.7.0 (0x00007fa04fb49000)
libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa04f940000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa04f73c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa04f51d000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa04f219000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa04ef13000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa04ecfc000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa04e937000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa04e72f000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa050c54000)
andrew@clive:~/develop/maxDNN/maxdnn$ nvidia-smi
Tue May 26 02:59:48 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46 Driver Version: 346.46 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Off | 0000:01:00.0 N/A | N/A |
The unreasonable amount of memory in this case being size_t(-1) :-)
Let's follow Scott's advice and try CUDNN_CONVOLUTION_FWD_NO_WORKSPACE by commenting out line convolution_cudnn.cpp:90 and uncommenting the next line.
OK, I will try that today. I have a GTX 860M with 4 GB; it makes sense that size_t(-1) would fail with just 4 GB, so I assume you have more than 4 GB?
Success:

no test: 1
convolution iterations: 10
Using GPU device 0
Running test convolve_maxdnn_alexnet_conv1
Running test convolve_cudnn_alexnet_conv1
Running test convolve_maxdnn_alexnet_conv2
Running test convolve_cudnn_alexnet_conv2
Running test convolve_maxdnn_alexnet_conv3
Running test convolve_cudnn_alexnet_conv3
Running test convolve_maxdnn_alexnet_conv4
Running test convolve_cudnn_alexnet_conv4
Running test convolve_maxdnn_alexnet_conv5
Running test convolve_cudnn_alexnet_conv5
Running test convolve_maxdnn_convnet_benchmarks_L1
Running test convolve_cudnn_convnet_benchmarks_L1
Running test convolve_maxdnn_convnet_benchmarks_L2
Running test convolve_cudnn_convnet_benchmarks_L2
Running test convolve_maxdnn_convnet_benchmarks_L3
Running test convolve_cudnn_convnet_benchmarks_L3
Running test convolve_maxdnn_convnet_benchmarks_L4
Running test convolve_cudnn_convnet_benchmarks_L4
Running test convolve_maxdnn_convnet_benchmarks_L5
Running test convolve_cudnn_convnet_benchmarks_L5
Running test convolve_maxdnn_overfeat_L1
Running test convolve_cudnn_overfeat_L1
Running test convolve_maxdnn_overfeat_L2
Running test convolve_cudnn_overfeat_L2
Running test convolve_maxdnn_overfeat_L3
Running test convolve_cudnn_overfeat_L3
Running test convolve_maxdnn_overfeat_L4
Running test convolve_cudnn_overfeat_L4
Running test convolve_maxdnn_overfeat_L5
Running test convolve_cudnn_overfeat_L5
Success: 53 tests passed. Test time: 46.82 seconds.
My GTX 980 is also 4 GB, but we have different Maxwell chips: yours is GM107, I believe, while mine is GM204. Thanks for giving me test results for this GPU.
So it looks like we found a bug in cuDNN's error handling. The workaround would be to check the workspace size and, if it equals (size_t)-1, fall back to the no-workspace algorithm.
I will put a fix together today. In the meantime you can just leave the fast algorithm disabled.
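The workaround described above can be sketched roughly like this (a minimal sketch, not maxDNN's actual implementation; the function name and the cap value are illustrative assumptions): before trusting the workspace size that cuDNN reports for the fastest algorithm, compare it against a sane upper bound, and fall back to the no-workspace algorithm when it is implausible.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of the fallback logic (not maxDNN's actual code).
// reportedWorkspaceBytes is the size cuDNN claims it needs for the
// CUDNN_CONVOLUTION_FWD_PREFER_FASTEST algorithm; in the buggy case this is
// a value near (size_t)-1, e.g. 18446744073443213312 in the failing tests.
std::string chooseForwardAlgorithm(std::size_t reportedWorkspaceBytes,
                                   std::size_t maxWorkspaceBytes) {
    if (reportedWorkspaceBytes > maxWorkspaceBytes) {
        // cuDNN asked for more memory than we are willing (or able) to give;
        // use the algorithm that needs no scratch workspace instead.
        return "CUDNN_CONVOLUTION_FWD_NO_WORKSPACE";
    }
    return "CUDNN_CONVOLUTION_FWD_PREFER_FASTEST";
}
```

With a cap of, say, 512 MiB, the bogus 18-exabyte request from the failing tests would trip the fallback, while a normal workspace request of a few megabytes would keep the fast path.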
I created a fix in my development branch and issued the above pull request. Can somebody with a GM107 verify?
By the way, the insane workspace size is not exactly (size_t)-1, so I just set a default maximum workspace size, which you can override by setting the environment variable maxdnn_max_workspace_size.
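Reading such an override could look like the sketch below (an assumption about how maxdnn_max_workspace_size might be consumed, not the fix's actual code; the 256 MiB default is made up for illustration): parse the variable if it is set and well-formed, otherwise keep the built-in default.

```cpp
#include <cstdlib>
#include <cstddef>

// Hypothetical sketch: resolve the maximum cuDNN workspace size, allowing
// the maxdnn_max_workspace_size environment variable (in bytes) to override
// an assumed built-in default of 256 MiB.
std::size_t maxWorkspaceSize() {
    const std::size_t kDefault = 256ULL << 20;  // illustrative default
    if (const char* env = std::getenv("maxdnn_max_workspace_size")) {
        char* end = nullptr;
        unsigned long long v = std::strtoull(env, &end, 10);
        // Accept the override only if the whole string parsed as a number.
        if (end != env && *end == '\0') {
            return static_cast<std::size_t>(v);
        }
    }
    return kDefault;
}
```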
I can verify tomorrow on the GM107.
I will leave this issue open until my pull request is accepted. In the meantime, anybody who is blocked by this bug can use maxDNN from my fork, here: https://github.com/andravin/maxDNN