Intel Beignet spatial convolution OpenCL compile failure
@gongzg Totally stuck here, tried to find out the cause for hours. Any ideas?
- Beignet git master from today (21.07.2016)
- Fedora 24, CLANG-3.8, LLVM-3.8, GCC-6.1.1
- Command: ./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5
- Compiled with USE_GREENTEA, USE_LIBDNN, USE_INTEL_SPATIAL
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
*** Aborted at 1469222493 (unix time) try "date -d @1469222493" if you are using GNU date ***
PC: @ 0x7fb9004d78b6 llvm::BasicBlock::getTerminator()
*** SIGSEGV (@0x30000024c) received by PID 3242 (TID 0x7fb916044a40) from PID 588; stack trace: ***
@ 0x7fb90fcffc30 (unknown)
@ 0x7fb9004d78b6 llvm::BasicBlock::getTerminator()
@ 0x7fb900425b05 llvm::LoopBase<>::getExitBlocks()
@ 0x7fb900427c65 llvm::Loop::hasDedicatedExits()
@ 0x7fb900427e16 llvm::Loop::getLoopID()
@ 0x7fb8ffde530c gbe::CustomLoopUnroll::GetUnrollMetadataValue()
@ 0x7fb8ffde5e2a gbe::CustomLoopUnroll::runOnLoop()
@ 0x7fb90043100b llvm::LPPassManager::runOnFunction()
@ 0x7fb9005e2528 llvm::FPPassManager::runOnFunction()
@ 0x7fb9003be3d7 (anonymous namespace)::CGPassManager::runOnModule()
@ 0x7fb9005e2bdd llvm::legacy::PassManagerImpl::run()
@ 0x7fb8ffde02fb gbe::runModulePass()
@ 0x7fb8ffde088e gbe::llvmToGen()
@ 0x7fb8ffd38023 gbe::Program::buildFromLLVMFile()
@ 0x7fb8fff3f8b9 gbe::genProgramNewFromLLVM()
@ 0x7fb8ffd3c7b5 gbe::programNewFromSource()
@ 0x7fb900f76509 cl_program_build
@ 0x7fb900f69d98 clBuildProgram
@ 0x7fb915a8524a viennacl::ocl::context::add_program()
@ 0x7fb915a82830 caffe::submit_conv_spatial_program()
@ 0x7fb915b55f28 caffe::ConvolutionLayerSpatial<>::setup_IDLF()
@ 0x7fb915b56a75 caffe::ConvolutionLayerSpatial<>::setup_convolution()
@ 0x7fb915b58a01 caffe::ConvolutionLayerSpatial<>::Forward_gpu()
@ 0x7fb915a51e82 caffe::Net<>::ForwardFromTo()
@ 0x7fb915a51f77 caffe::Net<>::Forward()
@ 0x414d93 time()
@ 0x40ea4e main
@ 0x7fb90f94d731 __libc_start_main
@ 0x40f389 _start
Segmentation fault (core dumped)
@gongzg Maybe worth noting that ./build/test/test_all.testbin --gtest_filter=*Spatial* 1 passes without errors. But it fails on the actual AlexNet model, which would be the interesting benchmark.
@naibaf7 The error message indicates this is an LLVM-related issue. I would suggest switching to LLVM 3.6 to have a try. If you still have any issue, please let me know.
@naibaf7 Another quick thing to try is to open the file backend/src/llvm/llvm_to_gen.cpp, find the following code, and comment out the "MPM.add(createCustomLoopUnrollPass());" line. Then have a try with your current LLVM version. This is not recommended, though: I doubt beignet is tested with this LLVM version, and I don't know whether there are other issues. Anyway, for your reference:
#if !defined(ANDROID)
  MPM.add(createCustomLoopUnrollPass()); //1024, 32, 1024, 512)); //Unroll loops
#endif
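For reference, a minimal sketch of what that section of backend/src/llvm/llvm_to_gen.cpp would look like with the pass disabled (only the MPM.add line changes; the surrounding guard is as quoted above):

#if !defined(ANDROID)
  // MPM.add(createCustomLoopUnrollPass()); //1024, 32, 1024, 512)); //Unroll loops
#endif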
@gongzg Ok, good to know. Interesting, though, that some of the code (for example the tests) does work. I can't downgrade to llvm-3.6 or llvm-3.7 on Fedora 24 without breaking the graphics driver (X11 won't start anymore for some reason), so I'll try the "dirty fix" instead and hope beignet development catches up soon.
@gongzg Ok nice, I finally got it working. But the performance doesn't quite make sense to me. Do you know what could have gone wrong now?
Command:
./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=1 -iterations=5
Results (average forward pass time):
- Default convolution engine (ViennaCL-BLAS): 2.2 seconds
- LibDNN convolution engine (unreleased version 2): 1.4 seconds
- Spatial convolution engine (Intel): 4.4 seconds
@naibaf7 Did you check the per-layer performance breakdown? I used to see very bad GEMM performance with either ISAAC or ViennaCL BLAS, and most of the time goes to the convolution backward path and the FC layers. Also, I see you specified GPU device 1; do you have more than one OpenCL device in your system?
@gongzg Yes, but apart from the forward convolution, all variants use the same kernels on all other layers. And yes, device 0 is the Intel CPU OpenCL device on my system.
- Default convolution engine (clBLAS 2.4): 2.7 seconds, so a bit slower than ViennaCL-BLAS.
I0723 16:36:23.906051 30903 common.cpp:373] Total devices: 2
I0723 16:36:23.906175 30903 common.cpp:374] CUDA devices: 0
I0723 16:36:23.906180 30903 common.cpp:375] OpenCL devices: 2
I0723 16:36:23.906185 30903 common.cpp:399] Device id: 0
I0723 16:36:23.906189 30903 common.cpp:401] Device backend: OpenCL
I0723 16:36:23.906213 30903 common.cpp:403] Backend details: Intel(R) Corporation: OpenCL 1.2 LINUX
I0723 16:36:23.906244 30903 common.cpp:405] Device vendor: Intel(R) Corporation
I0723 16:36:23.906265 30903 common.cpp:407] Name: Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
I0723 16:36:23.906518 30903 common.cpp:409] Total global memory: 7917195264
I0723 16:36:23.906525 30903 common.cpp:399] Device id: 1
I0723 16:36:23.906529 30903 common.cpp:401] Device backend: OpenCL
I0723 16:36:23.906535 30903 common.cpp:403] Backend details: Intel: OpenCL 1.2 beignet 1.2 (git-b55060c)
I0723 16:36:23.906539 30903 common.cpp:405] Device vendor: Intel
I0723 16:36:23.906543 30903 common.cpp:407] Name: Intel(R) HD Graphics Skylake ULT GT3
I0723 16:36:23.906565 30903 common.cpp:409] Total global memory: 3958374400
@naibaf7 Could you share the average forward time and backward time? The backward time is really slow. For example, on my BDW GT2 machine, what I get from benchmark64.prototxt is: average forward pass: 832 ms, average backward pass: 3834 ms.
I believe the LibDNN engine should be much faster in the backward pass.
@naibaf7 I just did a test on a BDW GT3e machine and got the following performance numbers with the spatial convolution engine: average forward time: 277.8 ms, average backward time: 1139 ms. This GPU should be very close to your machine, but I'm using the OpenCL SDK. I will find a SKL GT3 machine next week and use beignet to do some tests.
@gongzg That's interesting, so the BDW GT2 is faster than the Skylake? Which engine did you use for the numbers you posted there? Here is the ViennaCL BLAS per-layer breakdown:
I0723 16:53:57.989781 658 caffe.cpp:450] Average time per layer:
I0723 16:53:57.989796 658 caffe.cpp:453] data forward: 0.100127 ms.
I0723 16:53:57.989814 658 caffe.cpp:456] data backward: 0.097255 ms.
I0723 16:53:57.989830 658 caffe.cpp:453] label forward: 0.0986646 ms.
I0723 16:53:57.989845 658 caffe.cpp:456] label backward: 0.116754 ms.
I0723 16:53:57.989859 658 caffe.cpp:453] conv1 forward: 197.391 ms.
I0723 16:53:57.989874 658 caffe.cpp:456] conv1 backward: 339.864 ms.
I0723 16:53:57.989889 658 caffe.cpp:453] relu1 forward: 7.03235 ms.
I0723 16:53:57.989902 658 caffe.cpp:456] relu1 backward: 13.2872 ms.
I0723 16:53:57.989917 658 caffe.cpp:453] norm1 forward: 28.9102 ms.
I0723 16:53:57.989931 658 caffe.cpp:456] norm1 backward: 40.849 ms.
I0723 16:53:57.989946 658 caffe.cpp:453] pool1 forward: 6.65354 ms.
I0723 16:53:57.989959 658 caffe.cpp:456] pool1 backward: 16.7646 ms.
I0723 16:53:57.989974 658 caffe.cpp:453] conv2 forward: 447.242 ms.
I0723 16:53:57.989987 658 caffe.cpp:456] conv2 backward: 833.264 ms.
I0723 16:53:57.990001 658 caffe.cpp:453] relu2 forward: 3.9086 ms.
I0723 16:53:57.990015 658 caffe.cpp:456] relu2 backward: 7.75015 ms.
I0723 16:53:57.990030 658 caffe.cpp:453] norm2 forward: 17.3098 ms.
I0723 16:53:57.990042 658 caffe.cpp:456] norm2 backward: 23.2308 ms.
I0723 16:53:57.990057 658 caffe.cpp:453] pool2 forward: 3.84905 ms.
I0723 16:53:57.990070 658 caffe.cpp:456] pool2 backward: 10.9517 ms.
I0723 16:53:57.990084 658 caffe.cpp:453] conv3 forward: 314.533 ms.
I0723 16:53:57.990099 658 caffe.cpp:456] conv3 backward: 488.007 ms.
I0723 16:53:57.990113 658 caffe.cpp:453] relu3 forward: 1.30731 ms.
I0723 16:53:57.990128 658 caffe.cpp:456] relu3 backward: 2.15554 ms.
I0723 16:53:57.990159 658 caffe.cpp:453] conv4 forward: 317.564 ms.
I0723 16:53:57.990175 658 caffe.cpp:456] conv4 backward: 412.875 ms.
I0723 16:53:57.990190 658 caffe.cpp:453] relu4 forward: 1.27252 ms.
I0723 16:53:57.990203 658 caffe.cpp:456] relu4 backward: 1.99165 ms.
I0723 16:53:57.990216 658 caffe.cpp:453] conv5 forward: 291.408 ms.
I0723 16:53:57.990231 658 caffe.cpp:456] conv5 backward: 303.204 ms.
I0723 16:53:57.990242 658 caffe.cpp:453] relu5 forward: 0.952221 ms.
I0723 16:53:57.990252 658 caffe.cpp:456] relu5 backward: 1.61467 ms.
I0723 16:53:57.990262 658 caffe.cpp:453] pool5 forward: 1.00883 ms.
I0723 16:53:57.990278 658 caffe.cpp:456] pool5 backward: 2.92136 ms.
I0723 16:53:57.990291 658 caffe.cpp:453] fc6 forward: 45.956 ms.
I0723 16:53:57.990309 658 caffe.cpp:456] fc6 backward: 115.107 ms.
I0723 16:53:57.990329 658 caffe.cpp:453] relu6 forward: 0.345521 ms.
I0723 16:53:57.990345 658 caffe.cpp:456] relu6 backward: 0.37321 ms.
I0723 16:53:57.990361 658 caffe.cpp:453] drop6 forward: 5.01316 ms.
I0723 16:53:57.990378 658 caffe.cpp:456] drop6 backward: 0.438744 ms.
I0723 16:53:57.990397 658 caffe.cpp:453] fc7 forward: 20.8769 ms.
I0723 16:53:57.990417 658 caffe.cpp:456] fc7 backward: 52.161 ms.
I0723 16:53:57.990435 658 caffe.cpp:453] relu7 forward: 0.306402 ms.
I0723 16:53:57.990453 658 caffe.cpp:456] relu7 backward: 0.348346 ms.
I0723 16:53:57.990468 658 caffe.cpp:453] drop7 forward: 3.98771 ms.
I0723 16:53:57.990483 658 caffe.cpp:456] drop7 backward: 0.354259 ms.
I0723 16:53:57.990494 658 caffe.cpp:453] fc8 forward: 9.74994 ms.
I0723 16:53:57.990505 658 caffe.cpp:456] fc8 backward: 13.3348 ms.
I0723 16:53:57.990514 658 caffe.cpp:453] loss forward: 1.94023 ms.
I0723 16:53:57.990525 658 caffe.cpp:456] loss backward: 0.461751 ms.
I0723 16:53:57.990648 658 caffe.cpp:461] Average Forward pass: 1738.39 ms.
I0723 16:53:57.990667 658 caffe.cpp:463] Average Backward pass: 2691.41 ms.
I0723 16:53:57.990722 658 caffe.cpp:465] Average Forward-Backward: 4431.83 ms.
I0723 16:53:57.990741 658 caffe.cpp:467] Total Time: 22159.2 ms.
Intel spatial:
I0723 17:02:54.607259 1730 caffe.cpp:450] Average time per layer:
I0723 17:02:54.607276 1730 caffe.cpp:453] data forward: 0.116164 ms.
I0723 17:02:54.607296 1730 caffe.cpp:456] data backward: 0.123356 ms.
I0723 17:02:54.607314 1730 caffe.cpp:453] label forward: 0.107694 ms.
I0723 17:02:54.607331 1730 caffe.cpp:456] label backward: 0.188441 ms.
I0723 17:02:54.607347 1730 caffe.cpp:453] conv1 forward: 443.612 ms.
I0723 17:02:54.607363 1730 caffe.cpp:456] conv1 backward: 427.156 ms.
I0723 17:02:54.607379 1730 caffe.cpp:453] relu1 forward: 8.7127 ms.
I0723 17:02:54.607419 1730 caffe.cpp:456] relu1 backward: 15.2398 ms.
I0723 17:02:54.607455 1730 caffe.cpp:453] norm1 forward: 41.9368 ms.
I0723 17:02:54.607475 1730 caffe.cpp:456] norm1 backward: 62.7724 ms.
I0723 17:02:54.607496 1730 caffe.cpp:453] pool1 forward: 9.26116 ms.
I0723 17:02:54.607522 1730 caffe.cpp:456] pool1 backward: 28.762 ms.
I0723 17:02:54.607568 1730 caffe.cpp:453] conv2 forward: 1657.64 ms.
I0723 17:02:54.607631 1730 caffe.cpp:456] conv2 backward: 1108.42 ms.
I0723 17:02:54.607692 1730 caffe.cpp:453] relu2 forward: 7.24185 ms.
I0723 17:02:54.607743 1730 caffe.cpp:456] relu2 backward: 10.7396 ms.
I0723 17:02:54.607791 1730 caffe.cpp:453] norm2 forward: 28.8983 ms.
I0723 17:02:54.607834 1730 caffe.cpp:456] norm2 backward: 36.666 ms.
I0723 17:02:54.607883 1730 caffe.cpp:453] pool2 forward: 4.96558 ms.
I0723 17:02:54.607934 1730 caffe.cpp:456] pool2 backward: 17.8944 ms.
I0723 17:02:54.608018 1730 caffe.cpp:453] conv3 forward: 835.374 ms.
I0723 17:02:54.608065 1730 caffe.cpp:456] conv3 backward: 658.829 ms.
I0723 17:02:54.608108 1730 caffe.cpp:453] relu3 forward: 1.92744 ms.
I0723 17:02:54.608126 1730 caffe.cpp:456] relu3 backward: 4.70381 ms.
I0723 17:02:54.608196 1730 caffe.cpp:453] conv4 forward: 807.812 ms.
I0723 17:02:54.608238 1730 caffe.cpp:456] conv4 backward: 568.898 ms.
I0723 17:02:54.608268 1730 caffe.cpp:453] relu4 forward: 3.42795 ms.
I0723 17:02:54.608296 1730 caffe.cpp:456] relu4 backward: 3.19371 ms.
I0723 17:02:54.608325 1730 caffe.cpp:453] conv5 forward: 625.251 ms.
I0723 17:02:54.608355 1730 caffe.cpp:456] conv5 backward: 432.836 ms.
I0723 17:02:54.608387 1730 caffe.cpp:453] relu5 forward: 1.42619 ms.
I0723 17:02:54.608417 1730 caffe.cpp:456] relu5 backward: 3.28224 ms.
I0723 17:02:54.608445 1730 caffe.cpp:453] pool5 forward: 1.41549 ms.
I0723 17:02:54.608475 1730 caffe.cpp:456] pool5 backward: 3.6956 ms.
I0723 17:02:54.608507 1730 caffe.cpp:453] fc6 forward: 67.7576 ms.
I0723 17:02:54.608623 1730 caffe.cpp:456] fc6 backward: 158.356 ms.
I0723 17:02:54.608661 1730 caffe.cpp:453] relu6 forward: 0.383532 ms.
I0723 17:02:54.608695 1730 caffe.cpp:456] relu6 backward: 0.39943 ms.
I0723 17:02:54.608728 1730 caffe.cpp:453] drop6 forward: 5.45477 ms.
I0723 17:02:54.608758 1730 caffe.cpp:456] drop6 backward: 0.501933 ms.
I0723 17:02:54.608789 1730 caffe.cpp:453] fc7 forward: 36.5435 ms.
I0723 17:02:54.608824 1730 caffe.cpp:456] fc7 backward: 73.022 ms.
I0723 17:02:54.608857 1730 caffe.cpp:453] relu7 forward: 0.376915 ms.
I0723 17:02:54.608889 1730 caffe.cpp:456] relu7 backward: 0.325761 ms.
I0723 17:02:54.608927 1730 caffe.cpp:453] drop7 forward: 4.93873 ms.
I0723 17:02:54.608959 1730 caffe.cpp:456] drop7 backward: 0.372964 ms.
I0723 17:02:54.609012 1730 caffe.cpp:453] fc8 forward: 14.0754 ms.
I0723 17:02:54.609050 1730 caffe.cpp:456] fc8 backward: 22.107 ms.
I0723 17:02:54.609099 1730 caffe.cpp:453] loss forward: 2.38256 ms.
I0723 17:02:54.609174 1730 caffe.cpp:456] loss backward: 0.419859 ms.
I0723 17:02:54.609408 1730 caffe.cpp:461] Average Forward pass: 4624.77 ms.
I0723 17:02:54.609439 1730 caffe.cpp:463] Average Backward pass: 3652.86 ms.
I0723 17:02:54.609529 1730 caffe.cpp:465] Average Forward-Backward: 8282.62 ms.
I0723 17:02:54.609557 1730 caffe.cpp:467] Total Time: 41413.1 ms.
The numbers are vastly different from yours, so I believe there must be something wrong.
@naibaf7 Oh, definitely not. Your SKL machine should be much faster than my GT2 machine, and should be comparable with the GT3e machine or even faster. From the log you pasted above:
I0723 17:02:54.607347 1730 caffe.cpp:453] conv1 forward: 443.612 ms.
I0723 17:02:54.607363 1730 caffe.cpp:456] conv1 backward: 427.156 ms.
I highly doubt you were really using the spatial engine. You can easily uncomment the following line in the spatial convolution source code:
// #define dbg
Then, please remove .spatialkernels/* and re-run the benchmark. It will show the tuning process and print GFLOPS for each tuned kernel and the final winner kernel.
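For clarity, a minimal sketch of that change (only the macro line is taken from the comment above; the rest of the spatial convolution source is unchanged):

// before: tuning/debug output disabled
// #define dbg
// after: tuning/debug output enabled, which prints the per-kernel GFLOPS lines shown below
#define dbg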
@gongzg It should be using the spatial kernels, as it spent a really long time tuning on the first run. But here we go; the output does not look good:
Verification was not successful, fallback to basic kernel
Bechmarking kernel: U5_5_96_2_1_1_1_31_31_64_2_128_1_1_1_1_BASIC
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Estimated Gflops:28.6535
Estimated GFLOPS/S: 21.8862
Convolution Time:1309.2
@naibaf7 Thanks for the log. And now I know the reason:
Verification was not successful, fallback to basic kernel
Bechmarking kernel: U5_5_96_2_1_1_1_31_31_64_2_128_1_1_1_1_BASIC
Beignet is broken on your system: it can't produce correct results with the optimized spatial kernel, so it falls back to the naive basic kernel. That's why you are getting bad performance numbers. We may need the beignet team's support again to find out why your beignet build is broken.
@gongzg Yeah, it's much more difficult to get working than I thought it would be... Do you know what the status of beignet on Skylake (Iris Pro) is? Basically all Caffe tests pass fine and LibDNN verifies correctly, no issues there (with both the beignet-1.1.1 that comes with Fedora 24 and the beignet-1.2 that I compiled from the current beignet master). But the Intel spatial convolution does not pass verification with this setup so far. From what I can tell, the difference is that the spatial convolution uses Intel-specific extensions, and these do not seem to work?
@naibaf7 See the devices list https://cgit.freedesktop.org/beignet/tree/src/cl_device_id.c
Intel extensions in beignet are in https://cgit.freedesktop.org/beignet/tree/include/CL/cl_intel.h
@bhack Yeah, it's "Intel(R) HD Graphics Skylake ULT GT3", which is in the list, so that should be fine. Thanks.
I think the problem could be the libdrm and kernel versions. What versions of both are you using?
Kernel: 4.6.4-301.fc24.x86_64
Libdrm:
- Package libdrm-devel-2.4.68-1.fc24.x86_64
- Package libdrm-debuginfo-2.4.61-3.fc22.x86_64
- Package libdrm-2.4.68-1.fc24.x86_64
- Package libdrm-2.4.68-1.fc24.i686
Mhh... Can you add a print of fixed_local_sz[i] inside the loop, before the modulo, at https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3031?
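A minimal sketch of such a debug print, assuming the loop still uses the fixed_local_sz[] and realGroupSize identifiers mentioned in this thread (the exact code around the linked line may differ in your beignet checkout, and you may need to add #include <stdio.h> near the top of cl_api.c if it is not already there):

/* inside the loop that picks the local work size, just before the modulo check */
printf("beignet dbg: dim %d fixed_local_sz = %d realGroupSize = %d\n",
       i, (int)fixed_local_sz[i], (int)realGroupSize);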
@gongzg I now tested this with 3 different versions of LLVM and Clang as you suggested:
- 3.6.2 built from scratch/source
- 3.7.4 from the Fedora 23 update repository
- 3.8.x from the Fedora 24 update repository
Cleaned out the .spatialkernels folder for every test, but same result.
Driver is xorg-x11-drv-intel-2.99.917-23.20160512.fc24.x86_64 by the way. Could that be the issue?
Have you tried to debug/print that loop?
I don't know if this Beignet Workgroup guide is still valid.
@bhack Oh sorry I missed your comment on printing the loop, I will follow up on that.
It is important to check whether realGroupSize *= fixed_local_sz[i]; accumulates correctly. If you have compiled with debug symbols, you can also check with gdb breakpoints.
@bhack For AlexNet, the Intel spatial convolution kernels always use a 1,1,16 group size, which is valid for beignet.
@naibaf7 I tested benchmark64 on a SKL GT2 machine with the latest git master beignet and LLVM 3.6 and got the following result:
I0725 04:35:59.524305 32761 caffe.cpp:448] Average time per layer:
I0725 04:35:59.524317 32761 caffe.cpp:451] data forward: 0.088519 ms.
I0725 04:35:59.524334 32761 caffe.cpp:454] data backward: 0.0850156 ms.
I0725 04:35:59.524350 32761 caffe.cpp:451] label forward: 0.0851441 ms.
I0725 04:35:59.524363 32761 caffe.cpp:454] label backward: 0.118466 ms.
I0725 04:35:59.524376 32761 caffe.cpp:451] conv1 forward: 58.4319 ms.
I0725 04:35:59.524392 32761 caffe.cpp:454] conv1 backward: 329.87 ms.
I0725 04:35:59.524408 32761 caffe.cpp:451] relu1 forward: 6.13321 ms.
I0725 04:35:59.524421 32761 caffe.cpp:454] relu1 backward: 9.00438 ms.
I0725 04:35:59.524435 32761 caffe.cpp:451] norm1 forward: 31.4477 ms.
I0725 04:35:59.524451 32761 caffe.cpp:454] norm1 backward: 38.0522 ms.
I0725 04:35:59.524463 32761 caffe.cpp:451] pool1 forward: 7.27156 ms.
I0725 04:35:59.524477 32761 caffe.cpp:454] pool1 backward: 25.028 ms.
I0725 04:35:59.524490 32761 caffe.cpp:451] conv2 forward: 186.484 ms.
I0725 04:35:59.524507 32761 caffe.cpp:454] conv2 backward: 1686.54 ms.
I0725 04:35:59.524520 32761 caffe.cpp:451] relu2 forward: 3.97442 ms.
I0725 04:35:59.524533 32761 caffe.cpp:454] relu2 backward: 5.8414 ms.
I0725 04:35:59.524545 32761 caffe.cpp:451] norm2 forward: 19.9107 ms.
I0725 04:35:59.524560 32761 caffe.cpp:454] norm2 backward: 23.4814 ms.
I0725 04:35:59.524574 32761 caffe.cpp:451] pool2 forward: 4.57914 ms.
I0725 04:35:59.524586 32761 caffe.cpp:454] pool2 backward: 16.3685 ms.
I0725 04:35:59.524600 32761 caffe.cpp:451] conv3 forward: 68.6992 ms.
I0725 04:35:59.524616 32761 caffe.cpp:454] conv3 backward: 628.469 ms.
I0725 04:35:59.524629 32761 caffe.cpp:451] relu3 forward: 1.4288 ms.
I0725 04:35:59.524641 32761 caffe.cpp:454] relu3 backward: 2.28515 ms.
I0725 04:35:59.524654 32761 caffe.cpp:451] conv4 forward: 55.6638 ms.
I0725 04:35:59.524669 32761 caffe.cpp:454] conv4 backward: 512.247 ms.
I0725 04:35:59.524683 32761 caffe.cpp:451] relu4 forward: 1.46054 ms.
I0725 04:35:59.524695 32761 caffe.cpp:454] relu4 backward: 2.3425 ms.
I0725 04:35:59.524708 32761 caffe.cpp:451] conv5 forward: 38.6343 ms.
I0725 04:35:59.524724 32761 caffe.cpp:454] conv5 backward: 365.608 ms.
I0725 04:35:59.524739 32761 caffe.cpp:451] relu5 forward: 0.998164 ms.
I0725 04:35:59.524751 32761 caffe.cpp:454] relu5 backward: 1.74181 ms.
I0725 04:35:59.524765 32761 caffe.cpp:451] pool5 forward: 1.24395 ms.
I0725 04:35:59.524777 32761 caffe.cpp:454] pool5 backward: 3.99459 ms.
I0725 04:35:59.524790 32761 caffe.cpp:451] fc6 forward: 68.0091 ms.
I0725 04:35:59.524806 32761 caffe.cpp:454] fc6 backward: 153.708 ms.
I0725 04:35:59.524821 32761 caffe.cpp:451] relu6 forward: 0.352468 ms.
I0725 04:35:59.524834 32761 caffe.cpp:454] relu6 backward: 0.365035 ms.
I0725 04:35:59.524847 32761 caffe.cpp:451] drop6 forward: 4.93038 ms.
I0725 04:35:59.524860 32761 caffe.cpp:454] drop6 backward: 0.368603 ms.
I0725 04:35:59.524879 32761 caffe.cpp:451] fc7 forward: 29.1046 ms.
I0725 04:35:59.524902 32761 caffe.cpp:454] fc7 backward: 69.5503 ms.
I0725 04:35:59.524927 32761 caffe.cpp:451] relu7 forward: 0.271047 ms.
I0725 04:35:59.524941 32761 caffe.cpp:454] relu7 backward: 0.311824 ms.
I0725 04:35:59.524955 32761 caffe.cpp:451] drop7 forward: 3.00953 ms.
I0725 04:35:59.524968 32761 caffe.cpp:454] drop7 backward: 0.337674 ms.
I0725 04:35:59.524981 32761 caffe.cpp:451] fc8 forward: 9.80128 ms.
I0725 04:35:59.524994 32761 caffe.cpp:454] fc8 backward: 17.2192 ms.
I0725 04:35:59.525054 32761 caffe.cpp:451] loss forward: 1.44299 ms.
I0725 04:35:59.525071 32761 caffe.cpp:454] loss backward: 0.307409 ms.
I0725 04:35:59.525177 32761 caffe.cpp:459] Average Forward pass: 606.389 ms.
I0725 04:35:59.525205 32761 caffe.cpp:461] Average Backward pass: 3901.76 ms.
I0725 04:35:59.525254 32761 caffe.cpp:463] Average Forward-Backward: 4509.35 ms.
I0725 04:35:59.525277 32761 caffe.cpp:465] Total Time: 45093.5 ms.
I0725 04:35:59.525291 32761 caffe.cpp:466] *** Benchmark ends ***
The clinfo:
Number of platforms                     1
  Platform Name                         Intel Gen OCL Driver
  Platform Vendor                       Intel
  Platform Version                      OpenCL 1.2 beignet 1.2 (git-b55060c)
  Platform Profile                      FULL_PROFILE
  Platform Extensions                   cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_motion_estimation cl_intel_subgroups
  Platform Extensions function suffix   Intel

  Platform Name                         Intel Gen OCL Driver
Number of devices                       1
  Device Name                           Intel(R) HD Graphics Skylake Desktop GT2
  Device Vendor                         Intel
  Device Vendor ID                      0x8086
  Device Version                        OpenCL 1.2 beignet 1.2 (git-b55060c)
  Driver Version                        1.2
  Device OpenCL C Version               OpenCL C 1.2 beignet 1.2 (git-b55060c)
  Device Type                           GPU
  Device Profile                        FULL_PROFILE
  Max compute units                     24
  Max clock frequency                   1000MHz
  Device Partition                      (core)
    Max number of sub-devices           1
    Supported partition types           None, None, None
  Max work item dimensions              3
  Max work item sizes                   512x512x512
  Max work group size                   512
  Preferred work group size multiple    16
Kernel information: Linux gongzg-skl 4.6.2-040602-generic #201606100516 SMP Fri Jun 10 09:18:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
So it seems that beignet works fine with some SKL platforms under the above configurations. I will work with the beignet team to try to reproduce your environment and issues.
@naibaf7 Could you share the latest clinfo of your machine here? In the clinfo (clinfo_after) you sent me last week, there is one clover device and one Intel CPU device.
@gongzg How can it enter https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3036 if local_work_size is not NULL?
@bhack Those output messages should not come from the spatial convolution kernels; they must come from some other kernels. The spatial convolution kernels don't use a NULL local work size.
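For context, that Beignet warning is printed when a kernel is enqueued with a NULL local_work_size and the driver cannot derive a good one itself. A hedged sketch of the two call styles (the queue, kernel, and sizes here are placeholders, not Caffe's actual values):

#include <CL/cl.h>

void enqueue_example(cl_command_queue queue, cl_kernel kernel) {
  size_t global[2] = {256, 256};

  /* local_work_size left NULL: beignet has to guess and may print the
     "unable to find good values for local_work_size[i]" warning */
  clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

  /* explicit local_work_size, as the warning suggests */
  size_t local[2] = {16, 16};
  clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
}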
Ok, so probably this message was generated by the autotuning code. Where is "Verification was not successful, fallback to basic kernel" in the code?