
Intel Beignet spatial convolution OpenCL compile failure

Open naibaf7 opened this issue 9 years ago • 40 comments

@gongzg Totally stuck here, tried to find out the cause for hours. Any ideas?

  • Beignet git master from today (21.07.2016)
  • Fedora 24, CLANG-3.8, LLVM-3.8, GCC-6.1.1
  • Command ./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5
  • Compiled with USE_GREENTEA, USE_LIBDNN, USE_INTEL_SPATIAL
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
*** Aborted at 1469222493 (unix time) try "date -d @1469222493" if you are using GNU date ***
PC: @     0x7fb9004d78b6 llvm::BasicBlock::getTerminator()
*** SIGSEGV (@0x30000024c) received by PID 3242 (TID 0x7fb916044a40) from PID 588; stack trace: ***
    @     0x7fb90fcffc30 (unknown)
    @     0x7fb9004d78b6 llvm::BasicBlock::getTerminator()
    @     0x7fb900425b05 llvm::LoopBase<>::getExitBlocks()
    @     0x7fb900427c65 llvm::Loop::hasDedicatedExits()
    @     0x7fb900427e16 llvm::Loop::getLoopID()
    @     0x7fb8ffde530c gbe::CustomLoopUnroll::GetUnrollMetadataValue()
    @     0x7fb8ffde5e2a gbe::CustomLoopUnroll::runOnLoop()
    @     0x7fb90043100b llvm::LPPassManager::runOnFunction()
    @     0x7fb9005e2528 llvm::FPPassManager::runOnFunction()
    @     0x7fb9003be3d7 (anonymous namespace)::CGPassManager::runOnModule()
    @     0x7fb9005e2bdd llvm::legacy::PassManagerImpl::run()
    @     0x7fb8ffde02fb gbe::runModulePass()
    @     0x7fb8ffde088e gbe::llvmToGen()
    @     0x7fb8ffd38023 gbe::Program::buildFromLLVMFile()
    @     0x7fb8fff3f8b9 gbe::genProgramNewFromLLVM()
    @     0x7fb8ffd3c7b5 gbe::programNewFromSource()
    @     0x7fb900f76509 cl_program_build
    @     0x7fb900f69d98 clBuildProgram
    @     0x7fb915a8524a viennacl::ocl::context::add_program()
    @     0x7fb915a82830 caffe::submit_conv_spatial_program()
    @     0x7fb915b55f28 caffe::ConvolutionLayerSpatial<>::setup_IDLF()
    @     0x7fb915b56a75 caffe::ConvolutionLayerSpatial<>::setup_convolution()
    @     0x7fb915b58a01 caffe::ConvolutionLayerSpatial<>::Forward_gpu()
    @     0x7fb915a51e82 caffe::Net<>::ForwardFromTo()
    @     0x7fb915a51f77 caffe::Net<>::Forward()
    @           0x414d93 time()
    @           0x40ea4e main
    @     0x7fb90f94d731 __libc_start_main
    @           0x40f389 _start
Segmentation fault (core dumped)
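
Side note on the Beignet warnings above (they are separate from the segfault): the message asks the caller to pass local_work_size explicitly. In OpenCL 1.x an explicit local size must evenly divide the global size in each dimension, so one simple trial strategy is to take the largest divisor that fits under the device limit. The helper below is only a sketch of that idea; pick_local_size is a hypothetical name and is not part of Caffe, ViennaCL, or Beignet.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from Caffe or Beignet): returns the largest value
// that divides `global` and does not exceed `max_local`. Falls back to 1,
// which is always a valid local size.
std::size_t pick_local_size(std::size_t global, std::size_t max_local) {
  assert(global > 0 && max_local > 0);
  std::size_t limit = global < max_local ? global : max_local;
  for (std::size_t candidate = limit; candidate > 1; --candidate) {
    if (global % candidate == 0) {
      return candidate;  // first hit is the largest divisor within the limit
    }
  }
  return 1;
}
```

The result would be used per dimension as the local_work_size entry; whether the largest divisor is also the fastest choice on a given GPU still has to be measured.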

naibaf7 avatar Jul 22 '16 21:07 naibaf7

@gongzg Maybe worth noting that ./build/test/test_all.testbin --gtest_filter=*Spatial* 1 passes without errors. But the actual AlexNet, which would be the interesting benchmark, fails.

naibaf7 avatar Jul 22 '16 21:07 naibaf7

@naibaf7 The error message indicates this is an LLVM-related issue. I would suggest switching to LLVM 3.6 and trying again. If you still have any issues, please let me know.

gongzg avatar Jul 23 '16 00:07 gongzg

@naibaf7 Another quick thing to try is to open the file backend/src/llvm/llvm_to_gen.cpp, find the following code, and comment out the line "MPM.add(createCustomLoopUnrollPass());". Then try again with your current LLVM version. This is not recommended, though: I doubt Beignet has been tested with this LLVM version, and I don't know whether there are other issues. Anyway, for your reference.

#if !defined(ANDROID)
  MPM.add(createCustomLoopUnrollPass()); //1024, 32, 1024, 512)); //Unroll loops
#endif

gongzg avatar Jul 23 '16 00:07 gongzg

@gongzg Ok, good to know. Interesting though that some of the code (for example, the tests) does work. I can't downgrade to llvm-3.6 or llvm-3.7 on Fedora 24 without destroying the graphics driver (X11 won't start anymore for some reason), so I'll try the "dirty fix" instead and hope Beignet development will catch up soon.

naibaf7 avatar Jul 23 '16 02:07 naibaf7

@gongzg Ok nice, I finally got it working. But the performance doesn't quite make sense to me. Do you know what could have gone wrong now?

Command: ./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=1 -iterations=5

Results (average forward pass time):

  • Default convolution engine (ViennaCL-BLAS): 2.2 seconds
  • LibDNN convolution engine (unreleased version 2): 1.4 seconds
  • Spatial convolution engine (Intel): 4.4 seconds

naibaf7 avatar Jul 23 '16 14:07 naibaf7

@naibaf7 Did you check the per-layer performance breakdown? I used to see very bad GEMM performance with either ISAAC or ViennaCL BLAS, and most of the time is spent in the convolution backward path and the FC layers. Also, I see you specified GPU device 1; do you have more than one OCL device in your system?

gongzg avatar Jul 23 '16 14:07 gongzg

@gongzg Yes, but except for the forward convolution, all variants use the same kernels on all other layers. And yes, GPU device 0 is the Intel CPU on my system.

  • Default convolution engine (clBLAS 2.4): 2.7 seconds, so a bit slower than ViennaCL-BLAS.
I0723 16:36:23.906051 30903 common.cpp:373] Total devices: 2
I0723 16:36:23.906175 30903 common.cpp:374] CUDA devices: 0
I0723 16:36:23.906180 30903 common.cpp:375] OpenCL devices: 2
I0723 16:36:23.906185 30903 common.cpp:399] Device id:                     0
I0723 16:36:23.906189 30903 common.cpp:401] Device backend:                OpenCL
I0723 16:36:23.906213 30903 common.cpp:403] Backend details:               Intel(R) Corporation: OpenCL 1.2 LINUX
I0723 16:36:23.906244 30903 common.cpp:405] Device vendor:                 Intel(R) Corporation
I0723 16:36:23.906265 30903 common.cpp:407] Name:                          Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
I0723 16:36:23.906518 30903 common.cpp:409] Total global memory:           7917195264
I0723 16:36:23.906525 30903 common.cpp:399] Device id:                     1
I0723 16:36:23.906529 30903 common.cpp:401] Device backend:                OpenCL
I0723 16:36:23.906535 30903 common.cpp:403] Backend details:               Intel: OpenCL 1.2 beignet 1.2 (git-b55060c)
I0723 16:36:23.906539 30903 common.cpp:405] Device vendor:                 Intel
I0723 16:36:23.906543 30903 common.cpp:407] Name:                          Intel(R) HD Graphics Skylake ULT GT3
I0723 16:36:23.906565 30903 common.cpp:409] Total global memory:           3958374400

naibaf7 avatar Jul 23 '16 14:07 naibaf7

@naibaf7 Could you share the average forward time and backward time? The backward time is really slow. For example, on my BDW GT2 machine, what I got from benchmark64.prototxt is: average forward pass: 832 ms, average backward pass: 3834 ms.

I believe libDNN engine should be much faster at backward pass.

gongzg avatar Jul 23 '16 14:07 gongzg

@naibaf7 I just did a test on a BDW GT3e machine and got the following performance numbers with the spatial convolution engine: average forward time: 277.8 ms, average backward time: 1139 ms. This GPU should be very close to yours, but I'm using the OpenCL SDK. I will find a SKL GT3 machine next week and run some tests with Beignet.

gongzg avatar Jul 23 '16 14:07 gongzg

@gongzg That's interesting, so the BDW GT2 is faster than the Skylake? Which engine did you use with the numbers you posted there? ViennaCL BLAS per layer:

I0723 16:53:57.989781   658 caffe.cpp:450] Average time per layer: 
I0723 16:53:57.989796   658 caffe.cpp:453]       data   forward: 0.100127 ms.
I0723 16:53:57.989814   658 caffe.cpp:456]       data   backward: 0.097255 ms.
I0723 16:53:57.989830   658 caffe.cpp:453]      label   forward: 0.0986646 ms.
I0723 16:53:57.989845   658 caffe.cpp:456]      label   backward: 0.116754 ms.
I0723 16:53:57.989859   658 caffe.cpp:453]      conv1   forward: 197.391 ms.
I0723 16:53:57.989874   658 caffe.cpp:456]      conv1   backward: 339.864 ms.
I0723 16:53:57.989889   658 caffe.cpp:453]      relu1   forward: 7.03235 ms.
I0723 16:53:57.989902   658 caffe.cpp:456]      relu1   backward: 13.2872 ms.
I0723 16:53:57.989917   658 caffe.cpp:453]      norm1   forward: 28.9102 ms.
I0723 16:53:57.989931   658 caffe.cpp:456]      norm1   backward: 40.849 ms.
I0723 16:53:57.989946   658 caffe.cpp:453]      pool1   forward: 6.65354 ms.
I0723 16:53:57.989959   658 caffe.cpp:456]      pool1   backward: 16.7646 ms.
I0723 16:53:57.989974   658 caffe.cpp:453]      conv2   forward: 447.242 ms.
I0723 16:53:57.989987   658 caffe.cpp:456]      conv2   backward: 833.264 ms.
I0723 16:53:57.990001   658 caffe.cpp:453]      relu2   forward: 3.9086 ms.
I0723 16:53:57.990015   658 caffe.cpp:456]      relu2   backward: 7.75015 ms.
I0723 16:53:57.990030   658 caffe.cpp:453]      norm2   forward: 17.3098 ms.
I0723 16:53:57.990042   658 caffe.cpp:456]      norm2   backward: 23.2308 ms.
I0723 16:53:57.990057   658 caffe.cpp:453]      pool2   forward: 3.84905 ms.
I0723 16:53:57.990070   658 caffe.cpp:456]      pool2   backward: 10.9517 ms.
I0723 16:53:57.990084   658 caffe.cpp:453]      conv3   forward: 314.533 ms.
I0723 16:53:57.990099   658 caffe.cpp:456]      conv3   backward: 488.007 ms.
I0723 16:53:57.990113   658 caffe.cpp:453]      relu3   forward: 1.30731 ms.
I0723 16:53:57.990128   658 caffe.cpp:456]      relu3   backward: 2.15554 ms.
I0723 16:53:57.990159   658 caffe.cpp:453]      conv4   forward: 317.564 ms.
I0723 16:53:57.990175   658 caffe.cpp:456]      conv4   backward: 412.875 ms.
I0723 16:53:57.990190   658 caffe.cpp:453]      relu4   forward: 1.27252 ms.
I0723 16:53:57.990203   658 caffe.cpp:456]      relu4   backward: 1.99165 ms.
I0723 16:53:57.990216   658 caffe.cpp:453]      conv5   forward: 291.408 ms.
I0723 16:53:57.990231   658 caffe.cpp:456]      conv5   backward: 303.204 ms.
I0723 16:53:57.990242   658 caffe.cpp:453]      relu5   forward: 0.952221 ms.
I0723 16:53:57.990252   658 caffe.cpp:456]      relu5   backward: 1.61467 ms.
I0723 16:53:57.990262   658 caffe.cpp:453]      pool5   forward: 1.00883 ms.
I0723 16:53:57.990278   658 caffe.cpp:456]      pool5   backward: 2.92136 ms.
I0723 16:53:57.990291   658 caffe.cpp:453]        fc6   forward: 45.956 ms.
I0723 16:53:57.990309   658 caffe.cpp:456]        fc6   backward: 115.107 ms.
I0723 16:53:57.990329   658 caffe.cpp:453]      relu6   forward: 0.345521 ms.
I0723 16:53:57.990345   658 caffe.cpp:456]      relu6   backward: 0.37321 ms.
I0723 16:53:57.990361   658 caffe.cpp:453]      drop6   forward: 5.01316 ms.
I0723 16:53:57.990378   658 caffe.cpp:456]      drop6   backward: 0.438744 ms.
I0723 16:53:57.990397   658 caffe.cpp:453]        fc7   forward: 20.8769 ms.
I0723 16:53:57.990417   658 caffe.cpp:456]        fc7   backward: 52.161 ms.
I0723 16:53:57.990435   658 caffe.cpp:453]      relu7   forward: 0.306402 ms.
I0723 16:53:57.990453   658 caffe.cpp:456]      relu7   backward: 0.348346 ms.
I0723 16:53:57.990468   658 caffe.cpp:453]      drop7   forward: 3.98771 ms.
I0723 16:53:57.990483   658 caffe.cpp:456]      drop7   backward: 0.354259 ms.
I0723 16:53:57.990494   658 caffe.cpp:453]        fc8   forward: 9.74994 ms.
I0723 16:53:57.990505   658 caffe.cpp:456]        fc8   backward: 13.3348 ms.
I0723 16:53:57.990514   658 caffe.cpp:453]       loss   forward: 1.94023 ms.
I0723 16:53:57.990525   658 caffe.cpp:456]       loss   backward: 0.461751 ms.
I0723 16:53:57.990648   658 caffe.cpp:461] Average Forward pass: 1738.39 ms.
I0723 16:53:57.990667   658 caffe.cpp:463] Average Backward pass: 2691.41 ms.
I0723 16:53:57.990722   658 caffe.cpp:465] Average Forward-Backward: 4431.83 ms.
I0723 16:53:57.990741   658 caffe.cpp:467] Total Time: 22159.2 ms.

Intel spatial:

I0723 17:02:54.607259  1730 caffe.cpp:450] Average time per layer: 
I0723 17:02:54.607276  1730 caffe.cpp:453]       data   forward: 0.116164 ms.
I0723 17:02:54.607296  1730 caffe.cpp:456]       data   backward: 0.123356 ms.
I0723 17:02:54.607314  1730 caffe.cpp:453]      label   forward: 0.107694 ms.
I0723 17:02:54.607331  1730 caffe.cpp:456]      label   backward: 0.188441 ms.
I0723 17:02:54.607347  1730 caffe.cpp:453]      conv1   forward: 443.612 ms.
I0723 17:02:54.607363  1730 caffe.cpp:456]      conv1   backward: 427.156 ms.
I0723 17:02:54.607379  1730 caffe.cpp:453]      relu1   forward: 8.7127 ms.
I0723 17:02:54.607419  1730 caffe.cpp:456]      relu1   backward: 15.2398 ms.
I0723 17:02:54.607455  1730 caffe.cpp:453]      norm1   forward: 41.9368 ms.
I0723 17:02:54.607475  1730 caffe.cpp:456]      norm1   backward: 62.7724 ms.
I0723 17:02:54.607496  1730 caffe.cpp:453]      pool1   forward: 9.26116 ms.
I0723 17:02:54.607522  1730 caffe.cpp:456]      pool1   backward: 28.762 ms.
I0723 17:02:54.607568  1730 caffe.cpp:453]      conv2   forward: 1657.64 ms.
I0723 17:02:54.607631  1730 caffe.cpp:456]      conv2   backward: 1108.42 ms.
I0723 17:02:54.607692  1730 caffe.cpp:453]      relu2   forward: 7.24185 ms.
I0723 17:02:54.607743  1730 caffe.cpp:456]      relu2   backward: 10.7396 ms.
I0723 17:02:54.607791  1730 caffe.cpp:453]      norm2   forward: 28.8983 ms.
I0723 17:02:54.607834  1730 caffe.cpp:456]      norm2   backward: 36.666 ms.
I0723 17:02:54.607883  1730 caffe.cpp:453]      pool2   forward: 4.96558 ms.
I0723 17:02:54.607934  1730 caffe.cpp:456]      pool2   backward: 17.8944 ms.
I0723 17:02:54.608018  1730 caffe.cpp:453]      conv3   forward: 835.374 ms.
I0723 17:02:54.608065  1730 caffe.cpp:456]      conv3   backward: 658.829 ms.
I0723 17:02:54.608108  1730 caffe.cpp:453]      relu3   forward: 1.92744 ms.
I0723 17:02:54.608126  1730 caffe.cpp:456]      relu3   backward: 4.70381 ms.
I0723 17:02:54.608196  1730 caffe.cpp:453]      conv4   forward: 807.812 ms.
I0723 17:02:54.608238  1730 caffe.cpp:456]      conv4   backward: 568.898 ms.
I0723 17:02:54.608268  1730 caffe.cpp:453]      relu4   forward: 3.42795 ms.
I0723 17:02:54.608296  1730 caffe.cpp:456]      relu4   backward: 3.19371 ms.
I0723 17:02:54.608325  1730 caffe.cpp:453]      conv5   forward: 625.251 ms.
I0723 17:02:54.608355  1730 caffe.cpp:456]      conv5   backward: 432.836 ms.
I0723 17:02:54.608387  1730 caffe.cpp:453]      relu5   forward: 1.42619 ms.
I0723 17:02:54.608417  1730 caffe.cpp:456]      relu5   backward: 3.28224 ms.
I0723 17:02:54.608445  1730 caffe.cpp:453]      pool5   forward: 1.41549 ms.
I0723 17:02:54.608475  1730 caffe.cpp:456]      pool5   backward: 3.6956 ms.
I0723 17:02:54.608507  1730 caffe.cpp:453]        fc6   forward: 67.7576 ms.
I0723 17:02:54.608623  1730 caffe.cpp:456]        fc6   backward: 158.356 ms.
I0723 17:02:54.608661  1730 caffe.cpp:453]      relu6   forward: 0.383532 ms.
I0723 17:02:54.608695  1730 caffe.cpp:456]      relu6   backward: 0.39943 ms.
I0723 17:02:54.608728  1730 caffe.cpp:453]      drop6   forward: 5.45477 ms.
I0723 17:02:54.608758  1730 caffe.cpp:456]      drop6   backward: 0.501933 ms.
I0723 17:02:54.608789  1730 caffe.cpp:453]        fc7   forward: 36.5435 ms.
I0723 17:02:54.608824  1730 caffe.cpp:456]        fc7   backward: 73.022 ms.
I0723 17:02:54.608857  1730 caffe.cpp:453]      relu7   forward: 0.376915 ms.
I0723 17:02:54.608889  1730 caffe.cpp:456]      relu7   backward: 0.325761 ms.
I0723 17:02:54.608927  1730 caffe.cpp:453]      drop7   forward: 4.93873 ms.
I0723 17:02:54.608959  1730 caffe.cpp:456]      drop7   backward: 0.372964 ms.
I0723 17:02:54.609012  1730 caffe.cpp:453]        fc8   forward: 14.0754 ms.
I0723 17:02:54.609050  1730 caffe.cpp:456]        fc8   backward: 22.107 ms.
I0723 17:02:54.609099  1730 caffe.cpp:453]       loss   forward: 2.38256 ms.
I0723 17:02:54.609174  1730 caffe.cpp:456]       loss   backward: 0.419859 ms.
I0723 17:02:54.609408  1730 caffe.cpp:461] Average Forward pass: 4624.77 ms.
I0723 17:02:54.609439  1730 caffe.cpp:463] Average Backward pass: 3652.86 ms.
I0723 17:02:54.609529  1730 caffe.cpp:465] Average Forward-Backward: 8282.62 ms.
I0723 17:02:54.609557  1730 caffe.cpp:467] Total Time: 41413.1 ms.

The numbers are vastly different from yours, so I believe there must be something wrong.
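
To make the comparison concrete, here is a small sketch that sums the conv1 to conv5 forward times from the two logs above and relates them to the reported "Average Forward pass" totals. The timing values are copied from the logs; the struct and function names are only illustrative.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Illustrative summary type: total conv forward time and its share of the
// whole forward pass.
struct ForwardSummary {
  double conv_ms;
  double conv_share;
};

ForwardSummary summarize(const std::vector<double>& conv_forward_ms,
                         double total_forward_ms) {
  assert(total_forward_ms > 0.0);
  double conv = std::accumulate(conv_forward_ms.begin(),
                                conv_forward_ms.end(), 0.0);
  return {conv, conv / total_forward_ms};
}

// conv1..conv5 forward times in ms, copied from the two logs above.
const std::vector<double> kViennaCLConvFwd = {197.391, 447.242, 314.533,
                                              317.564, 291.408};
const std::vector<double> kSpatialConvFwd = {443.612, 1657.64, 835.374,
                                             807.812, 625.251};
```

Calling summarize(kViennaCLConvFwd, 1738.39) and summarize(kSpatialConvFwd, 4624.77) gives about 1568 ms versus about 4370 ms of forward convolution time, which is roughly 90% and 94% of the respective forward passes. So the spatial run spends roughly 2.8 times as long in forward convolution, and that accounts for most of the gap.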

naibaf7 avatar Jul 23 '16 15:07 naibaf7

@naibaf7 Oh, definitely not. Your SKL machine should be much faster than my GT2 machine, and should be comparable to the GT3e machine or even faster. From the log you pasted above:

I0723 17:02:54.607347  1730 caffe.cpp:453]      conv1   forward: 443.612 ms.
I0723 17:02:54.607363  1730 caffe.cpp:456]      conv1   backward: 427.156 ms.

I highly doubt that you were really using the spatial engine. You can uncomment the following line in the spatial convolution source code:

// #define dbg

Then, please remove .spatialkernels/* and re-run the benchmark. It will show the tuning process and print GFLOPS for each tuned kernel and the final winner kernel.
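
For anyone who wants to sanity-check the printed GFLOPS figures against the per-layer timings, here is a rough sketch of the usual estimate: two floating-point operations per multiply-accumulate, divided by the measured time. The names ConvShape, conv_forward_flops, and gflops are made up for illustration, and the debug output may count operations differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a convolution layer (illustration only).
struct ConvShape {
  std::int64_t batch;
  std::int64_t in_channels;
  std::int64_t out_channels;
  std::int64_t groups;
  std::int64_t kernel_h;
  std::int64_t kernel_w;
  std::int64_t out_h;
  std::int64_t out_w;
};

// Forward FLOPs, counting one multiply-accumulate as 2 operations.
double conv_forward_flops(const ConvShape& s) {
  assert(s.groups > 0 && s.in_channels % s.groups == 0);
  double outputs = static_cast<double>(s.batch) * s.out_channels *
                   s.out_h * s.out_w;
  double macs_per_output = static_cast<double>(s.in_channels / s.groups) *
                           s.kernel_h * s.kernel_w;
  return 2.0 * outputs * macs_per_output;
}

// Converts a FLOP count and a time in milliseconds to GFLOPS.
double gflops(double flops, double milliseconds) {
  assert(milliseconds > 0.0);
  return flops / (milliseconds * 1e6);
}
```

As an example, assuming the standard AlexNet conv2 shape with batch 64 (ConvShape{64, 96, 256, 2, 5, 5, 27, 27}, which is an assumption and not taken from this thread), the layer needs about 28.7 GFLOP per forward pass. The 1657.64 ms reported above would then correspond to roughly 17 GFLOPS.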

gongzg avatar Jul 23 '16 15:07 gongzg

@gongzg It should be using the spatial kernels, as it had a really long time tuning on the first run. But here we go, the output does not look good:

Verification was not successful, fallback to basic kernel
Bechmarking kernel: U5_5_96_2_1_1_1_31_31_64_2_128_1_1_1_1_BASIC
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
    Estimated Gflops:28.6535
    Estimated GFLOPS/S: 21.8862
Convolution Time:1309.2

naibaf7 avatar Jul 23 '16 15:07 naibaf7

@naibaf7 Thanks for the log. And now I know the reason:

Verification was not successful, fallback to basic kernel
Bechmarking kernel: U5_5_96_2_1_1_1_31_31_64_2_128_1_1_1_1_BASIC

Beignet is broken on your system: it can't produce correct results with the optimized spatial kernel, so Caffe falls back to the naive basic kernel. That's why you get such poor performance numbers. We may need the Beignet team's support again to find out why your Beignet is broken.
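As a quick sanity check on the figures at the end of the log, here is a minimal Python sketch. It assumes that "Convolution Time" is reported in milliseconds and that "Estimated Gflops" is the amount of work in GFLOP; the log does not state either unit. `gflops_per_second` is a hypothetical helper, not part of Caffe.

```python
def gflops_per_second(gflop, time_ms):
    """Throughput = work / time, with the time converted from ms to s."""
    return gflop / (time_ms / 1000.0)

# Values copied from the log: Estimated Gflops and Convolution Time.
rate = gflops_per_second(28.6535, 1309.2)
print(f"{rate:.4f} GFLOP/s")
```

Under those assumptions the result agrees with the "Estimated GFLOPS/S" line in the log to about three decimal places, so the three numbers appear to be mutually consistent.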

gongzg avatar Jul 23 '16 15:07 gongzg

@gongzg Yeah, it's much more difficult to get to work than I thought it would be... Do you know what the status is on Skylake (Iris Pro) beignet? Basically all Caffe tests pass fine, LibDNN verifies correctly, no issues there (with both beignet-1.1.1 that comes with Fedora 24 and the beignet-1.2 that I compiled from the current beignet-master). But the intel spatial convolution does not pass verification with this setup so far. From what I can tell, the difference is that the spatial convolution uses Intel specific extensions, and these do not seem to work?
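One way to narrow this down is to compare the extensions a device advertises with the ones a kernel needs. The Python sketch below assumes that `cl_intel_subgroups` is the extension in question (the thread does not name it) and uses the extension string from the clinfo output posted further down in this thread, which comes from a different Skylake machine. `missing_extensions` is a hypothetical helper.

```python
def missing_extensions(advertised, required):
    """Return the required extension names absent from a clinfo-style string."""
    have = set(advertised.split())
    return [ext for ext in required if ext not in have]

# Platform Extensions string as quoted later in the thread.
advertised = (
    "cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics "
    "cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics "
    "cl_khr_byte_addressable_store cl_khr_3d_image_writes "
    "cl_khr_image2d_from_buffer cl_khr_spir cl_khr_icd cl_intel_accelerator "
    "cl_intel_motion_estimation cl_intel_subgroups"
)

# Assumption: the spatial kernels depend on cl_intel_subgroups.
print(missing_extensions(advertised, ["cl_intel_subgroups"]))
```

An advertised extension only shows that the driver claims support; it does not prove the implementation returns correct results, which is what the failed verification suggests.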

naibaf7 avatar Jul 23 '16 15:07 naibaf7

@naibaf7 See the devices list https://cgit.freedesktop.org/beignet/tree/src/cl_device_id.c

bhack avatar Jul 23 '16 15:07 bhack

Intel extensions in beignet are in https://cgit.freedesktop.org/beignet/tree/include/CL/cl_intel.h

bhack avatar Jul 23 '16 16:07 bhack

@bhack Yeah, it's "Intel(R) HD Graphics Skylake ULT GT3", which is in the list, so it should be fine. Thanks.

naibaf7 avatar Jul 23 '16 16:07 naibaf7

I think the problem could be in the libdrm and kernel versions. Which versions of both are you using?

bhack avatar Jul 23 '16 16:07 bhack

Kernel: 4.6.4-301.fc24.x86_64

Libdrm:

  • Package libdrm-devel-2.4.68-1.fc24.x86_64
  • Package libdrm-debuginfo-2.4.61-3.fc22.x86_64
  • Package libdrm-2.4.68-1.fc24.x86_64
  • Package libdrm-2.4.68-1.fc24.i686

naibaf7 avatar Jul 23 '16 16:07 naibaf7

Hmm. Could you add a print of fixed_local_sz[i] inside the loop, before the modulo? See https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3031

bhack avatar Jul 23 '16 16:07 bhack

@gongzg I now tested this with 3 different versions of LLVM and Clang as you suggested:

  • 3.6.2 built from scratch/source
  • 3.7.4 from the Fedora 23 update repository
  • 3.8.x from the Fedora 24 update repository

Cleaned out the .spatialkernels folder for every test, but same result.

Driver is xorg-x11-drv-intel-2.99.917-23.20160512.fc24.x86_64 by the way. Could that be the issue?

naibaf7 avatar Jul 24 '16 04:07 naibaf7

Have you tried to debug/print that loop?

bhack avatar Jul 24 '16 07:07 bhack

I don't know if this Beignet Workgroup guide is still valid.

bhack avatar Jul 24 '16 11:07 bhack

@bhack Oh sorry I missed your comment on printing the loop, I will follow up on that.

naibaf7 avatar Jul 24 '16 13:07 naibaf7

It is important to check whether realGroupSize *= fixed_local_sz[i]; accumulates correctly. If you compiled with debug symbols, you can also check this with gdb breakpoints.
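To illustrate what such a trace would show, here is a small Python model of a fallback that picks a local size per dimension and accumulates the product. This is a sketch, not Beignet's code: only the names `fixed_local_sz` and `realGroupSize` come from this thread, while the divisor search, the limits `max_dim_size=64` and `max_group_size=256`, and the "multiple of 8" warning condition are assumptions made for illustration.

```python
def pick_local_sizes(global_work_size, local_work_size=None,
                     max_dim_size=64, max_group_size=256, multiple=8):
    """Return (fixed_local_sz, warn) for a hypothetical fallback heuristic."""
    if local_work_size is not None:
        # Caller supplied explicit sizes: the fallback search is skipped.
        return list(local_work_size), False

    fixed_local_sz = [1] * len(global_work_size)
    real_group_size = 1  # plays the role of realGroupSize
    for i, g in enumerate(global_work_size):
        # Take the largest divisor of the global size that still fits.
        for j in range(max_dim_size, 1, -1):
            if g % j == 0 and j <= max_group_size:
                fixed_local_sz[i] = j
                max_group_size //= j
                max_dim_size = min(max_dim_size, max_group_size)
                break
        real_group_size *= fixed_local_sz[i]
        print(f"dim {i}: fixed_local_sz={fixed_local_sz[i]} "
              f"realGroupSize={real_group_size}")

    # Assumed condition for the "unable to find good values" warning.
    warn = real_group_size % multiple != 0
    return fixed_local_sz, warn

print(pick_local_sizes([1000]))        # no local size given, odd divisor
print(pick_local_sizes([1024, 1024]))  # no local size given, power of two
print(pick_local_sizes([31, 31, 64], local_work_size=[1, 1, 16]))
```

In this model the warning can only fire when no local size is passed, which is the property the print or gdb breakpoint would confirm or refute in the real code.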

bhack avatar Jul 24 '16 14:07 bhack

@bhack For AlexNet, the Intel spatial convolution kernel always uses a 1,1,16 group size, which is valid for Beignet. @naibaf7 I tested benchmark64 on a SKL GT2 machine with the latest git master Beignet and LLVM 3.6 and got the following result:

I0725 04:35:59.524305 32761 caffe.cpp:448] Average time per layer:
I0725 04:35:59.524317 32761 caffe.cpp:451] data forward: 0.088519 ms.
I0725 04:35:59.524334 32761 caffe.cpp:454] data backward: 0.0850156 ms.
I0725 04:35:59.524350 32761 caffe.cpp:451] label forward: 0.0851441 ms.
I0725 04:35:59.524363 32761 caffe.cpp:454] label backward: 0.118466 ms.
I0725 04:35:59.524376 32761 caffe.cpp:451] conv1 forward: 58.4319 ms.
I0725 04:35:59.524392 32761 caffe.cpp:454] conv1 backward: 329.87 ms.
I0725 04:35:59.524408 32761 caffe.cpp:451] relu1 forward: 6.13321 ms.
I0725 04:35:59.524421 32761 caffe.cpp:454] relu1 backward: 9.00438 ms.
I0725 04:35:59.524435 32761 caffe.cpp:451] norm1 forward: 31.4477 ms.
I0725 04:35:59.524451 32761 caffe.cpp:454] norm1 backward: 38.0522 ms.
I0725 04:35:59.524463 32761 caffe.cpp:451] pool1 forward: 7.27156 ms.
I0725 04:35:59.524477 32761 caffe.cpp:454] pool1 backward: 25.028 ms.
I0725 04:35:59.524490 32761 caffe.cpp:451] conv2 forward: 186.484 ms.
I0725 04:35:59.524507 32761 caffe.cpp:454] conv2 backward: 1686.54 ms.
I0725 04:35:59.524520 32761 caffe.cpp:451] relu2 forward: 3.97442 ms.
I0725 04:35:59.524533 32761 caffe.cpp:454] relu2 backward: 5.8414 ms.
I0725 04:35:59.524545 32761 caffe.cpp:451] norm2 forward: 19.9107 ms.
I0725 04:35:59.524560 32761 caffe.cpp:454] norm2 backward: 23.4814 ms.
I0725 04:35:59.524574 32761 caffe.cpp:451] pool2 forward: 4.57914 ms.
I0725 04:35:59.524586 32761 caffe.cpp:454] pool2 backward: 16.3685 ms.
I0725 04:35:59.524600 32761 caffe.cpp:451] conv3 forward: 68.6992 ms.
I0725 04:35:59.524616 32761 caffe.cpp:454] conv3 backward: 628.469 ms.
I0725 04:35:59.524629 32761 caffe.cpp:451] relu3 forward: 1.4288 ms.
I0725 04:35:59.524641 32761 caffe.cpp:454] relu3 backward: 2.28515 ms.
I0725 04:35:59.524654 32761 caffe.cpp:451] conv4 forward: 55.6638 ms.
I0725 04:35:59.524669 32761 caffe.cpp:454] conv4 backward: 512.247 ms.
I0725 04:35:59.524683 32761 caffe.cpp:451] relu4 forward: 1.46054 ms.
I0725 04:35:59.524695 32761 caffe.cpp:454] relu4 backward: 2.3425 ms.
I0725 04:35:59.524708 32761 caffe.cpp:451] conv5 forward: 38.6343 ms.
I0725 04:35:59.524724 32761 caffe.cpp:454] conv5 backward: 365.608 ms.
I0725 04:35:59.524739 32761 caffe.cpp:451] relu5 forward: 0.998164 ms.
I0725 04:35:59.524751 32761 caffe.cpp:454] relu5 backward: 1.74181 ms.
I0725 04:35:59.524765 32761 caffe.cpp:451] pool5 forward: 1.24395 ms.
I0725 04:35:59.524777 32761 caffe.cpp:454] pool5 backward: 3.99459 ms.
I0725 04:35:59.524790 32761 caffe.cpp:451] fc6 forward: 68.0091 ms.
I0725 04:35:59.524806 32761 caffe.cpp:454] fc6 backward: 153.708 ms.
I0725 04:35:59.524821 32761 caffe.cpp:451] relu6 forward: 0.352468 ms.
I0725 04:35:59.524834 32761 caffe.cpp:454] relu6 backward: 0.365035 ms.
I0725 04:35:59.524847 32761 caffe.cpp:451] drop6 forward: 4.93038 ms.
I0725 04:35:59.524860 32761 caffe.cpp:454] drop6 backward: 0.368603 ms.
I0725 04:35:59.524879 32761 caffe.cpp:451] fc7 forward: 29.1046 ms.
I0725 04:35:59.524902 32761 caffe.cpp:454] fc7 backward: 69.5503 ms.
I0725 04:35:59.524927 32761 caffe.cpp:451] relu7 forward: 0.271047 ms.
I0725 04:35:59.524941 32761 caffe.cpp:454] relu7 backward: 0.311824 ms.
I0725 04:35:59.524955 32761 caffe.cpp:451] drop7 forward: 3.00953 ms.
I0725 04:35:59.524968 32761 caffe.cpp:454] drop7 backward: 0.337674 ms.
I0725 04:35:59.524981 32761 caffe.cpp:451] fc8 forward: 9.80128 ms.
I0725 04:35:59.524994 32761 caffe.cpp:454] fc8 backward: 17.2192 ms.
I0725 04:35:59.525054 32761 caffe.cpp:451] loss forward: 1.44299 ms.
I0725 04:35:59.525071 32761 caffe.cpp:454] loss backward: 0.307409 ms.
I0725 04:35:59.525177 32761 caffe.cpp:459] Average Forward pass: 606.389 ms.
I0725 04:35:59.525205 32761 caffe.cpp:461] Average Backward pass: 3901.76 ms.
I0725 04:35:59.525254 32761 caffe.cpp:463] Average Forward-Backward: 4509.35 ms.
I0725 04:35:59.525277 32761 caffe.cpp:465] Total Time: 45093.5 ms.
I0725 04:35:59.525291 32761 caffe.cpp:466] *** Benchmark ends ***

The clinfo:

Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 1.2 beignet 1.2 (git-b55060c)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_motion_estimation cl_intel_subgroups
Platform Extensions function suffix Intel

Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics Skylake Desktop GT2
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.2 (git-b55060c)
Driver Version 1.2
Device OpenCL C Version OpenCL C 1.2 beignet 1.2 (git-b55060c)
Device Type GPU
Device Profile FULL_PROFILE
Max compute units 24
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16

Kernel information: Linux gongzg-skl 4.6.2-040602-generic #201606100516 SMP Fri Jun 10 09:18:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

So it seems that beignet works fine with some SKL platforms under the above configurations. I will work with beignet team to try to reproduce your environment and issues.
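The claim that a 1,1,16 group size is valid can be checked against the clinfo limits quoted above. The Python sketch below applies the usual OpenCL 1.x rules: each local size must not exceed the per-dimension limit, the work-group product must not exceed the maximum work-group size, and each global size must be a multiple of the local size. The global size (31, 31, 64) is a hypothetical value chosen for illustration, and `local_size_problems` is a hypothetical helper.

```python
from math import prod

def local_size_problems(global_size, local_size, max_item_sizes, max_group_size):
    """List the ways a local work size violates the given device limits."""
    problems = []
    for dim, (g, l, limit) in enumerate(zip(global_size, local_size, max_item_sizes)):
        if l > limit:
            problems.append(f"dim {dim}: local size {l} exceeds limit {limit}")
        if g % l != 0:
            problems.append(f"dim {dim}: global size {g} is not a multiple of {l}")
    if prod(local_size) > max_group_size:
        problems.append(
            f"work-group of {prod(local_size)} items exceeds {max_group_size}")
    return problems

# Limits from the clinfo above: 512x512x512 work item sizes, 512 per group.
limits = (512, 512, 512)
print(local_size_problems((31, 31, 64), (1, 1, 16), limits, 512))
print(local_size_problems((64, 64, 64), (8, 8, 16), limits, 512))
```

Note that a particular kernel can have a lower work-group limit than the device-wide maximum, so passing this check is necessary rather than sufficient.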

gongzg avatar Jul 25 '16 03:07 gongzg

@naibaf7 could you share the latest clinfo of your machine here? I saw the clinfo (clinfo_after) you sent to me last week, there is one clover device and one Intel CPU device.

gongzg avatar Jul 25 '16 03:07 gongzg

@gongzg How can execution reach https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3036 if local_work_size is not NULL?

bhack avatar Jul 25 '16 06:07 bhack

@bhack Those output messages should not come from the spatial convolution kernel; they should come from some other kernels. The spatial convolution kernels don't pass a NULL local work size.

gongzg avatar Jul 25 '16 06:07 gongzg

OK, so this message was probably generated by the autotuning code. Where is "Verification was not successful, fallback to basic kernel" in the code?

bhack avatar Jul 25 '16 07:07 bhack