clBuildProgram segfaults when building libDNN kernels on Snapdragon 835

I've encountered segfaults in Caffe with libDNN on a Snapdragon 835-powered smartphone.

Caffe successfully builds the main Greentea kernels:

ViennaCL: Adding new queue for device 0x7f9bdaaac8 to context 0x7f9a648280
ViennaCL: Context no. 0 initialized with 1 devices
ViennaCL: Device id: 0x7f9bdaaac8
I0706 08:49:19.158033 13253 device.cpp:62] CL_DEVICE_HOST_UNIFIED_MEMORY: 1
ViennaCL: Adding program 'kernel_program' with source to context 0x7f9a648280
ViennaCL: clCreateProgramWithSource
ViennaCL: source_text (100 out of 337011 bytes):
#define ENABLE_DOUBLE_SUPPORT
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define _
ViennaCL: clCreateProgramWithSource returned 0
ViennaCL: clBuildProgram options:
ViennaCL: clBuildProgram returned 0
ViennaCL: clBuildProgram err==CL_SUCCESS
ViennaCL: clCreateKernelsInProgram

After the net is initialized, however, Caffe attempts to build the libDNN kernels and segfaults in clBuildProgram:

I0706 08:49:20.207176 13253 layer_factory.cpp:67] Creating layer input
I0706 08:49:20.207232 13253 net.cpp:96] Creating Layer input
I0706 08:49:20.207242 13253 net.cpp:413] input -> data
I0706 08:49:20.207264 13253 net.cpp:134] Setting up input
I0706 08:49:20.207273 13253 net.cpp:142] Top shape: 1 3 227 227 (154587)
I0706 08:49:20.207284 13253 layer_factory.cpp:67] Creating layer conv1
I0706 08:49:20.207298 13253 net.cpp:96] Creating Layer conv1
I0706 08:49:20.207304 13253 net.cpp:444] conv1 <- data
I0706 08:49:20.207311 13253 net.cpp:413] conv1 -> conv1
I0706 08:49:20.207458 13253 libdnn_conv.cpp:21] LibDNNConv<Dtype>::LibDNNConv::1
I0706 08:49:20.207610 13253 libdnn_conv.cpp:1622] LibDNNConv<Dtype>::GenerateKernels::1
I0706 08:49:20.207998 13253 libdnn_conv.cpp:1635] LibDNNConv<Dtype>::GenerateKernels::2
I0706 08:49:20.208006 13253 libdnn.cpp:218] LibDNN<Dtype>::CompileKernels::1
I0706 08:49:20.208012 13253 libdnn.cpp:229] LibDNN<Dtype>::CompileKernels::2
I0706 08:49:20.208189 13253 libdnn.cpp:236] LibDNN<Dtype>::CompileKernels::3
I0706 08:49:20.208197 13253 libdnn.cpp:241] LibDNN<Dtype>::CompileKernels::4
I0706 08:49:20.208202 13253 libdnn.cpp:258] LibDNN<Dtype>::CompileKernelsOpenCL::1
I0706 08:49:20.208209 13253 libdnn.cpp:272] LibDNN<Dtype>::CompileKernelsOpenCL::2
ViennaCL: Adding program 'kernel_program' with source to context 0x7f9a648280
ViennaCL: clCreateProgramWithSource
ViennaCL: source_text (100 out of 23990 bytes):
#if defined(cl_khr_int32_base_atomics)
#pragma OPENCL EXTENSION cl_khr_int32_base_atomics : enable
#
ViennaCL: clCreateProgramWithSource returned 0
ViennaCL: clBuildProgram options: -cl-fast-relaxed-math -cl-mad-enable -cl-single-precision-constant
Segmentation fault

Jul 06 '17 08:07 psyhtest

This can be reproduced as follows.

Prerequisite: install CK-Caffe.

1. Install the ViennaCL master package:

$ ck install package:lib-viennacl-master-src --target_os=android21-arm64
...
Environment entry added (c3153bc343548550)!

Recording CK configuration to /home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64/ck-install.json ...

Installation path: /home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64

Installation time: 3.81807088852 sec.

$ ck load env:c3153bc343548550 | grep path_include
      "path_include": "/home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64/src",

2. Apply the ViennaCL patch to the sources to enable more verbose debug output:

$ patch \
-d /home/anton/CK_TOOLS/lib-viennacl-src-master-android21-arm64/src \
-p1 < ~/Downloads/issue69.viennacl.patch.txt
patching file viennacl/ocl/context.hpp

3. Apply the ViennaCL meta patch to the metadata to disable ViennaCL kernel caching:

$ ck find env:c3153bc343548550
/home/anton/CK_REPOS/local/env/c3153bc343548550
$ patch \
  /home/anton/CK_REPOS/local/env/c3153bc343548550/.cm/meta.json \
  ~/Downloads/issue69.viennacl-meta.patch.txt
patching file /home/anton/CK_REPOS/local/env/c3153bc343548550/.cm/meta.json

4. Install the Caffe with libDNN+ViennaCL package (using the ViennaCL environment installed in steps 1-3):

$ ck install package:lib-caffe-bvlc-opencl-libdnn-viennacl-universal --target_os=android21-arm64
...
Environment entry added (69031a24319b37f1)!

Recording CK configuration to /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64/ck-install.json ...

Installation path: /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64

Installation time: 198.47616601 sec.

(NB: To save time, you can interrupt the installation straight after the cloning.)

5. Apply the Greentea patch:

$ patch \
-d /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64/src \
-p1 < ~/Downloads/issue69.greentea.patch.txt
patching file src/caffe/greentea/libdnn.cpp
patching file src/caffe/greentea/libdnn_conv.cpp

6. Rebuild with ViennaCL debug output enabled (answer y several times when prompted):

$ ck install package:lib-caffe-bvlc-opencl-libdnn-viennacl-universal \
--target_os=android21-arm64 --rebuild --env.CK_VIENNACL_DEBUG=ON
...
Environment entry updated (69031a24319b37f1)!

Recording CK configuration to /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64/ck-install.json ...

Installation path: /home/anton/CK_TOOLS/lib-caffe-bvlc-opencl-libdnn-viennacl-master-android-ndk-4.9.x-android21-arm64

Installation time: 230.220377922 sec.

$ ck show env --tags=lib,caffe,vlibdnn
Env UID:         Target OS:      Bits: Name:                                         Version:       Tags:

69031a24319b37f1 android21-arm64    64 BVLC Caffe framework (opencl,libdnn,viennacl) master-73221fd 64bits,bvlc,caffe,host-os-linux-64,lib,target-os-android21-arm64,v0,v0.0,vlibdnn,vmaster,vopencl
$ ck find env:69031a24319b37f1
/home/anton/CK_REPOS/local/env/69031a24319b37f1

7. Compile the caffe-time-opencl program (using the Caffe environment installed in steps 4-6):

$ ck compile program:caffe-time-opencl --target_os=android21-arm64
...
Compilation time: 4.500 sec.; Object size: 1051896; MD5: ca181afe324feb7efe04ad9dcc394961

8. Install the SqueezeNet 1.1 model to reproduce the failure as per the log:

$ ck install package:caffemodel-deepscale-squeezenet-1.1
$ ck show env --tags=caffemodel,squeezenet,v1.1
Env UID:         Target OS: Bits: Name:                                                      Version: Tags:

933792a5a18249eb   linux-64    64 Caffe model (net and weights) (deepscale, squeezenet, 1.1) 1.1      64bits,bvlc,caffe,caffemodel,deepscale,host-os-linux-64,net,squeezenet,target-os-linux-64,v1,v1.1,weights

9. Run the caffe-time-opencl program (with the device connected via adb, selecting the model installed in step 8 if prompted):

$ ck run program:caffe-time-opencl --target_os=android21-arm64 --cmd_key=default \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.CK_CAFFE_SKIP_BACKWARD

10. See the log:

$ ck find program:caffe-time-opencl
/home/anton/CK_REPOS/ck-caffe/program/caffe-time-opencl
$ cat /home/anton/CK_REPOS/ck-caffe/program/caffe-time-opencl/tmp/stdout.log

Jul 06 '17 08:07 psyhtest

OK. Now it seems the same context is used for the LibDNN kernels, right? This raises the question of what the segfault trace contains now.

Jul 06 '17 09:07 naibaf7

The two different contexts used for the main kernels and the libDNN kernels (which I reported to you via Skype) must have been a figment of my imagination, sorry. I now see the contexts are always the same.

But the driver always segfaults in the same place. This is weird because, by disabling ViennaCL caching, I effectively ensured that the driver is fed the libDNN program as source via clCreateProgramWithSource() and that the resulting program object is then immediately built via clBuildProgram().

I think I should report this issue to Qualcomm. Even if the libDNN program is ill-formed (unlikely due to clCreateProgramWithSource() being happy with it, as well as many other implementations I tested it with), the driver must give an error message, not segfault.

For reference, here's the driver info:

Platform ID: 0
Device ID: 0
Device: QUALCOMM Adreno(TM)
Vendor: QUALCOMM
Hardware (device) version: OpenCL 2.0 Adreno(TM) 540
Software (driver) version: OpenCL 2.0 QUALCOMM build: commit #dd296bd changeid #I7547f23799 Date: 03/29/17 Wed Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.UM.5.7.C1.07.00.00.278.066 Compiler E031.32.00.01
OpenCL C version: OpenCL C 2.0 Adreno(TM) 540
Address bits: 64
Parallel compute units: 4
Work-item dimensions: 3
- max work-item size #0: 1024
- max work-item size #1: 1024
- max work-item size #2: 1024

Jul 06 '17 12:07 psyhtest

Thanks. Yes, I actually noticed a similar problem with the AMD drivers on Windows, where the driver would segfault if the #pragma unroll at one point did not have an even number in it.

This is why there's this quirky line in it:

// Num tiles needs to be next higher even integer
// (due to some quirky bug in AMD OpenCL 2.0 on Windows)
LibDNN<Dtype>::add_def(ss, "v_num_tiles", "(((K - 1)/(TSK*2) + 1)*2)");

Maybe removing the #pragmas or other compiler hints from the source code will allow compilation in your case as well.
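
For reference, here is what that define works out to under integer division. This is only a sketch: the helper names num_tiles and num_tiles_even are made up for illustration and are not part of LibDNN, and K >= 1 and TSK >= 1 are assumed.

```cpp
// Hypothetical helpers, not LibDNN API: they mirror the arithmetic of the
// "v_num_tiles" define using plain integer division.

// Plain tile count: how many TSK-sized tiles are needed to cover K.
constexpr int num_tiles(int K, int TSK) {
  return (K - 1) / TSK + 1;
}

// The expression from the define: the tile count rounded up to the next
// even integer.
constexpr int num_tiles_even(int K, int TSK) {
  return ((K - 1) / (TSK * 2) + 1) * 2;
}

// K = 17, TSK = 8: three tiles cover K, and the define rounds that up to 4.
static_assert(num_tiles(17, 8) == 3, "three tiles of 8 cover 17");
static_assert(num_tiles_even(17, 8) == 4, "rounded up to an even count");
```

When the plain tile count is already even (for example K = 16, TSK = 8), the two helpers agree.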

Jul 06 '17 12:07 naibaf7

I will try with:

$ cd src/caffe/greentea
$ sed -i 's/^.*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_conv.cpp
$ sed -i 's/^.*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_deconv.cpp

changing the code, for example, as follows:

@@ -806,7 +809,7 @@ std::string LibDNNConv<Dtype>::generate_accreg_init(
   } else {
     // Zero init
     if (dterm) {
-      ss << "#pragma unroll" << std::endl;
+// #pragma unroll
       ss << "for (int_tp wm=0; wm<WPTM/VWM; ++wm) {" << std::endl;
       if (unroll) {
         for (int i = 0; i < vwm; ++i) {

NB: One additional manual fix is required in libdnn_deconv.cpp (line 1273):

// #pragma unroll
//     << this->wg_tuner_->template get_param<int>("TSK_UNROLL") << std::endl;

Jul 06 '17 13:07 psyhtest

I've got a build failure with the above change:

ViennaCL: clCreateProgramWithSource returned 0
ViennaCL: clBuildProgram options:
ViennaCL: clBuildProgram returned -11
ViennaCL: clBuildProgram failed
Build Status = -2 ( Err = -11 )
Log: BC-src-code:441:47: error: use of undeclared identifier 'Creg'
 Cptr[globalRow * N + globalCol] = ((Dtype*)(&(Creg[wm][wn/VWN])))[wn%VWN] + v_bmul * biasval;
                                               ^
BC-src-code:446:1: error: extraneous closing brace ('}')
 }
 ^
2 diagnostic(s) generated.

This is the offending code in context with hopefully correct line numbers:

 272 void conv_forward(
 ...
 435 for (int_tp wm=0; wm<WPTM; ++wm) {
 436 int_tp globalRow = offM + tidm + wm * RTSM;
 437 Dtype biasval = Dptr[globalRow];
 438 for (int_tp wn=0; wn<WPTN; ++wn) {
 439 int_tp globalCol = offN + tidn + wn * RTSN;
 440 if (globalRow < M && globalCol < N) {
 441 Cptr[globalRow * N + globalCol] = ((Dtype*)(&(Creg[wm][wn/VWN])))[wn%VWN] + v_bmul * biasval;
 442 }
 443 }
 444 }
 445 }
 446 }

Jul 06 '17 13:07 psyhtest

Hm, just removing the pragmas can't cause that. There must have been a case where more was done than just commenting out the pragma, e.g. the stringstream is now missing a bracket or similar.

Jul 06 '17 14:07 naibaf7

Yeah, it shouldn't have, but it still did. Please have a look at the new patch and log.

Jul 06 '17 14:07 psyhtest

Hmm, I've made a deliberate typo in the source to print the original program, and, indeed, apart from the removed #pragma unroll, the modified program has these additional lines:

> for (int_tp wn=0; wn<WPTN/VWN; ++wn) {
> for (int_tp wm=0; wm<WPTM/VWM; ++wm) {
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 0][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 0][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 1][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 1][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 2][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 2][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 0] = VEC_4_0(Creg[wn + 3][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 1] = VEC_4_1(Creg[wn + 3][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 2] = VEC_4_2(Creg[wn + 3][wm]);
> Asub[(tidn+wn*RTSN)*VWN + 3][(tidm + wn*RTSN)*VWM + 3] = VEC_4_3(Creg[wn + 3][wm]);
> }
> }
> }
> barrier(CLK_LOCAL_MEM_FENCE);
> {
> Dtype4 Creg;
> for (int_tp lc = 0; lc < ((TSM*TSN-1)/(RTSM*RTSN))/VWM+1; ++lc) {
> int_tp tid = tidm * RTSN + tidn;
> int_tp id = lc * RTSN * RTSM + tid;
> int_tp row = (id / TSN) * VWM;
> int_tp col = id % TSN;
> int_tp globalRow = offM + row;
> int_tp globalCol = offN + col;
> VEC_4_0(Creg) = Asub[col][row + 0];
> if ((globalRow +0) < M && globalCol < N) {
> Cptr[(globalRow +0) * N + globalCol] = VEC_4_0(Creg) + Dptr[globalRow +0];
> }
> VEC_4_1(Creg) = Asub[col][row + 1];
> if ((globalRow +1) < M && globalCol < N) {
> Cptr[(globalRow +1) * N + globalCol] = VEC_4_1(Creg) + Dptr[globalRow +1];
> }
> VEC_4_2(Creg) = Asub[col][row + 2];
> if ((globalRow +2) < M && globalCol < N) {
> Cptr[(globalRow +2) * N + globalCol] = VEC_4_2(Creg) + Dptr[globalRow +2];
> }
> VEC_4_3(Creg) = Asub[col][row + 3];
> if ((globalRow +3) < M && globalCol < N) {
> Cptr[(globalRow +3) * N + globalCol] = VEC_4_3(Creg) + Dptr[globalRow +3];
> }
> }
> }

It seems that the first loop nest is doubly nested but is terminated with three braces. This is where the brace imbalance may come from. But I'm at a loss as to why removing an unroll pragma has the effect of introducing additional code.
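
A quick way to locate such an imbalance in a generated program is to track the brace depth line by line. The following is a minimal sketch, not something taken from Caffe or ViennaCL: the function name is hypothetical, and it naively counts braces inside comments and string literals too, which is assumed not to matter for the generated kernel source.

```cpp
// Hypothetical helper: returns the 1-based line number of the first '}' that
// has no matching '{' before it, or 0 if there is no such brace.
// Needs C++14 or later because of the loop inside a constexpr function.
constexpr int first_unmatched_close(const char* src) {
  int depth = 0;  // number of currently open '{'
  int line = 1;   // current line number
  for (; *src != '\0'; ++src) {
    if (*src == '\n') {
      ++line;
    } else if (*src == '{') {
      ++depth;
    } else if (*src == '}') {
      if (depth == 0) {
        return line;
      }
      --depth;
    }
  }
  return 0;
}

// A doubly nested loop that is closed with three braces.
constexpr const char* kSnippet =
    "for (int wn = 0; wn < 2; ++wn) {\n"
    "for (int wm = 0; wm < 2; ++wm) {\n"
    "x += 1;\n"
    "}\n"
    "}\n"
    "}\n";

static_assert(first_unmatched_close(kSnippet) == 6,
              "the third closing brace is the unmatched one");
```

Run over a whole kernel, the reported line depends on which scopes are still open at that point, so it may differ from the line where the extra brace was introduced.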

Jul 07 '17 07:07 psyhtest

It seems that you accidentally uncommented the lines 1050 to 1072? That would be my guess. It's this part:

  // Store the final results in C
  /*ss << "#pragma unroll 1" << std::endl;
  ss << "for (int_tp wn=0; wn<WPTN/VWN; ++wn) {" << std::endl;
  ss << "#pragma unroll" << std::endl;
  ss << "for (int_tp wm=0; wm<WPTM/VWM; ++wm) {" << std::endl;
  for (int j = 0; j < vwn; ++j) {
    for (int i = 0; i < vwm; ++i) {
      ss << "Asub[(tidn+wn*RTSN)*VWN + " << j << "][(tidm + wn*RTSN)*VWM + " << i << "] = VEC_" << vwm << "_" << i << "(Creg[wn + " << j << "][wm]);" << std::endl;
    }
  }
  ss << "}" << std::endl;
  ss << "}" << std::endl;
  ss << "}" << std::endl;  // Scoping for C registers

  ss << "barrier(CLK_LOCAL_MEM_FENCE);" << std::endl;

  // Store the final results in C
  ss << "{" << std::endl; // Scoping for storing C
  ss << "Dtype" << vwm << " Creg;" << std::endl;
  ss << "#pragma unroll 1" << std::endl;
  ss << "for (int_tp lc = 0; lc < ((TSM*TSN-1)/(RTSM*RTSN))/VWM+1; ++lc) {" << std::endl;
  ss << "int_tp tid = tidm * RTSN + tidn;" << std::endl;
  ss << "int_tp id = lc * RTSN * RTSM + tid;" << std::endl;
  ss << "int_tp row = (id / TSN) * VWM;" << std::endl;
  ss << "int_tp col = id % TSN;" << std::endl;
  ss << "int_tp globalRow = offM + row;" << std::endl;
  ss << "int_tp globalCol = offN + col;" << std::endl;
  for (int i = 0; i < vwm; ++i) {
    ss << "VEC_" << vwm << "_" << i << "(Creg) = Asub[col][row + " << i << "];" << std::endl;
    ss << "if ((globalRow +" << i << ") < M && globalCol < N) {" << std::endl;
    if (bias_term_) {
      ss << "Cptr[(globalRow +" << i << ") * N + globalCol] = VEC_" << vwm << "_" << i << "(Creg) + Dptr[globalRow +" << i << "];" << std::endl;
    } else {
      ss << "Cptr[(globalRow +" << i << ") * N + globalCol] = VEC_" << vwm << "_" << i << "(Creg);" << std::endl;
    }
    ss << "}" << std::endl;
  }
  ss << "}" << std::endl;
  ss << "}" << std::endl; // Scoping for storing C*/

Jul 07 '17 09:07 naibaf7

Ha, that would explain it, thanks! My regex wasn't good enough. I'll try again.

Jul 07 '17 14:07 psyhtest

With the new regex only replacing #pragma printers starting with whitespace (Greentea patch):

$ sed -i  's/^\s*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_conv.cpp
$ sed -i  's/^\s*ss\ <<\ \"#pragma\ unroll.*/\/\/\ #pragma\ unroll/g' libdnn_deconv.cpp

I produced a program without #pragmas. Unfortunately, the driver still segfaults.
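
The difference between the two expressions can be sketched in C++ with std::regex. This is only an illustration, not part of the patch: the names kOldPattern, kNewPattern and would_rewrite are made up, and std::regex_match on a whole line stands in for sed's ^-anchored substitution. The first expression also matches a line that opens a block comment in front of the printer (/*ss << ...), so replacing that whole line drops the /* and presumably uncomments the code that follows.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Patterns corresponding to the two sed expressions, in ECMAScript syntax.
// The leading '^' is implied because std::regex_match must match the whole line.
const std::regex kOldPattern(R"re(.*ss << "#pragma unroll.*)re");
const std::regex kNewPattern(R"re(\s*ss << "#pragma unroll.*)re");

// Hypothetical helper: true if a sed substitution with this pattern would
// rewrite the given line.
bool would_rewrite(const std::string& line, const std::regex& pattern) {
  return std::regex_match(line, pattern);
}

// Checks the two kinds of lines that occur in libdnn_conv.cpp.
void check_example_lines() {
  const std::string printer = "      ss << \"#pragma unroll\" << std::endl;";
  const std::string comment_opener =
      "  /*ss << \"#pragma unroll 1\" << std::endl;";

  // An ordinary printer line is rewritten by both expressions.
  assert(would_rewrite(printer, kOldPattern));
  assert(would_rewrite(printer, kNewPattern));

  // The line that opens the block comment is only hit by the old expression.
  assert(would_rewrite(comment_opener, kOldPattern));
  assert(!would_rewrite(comment_opener, kNewPattern));
}
```

Calling check_example_lines() passes silently when the patterns behave as described.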

I've sent the original program to the top OpenCL guys at Qualcomm. Will report back any workarounds they suggest.

Jul 07 '17 19:07 psyhtest

Ok thanks a lot. I currently have no other suggestion as to what could go wrong.

Jul 07 '17 19:07 naibaf7

I've got a reply from Qualcomm which they allowed me to share here:

To work around the issue, please change arrays of float4 vectors to arrays of scalar float, whenever the array is used within a loop.

For example, in line 285 of issue69.libdnn.cl, you have:

Dtype4 Creg[WPTM][WPTN/VWN];

This can be changed to:

Dtype Creg[WPTM][WPTN/VWN * 4];

And subsequent use of Creg will need to be modified to reflect the change from Dtype4 to Dtype.

The issue occurs when such a vector array is used in a loop, so if you have an array of vectors that's not used in a loop, the issue will not happen. This can happen whether the array is declared as private memory within the kernel or whether it is passed in as a kernel argument.
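
One way to read the suggested change is as an index mapping from (vector, lane) pairs to positions in the scalar array. The sketch below is an interpretation, not code from Qualcomm or LibDNN: the function names are hypothetical, and a vector width of VWN == 4 (matching Dtype4) is assumed.

```cpp
// Assumed vector width; it matches the 4 lanes of Dtype4.
constexpr int VWN = 4;

// Vector layout, Dtype4 Creg[WPTM][WPTN/VWN]:
// logical column wn lives in lane (wn % VWN) of vector (wn / VWN).
constexpr int vector_slot(int wn) { return wn / VWN; }
constexpr int vector_lane(int wn) { return wn % VWN; }

// Scalar layout, Dtype Creg[WPTM][WPTN/VWN * 4]:
// each vector is flattened into 4 consecutive scalars.
constexpr int scalar_index(int wn) {
  return vector_slot(wn) * 4 + vector_lane(wn);
}

// With VWN == 4 the flattened index is simply the logical column itself.
static_assert(scalar_index(0) == 0, "first lane of the first vector");
static_assert(scalar_index(5) == 5, "lane 1 of vector 1");
static_assert(scalar_index(7) == 7, "lane 3 of vector 1");
```

Under that assumption, an access such as ((Dtype*)(&(Creg[wm][wn/VWN])))[wn%VWN] would presumably become Creg[wm][wn] after the change.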

Jul 28 '17 12:07 psyhtest

@naibaf7 I don't suppose you want to merge the above workaround into libDNN? I could probably try it on a separate branch if you could suggest where to put these changes.

Jul 28 '17 12:07 psyhtest

@psyhtest Thanks, this is great to know. No, I don't want to hard-code that, but there are actually tuning parameters that change the vectorization data type, so if I remember my own code correctly, setting the LibDNN internal tuning parameters appropriately should allow the kernels to compile.

I'll also consider testing vector data access as a part of the pre-tuning phase of LibDNN then...

Jul 28 '17 16:07 naibaf7
