VC4CL icon indicating copy to clipboard operation
VC4CL copied to clipboard

Running final binary as sudo

Open alexzk1 opened this issue 6 years ago • 82 comments

...and getting

INFO:0] Initialize OpenCL runtime... [ INFO:0] Successfully initialized OpenCL cache directory: /root/.cache/opencv/3.4.1/opencl_cache/ [ INFO:0] Preparing OpenCL cache configuration for context: 32-bit--Broadcom--VideoCore_IV_GPU--0_4 OpenCV(3.4.1) Error: Unknown error code -220 (OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL)) in getProgramBinary, file /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp, line 3752 [ WARN:0] Can't save OpenCL binary into cache: /root/.cache/opencv/3.4.1/opencl_cache/32-bit--Broadcom--VideoCore_IV_GPU--0_4/imgproc--filterSepRow_e99b92fca8604fe253f3c641802ce117.bin OpenCV(3.4.1) /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp:3752: error: (-220) OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL) in function getProgramBinary

OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false OpenCV(3.4.1) Error: Unknown error code -220 (OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL)) in getProgramBinary, file /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp, line 3752 [ WARN:0] Can't save OpenCL binary into cache: /root/.cache/opencv/3.4.1/opencl_cache/32-bit--Broadcom--VideoCore_IV_GPU--0_4/imgproc--filterSepRow_e99b92fca8604fe253f3c641802ce117.bin OpenCV(3.4.1) /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp:3752: error: (-220) OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL) in function getProgramBinary

OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false FPS 0.0205593, Objects: 0 OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false

Is it problem of what ? >: Not supported features, rights, opencl? Made just in case

sudo chmod 777 /root/.cache/opencv/3.4.1/opencl_cache/32-bit--Broadcom--VideoCore_IV_GPU--0_4

and didnt work

alexzk1 avatar Jul 05 '18 11:07 alexzk1

You get two different error codes:

OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54)

OpenCV seems to not be able to handle small work-group sizes, or at least has a lower bound larger than the supported work-group size of VC4CL.

Error: Unknown error code -220

This seem to be that the call to clGetProgramInfo failed with somehow both the error code CL_INVALID_VALUE (-30) and an unknown error code -220.

doe300 avatar Jul 05 '18 12:07 doe300

Fails here:Mailbox::executeQPU on sending ioctl, everything hangs as

[VC4CL] Running work-group 0, 0, 0 [VC4CL] Mailbox buffer before: [VC4CL] 0000: 0x00000028 [VC4CL] 0004: 0x00000000 [VC4CL] 0008: 0x00030011 [VC4CL] 000c: 0x00000010 [VC4CL] 0010: 0x00000010 [VC4CL] 0014: 0x0000000c [VC4CL] 0018: 0x5e10df58 [VC4CL] 001c: 0x00000000 [VC4CL] 0020: 0x00007530 [VC4CL] 0024: 0x751687cc

> Code is:
>  R"CLC(
>         __kernel void TestMagic(const int total, const int is_deeper_magic, const float alpha_s, const float fore_th, __global const float* gradx, __global const float* grady,
>                                                  //in/out
>                                                  __global float* BSx,  __global float* BSy, __global int* mapRes)
>         {
>             private const size_t i        = get_global_id(0);
>             private const size_t gpu_used = get_global_size(0);
> 
>             private const size_t elements_count = total / (gpu_used * 16);
>             private const size_t offset = i * total / gpu_used;
> 
>             for (size_t k = 0; k < elements_count; ++k)
>             {
>                int16 mr           = vload16( k , mapRes + offset);
>                const int16 twos   = 2;
>                mr += twos;
>                vstore16(mr, k, mapRes + offset);
>             }
>         }
>         )CLC",
> > 

During compilation:

[VC4CL] Precompiling source with: [VC4CL] Precompilation complete with status: 0 [VC4CL] Compilation log: [W] Sat Jul 7 03:04:50 2018: Warnings in precompilation: [W] Sat Jul 7 03:04:50 2018: :90:9: warning: null character ignored <U+0000> ^ 1 warning generated.

Any ideas? >: QPULib examples work...(https://github.com/mn416/QPULib)

alexzk1 avatar Jul 07 '18 00:07 alexzk1

img_2906

alexzk1 avatar Jul 07 '18 02:07 alexzk1

Sometimes, if the code generated for a kernel does something completely wrong, then the QPUs get into a hanged state. Looks like this is happening here

doe300 avatar Jul 07 '18 07:07 doe300

and how to deal with? is it compiler problem or my code?

alexzk1 avatar Jul 07 '18 13:07 alexzk1

Most likely a compiler problem. I found an issue which also occurs in the kernel given, and will try to fix it.

doe300 avatar Jul 07 '18 13:07 doe300

what exactly ? I could change my kernel maybe to skip this problem.

alexzk1 avatar Jul 07 '18 13:07 alexzk1

This commit does not fix issue, still hangs https://github.com/doe300/VC4C/commit/f13120debba4d59da6b8ca432a22eaf64e338550

But confirmed - it is wrong division, modified kernel works like a charm:

__kernel void TestMagic(const int total, const int is_deeper_magic, const float alpha_s, const float fore_th, __global const float* gradx, __global const float* grady,
                                                 //in/out
                                                 __global float* BSx,  __global float* BSy, __global int* mapRes)
        {
            private const size_t i        = get_global_id(0);
            private const size_t gpu_used = get_global_size(0);

            private const size_t elements_count = 1;//total / (gpu_used * 16);
            private const size_t offset = i * total ;/// gpu_used;

            for (size_t k = 0; k < elements_count; ++k)
            {
               int16 mr           = vload16( k , mapRes + offset);
               const int16 twos   = 2;
               mr += twos;
               vstore16(mr, k, mapRes + offset);
            }
        }

alexzk1 avatar Jul 09 '18 09:07 alexzk1

Yeah, the integer division (and modulo) is still wrong. I could not figure out the reason for this yesterday.

doe300 avatar Jul 09 '18 10:07 doe300

does it mean float div will work? I need that for atan2, which contains same integer div (for offset) and inside atan2 y/x ... btw in you souce it is some comment there, maybe you do x/y ?

alexzk1 avatar Jul 09 '18 10:07 alexzk1

Floating point division works, at least for the tests I ran, with worse-than-allowed accuracy in some cases. I don't know if the atan2 function works..

doe300 avatar Jul 09 '18 10:07 doe300

Yes, thats weird. Opencl on nvidia give a bit different results, then opencv-default(cpu?) does: 1st is opencv-default, 2nd is opencl-nvidia kernel:

DUMP: x = 0; y = 2; a = 1.5708 x = -4; y = 4; a = 2.35636 x = 6; y = 8; a = 0.927403 x = 21; y = 13; a = 0.554224 x = 18; y = 16; a = 0.726774 x = 12; y = 6; a = 0.463683 x = 15; y = -13; a = 5.569 x = 8; y = -22; a = 5.06123 x = -9; y = -11; a = 4.02667 x = -11; y = 11; a = 2.35636 x = -2; y = 18; a = 1.68144 x = 4; y = 4; a = 0.785232 x = 4; y = -12; a = 5.0342 x = 7; y = -13; a = 5.20633 x = 8; y = 0; a = 0 x = 0; y = 10; a = 1.5708 DUMP: x = 0; y = 2; a = 1.5708 x = -4; y = 4; a = 2.35619 x = 6; y = 8; a = 0.927295 x = 21; y = 13; a = 0.554308 x = 18; y = 16; a = 0.726642 x = 12; y = 6; a = 0.463648 x = 15; y = -13; a = 5.56909 x = 8; y = -22; a = 5.06116 x = -9; y = -11; a = 4.02666 x = -11; y = 11; a = 2.35619 x = -2; y = 18; a = 1.68145 x = 4; y = 4; a = 0.785398 x = 4; y = -12; a = 5.03414 x = 7; y = -13; a = 5.20633 x = 8; y = 0; a = 0 x = 0; y = 10; a = 1.5708

Kernel is

private float16 x = vload16( k , gradx + offset);
               private float16 y = vload16( k , grady + offset);
               float16 a = atan2(y, x);
               a = select(a, a + 2 * (float)M_PI, a < 0);
               vstore16(a, k, radians + offset);

On RPI something is broken completely:

DUMP: x = 0; y = 2; a = 1.5708 x = -4; y = 4; a = 2.35636 x = 6; y = 8; a = 0.927403 x = 21; y = 13; a = 0.554224 x = 18; y = 16; a = 0.726774 x = 12; y = 6; a = 0.463683 x = 15; y = -13; a = 5.569 x = 8; y = -22; a = 5.06123 x = -9; y = -11; a = 4.02667 x = -11; y = 11; a = 2.35636 x = -2; y = 18; a = 1.68144 x = 4; y = 4; a = 0.785232 x = 4; y = -12; a = 5.0342 x = 7; y = -13; a = 5.20633 x = 8; y = 0; a = 0 x = 0; y = 10; a = 1.5708 DUMP: x = 0; y = 2; a = 0 x = -4; y = 4; a = -0.785399 x = 6; y = 8; a = 0.927296 x = 21; y = 13; a = 0.554308 x = 18; y = 16; a = 0.726643 x = 12; y = 6; a = 0.463648 x = 15; y = -13; a = -0.714091 x = 8; y = -22; a = -1.22203 x = -9; y = -11; a = 0.885067 x = -11; y = 11; a = -0.785399 x = -2; y = 18; a = -1.46014 x = 4; y = 4; a = 0.785399 x = 4; y = -12; a = -1.24905 x = 7; y = -13; a = -1.07686 x = 8; y = 0; a = 0 x = 0; y = 10; a = 0

alexzk1 avatar Jul 09 '18 13:07 alexzk1

Used nvidia reference implementation (http://developer.download.nvidia.com/cg/atan2.html),on desktop it's ok.

float16 myatan2(float16 y, float16 x)
        {
          float16 t0, t1, t2, t3, t4;

          t3 = fabs(x);
          t1 = fabs(y);
          t0 = max(t3, t1);
          t1 = min(t3, t1);
          t3 = 1.f / t0;
          t3 = t1 * t3;

          t4 = t3 * t3;
          t0 =         - 0.013480470f;
          t0 = t0 * t4 + 0.057477314f;
          t0 = t0 * t4 - 0.121239071f;
          t0 = t0 * t4 + 0.195635925f;
          t0 = t0 * t4 - 0.332994597f;
          t0 = t0 * t4 + 0.999995630f;
          t3 = t0 * t3;

          t3 = (fabs(y) > fabs(x)) ? 1.570796327f - t3 : t3;
          t3 = (x < 0) ?  3.141592654f - t3 : t3;
          t3 = (y < 0) ? -t3 : t3;

          return t3;
        }

On RPI it works as well:

DUMP: x = 0; y = 2; a = 1.5708 x = -4; y = 4; a = 2.35636 x = 6; y = 8; a = 0.927403 x = 21; y = 13; a = 0.554224 x = 18; y = 16; a = 0.726774 x = 12; y = 6; a = 0.463683 x = 15; y = -13; a = 5.569 x = 8; y = -22; a = 5.06123 x = -9; y = -11; a = 4.02667 x = -11; y = 11; a = 2.35636 x = -2; y = 18; a = 1.68144 x = 4; y = 4; a = 0.785232 x = 4; y = -12; a = 5.0342 x = 7; y = -13; a = 5.20633 x = 8; y = 0; a = 0 x = 0; y = 10; a = 1.5708 DUMP: x = 0; y = 2; a = 1.5708 x = -4; y = 4; a = 2.3562 x = 6; y = 8; a = 0.927294 x = 21; y = 13; a = 0.554308 x = 18; y = 16; a = 0.72664 x = 12; y = 6; a = 0.463646 x = 15; y = -13; a = -0.714088 x = 8; y = -22; a = -1.22203 x = -9; y = -11; a = -2.25652 x = -11; y = 11; a = 2.3562 x = -2; y = 18; a = 1.68145 x = 4; y = 4; a = 0.785395 x = 4; y = -12; a = -1.24905 x = 7; y = -13; a = -1.07686 x = 8; y = 0; a = 0 x = 0; y = 10; a = 1.5708

Almost...negatives may mean it does not do "select" properly tooin this line: a = select(a, a + pi2, a < 0);

alexzk1 avatar Jul 09 '18 13:07 alexzk1

Opencl on nvidia give a bit different results, then opencv-default

Depending on the compilation flags you specified, the NVIDIA code may use faster but inaccurate operations (e.g. due to -cl-fast-relaxed-math or -cl-mad-enable).

On RPI something is broken completely:

This looks like the atan2 function yielding wrong results. Some/a lot of the math functions in VC4CL are not correct or properly tested.

Used nvidia reference implementation [...]

Thanks for the link, this might come in very handy, if I can figure out its license...

doe300 avatar Jul 09 '18 14:07 doe300

Still can't make "select" work >: is it broken too ? Bcs I was dependant on in another kernel as well (that initial I simplified for test). It seems it just do nothing.

alexzk1 avatar Jul 09 '18 14:07 alexzk1

Depending on the compilation flags you specified, the NVIDIA code may use faster but inaccurate operations (e.g. due to -cl-fast-relaxed-math or -cl-mad-enable).

I was using progs.build("-cl-opt-disable");

alexzk1 avatar Jul 09 '18 14:07 alexzk1

Okey, replaced "select" by a = fmod(a + pi2, pi2); But that is "half-solution" actually, because how to do logic yet? >:

alexzk1 avatar Jul 09 '18 14:07 alexzk1

Ok made custom select for float16, for integers must be even simplier:

 float16 myselect(float16 afalse, float16 atrue, int16 condition)
        {
            //we have -1 = true in condition ...it should be so
            float16 cond = convert_float16(condition) * -1.f;
            float16 not_cond = 1.f - cond;
            return atrue * cond + afalse * not_cond;
        }
int16 myselecti16(int16 afalse, int16 atrue, int16 condition)
        {
            //we have -1 = true in condition ...it should be so
            int16 cond     = -1 * condition;
            int16 not_cond = 1 - cond;
            return atrue * cond + afalse * not_cond;
        }

btw this works 10 times faster, in original select i had to do convert_int16 because result was float16 on rpi, and that was 200ms instead 17ms now for kernel on nvidia. On RPI though no difference on speed (still not sure if original select was working at all, most likely not)

alexzk1 avatar Jul 09 '18 14:07 alexzk1

Do I need to rebuild compiler and vc4cl if stdlib updated? or just reinstall stdlib (including pch) ?

alexzk1 avatar Jul 09 '18 19:07 alexzk1

You need to re-trigger building of the PCH and BC for vc4cl-stdlib, which is done by the script located in ${CMAKE_BINARY_DIR}/build/postinst from the VC4C project for installed stdlib and make vc4cl-stdlib for stdlib sources (in which case you have to delete <VC4CLStdLib-root>/include/VC4CLStdLib.h.pch and <VC4CLStdLib-root>/include/VC4CLStdLib.bc).

doe300 avatar Jul 09 '18 19:07 doe300

Ok..so no compiler rebuild if pch is built as separated package like here https://github.com/alexzk1/vc4_stdlib_arch/blob/master/PKGBUILD ?

Btw, building VC4C gives many such warnings, is it ok ? (gcc 8) usr/include/c++/8.1.0/bits/stl_vector.h: In member function 'std::vector<exprValue> Parser::parseArgumentList(const string&, size_t)': /usr/include/c++/8.1.0/bits/stl_vector.h:1085:4: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator<exprValue*, std::vector<exprValue> >' changed in GCC 7.1 In file included from /usr/include/c++/8.1.0/vector:69,

Also:

: warning: "_GNU_SOURCE" redefined : note: this is the location of the previous definition

I think that u kinda missing inline or so ... or maybe #ifdef or #undef

alexzk1 avatar Jul 09 '18 19:07 alexzk1

Well, the warnings are both libstdc++ internal warnings. I can try to disable the warning, but I don't think I can do anything about fixing them.

doe300 avatar Jul 10 '18 06:07 doe300

fmod(float16, float16) is still broken - pi hangs. And maybe something else, replaced fmod by equivalent, but...on next run it was ok, and next-next run hanged again...

Got some error more (works on desktop):

Failed to compiler kernels: [E] Fri Jul 13 19:39:36 2018: Error assigning local to register: %vecinit33 [E] Fri Jul 13 19:39:36 2018: Error assigning local to register: %vecinit52 [E] Fri Jul 13 19:39:37 2018: Background worker threw error: Label/Register Mapping: There are erroneous register-associations! [E] Fri Jul 13 19:39:37 2018: While running worker task: CodeGenerator [E] Fri Jul 13 19:39:40 2018: Compiler threw exception: Label/Register Mapping: There are erroneous register-associations!

alexzk1 avatar Jul 13 '18 15:07 alexzk1

img_2908 No kernel panic, but that red dots around... maybe because connected monitor? And hang. I'm trying to do chained kernels calls (from C++) with preserving __global buffer between calls - out from 1st comes as input to 2nd.

Ok, it seems broken islessequal(float16, float16) and isgreaterequal(float16, float16) - replaced by isless(float16, float16) and code works a bit longer...>: and still hangs. Not sure. Okey, seems I narrowed down the problem:

This code fails:

                atest =  isless(fabs(angle - pi2), pi8); //90 not sure why, but this works better 90 = up/left
                 p1 = myselectf16(p1, Z4, atest);
                 p2 = myselectf16(p2, Z6, atest);

This works:

atest =  isless(angle, pi8) ; 
                 p1 = myselectf16(p1, Z2, atest);
                 p2 = myselectf16(p2, Z8, atest);

Zs are:

#define Z1 ((float16)(a, b.s0123, b.s456789ab, b.scde))
        #define Z2 (b)
        #define Z3 (float16) (b.s123, b.s4567, b.s89abcdef, c)
        #define Z4 ((float16)(d, e.s0123, e.s456789ab, e.scde))
        #define Z5 (e)
        #define Z6 (float16)(e.s123, e.s4567, e.s89abcdef, f)
        #define Z7 (float16)(g, h.s0123, h.s456789ab, h.scde)
        #define Z8 (h)
        #define Z9 (float16)(h.s123, h.s4567, h.s89abcdef, i)

alexzk1 avatar Jul 13 '18 17:07 alexzk1

Seems something wrong with optimizations, changed code to

float16 delta = fabs(angle - pi2);

              atest =  isless(delta, pi8); //90 not sure why, but this works better 90 = up/left
              p1 = myselectf16(p1, Z4, atest);
              p2 = myselectf16(p2, Z6, atest);

              delta = fabs(angle - pi4);
              atest =  isless(delta, pi8); //45
              p1 = myselectf16(p1, Z3, atest);
              p2 = myselectf16(p2, Z7, atest);

              delta = fabs(angle - pi1);
              atest =  isless(angle, pi8) || isless(delta, pi8); //0
              p1 = myselectf16(p1, Z2, atest);
              p2 = myselectf16(p2, Z8, atest);


              delta = fabs(angle - pi34);
              atest =  isless(delta, pi8); //135
              p1 = myselectf16(p1, Z1, atest);
              p2 = myselectf16(p2, Z9, atest);

And it fails:

%call.i = tail call spir_func i32 bitcast (i32 ()* @myfabs to i32 (<16 x float>)*)(<16 x float> %call8) #9Failed to compiler kernels: [E] Fri Jul 13 23:24:38 2018: Compiler threw exception: Parser: Unhandled type of indirect function call!

alexzk1 avatar Jul 13 '18 20:07 alexzk1

holdon...why it says "myfabs", I have removed that func and it's not on disk at all in sources. Ok, found..."myfabs" - that what I left (not deleted)...variant with "delta" variable still hangs.

alexzk1 avatar Jul 13 '18 21:07 alexzk1

Did such macro, hanged immediately without any 1 success step

#define Z1 ((float16)(a, b.s0123, b.s456789ab, b.scde)) #define Z2 (b) #define Z3 ((float16) (b.s123, b.s4567, b.s89abcdef, c)) #define Z4 ((float16)(d, e.s0123, e.s456789ab, e.scde)) #define Z5 (e) #define Z6 ((float16)(e.s123, e.s4567, e.s89abcdef, f)) #define Z7 ((float16)(g, h.s0123, h.s456789ab, h.scde)) #define Z8 (h) #define Z9 ((float16)(h.s123, h.s4567, h.s89abcdef, i))

Hangs without extarnal () as well.. damn..so why...cant find exact reason >:

alexzk1 avatar Jul 13 '18 22:07 alexzk1

Okey, this seems working - function calls without macroses as parameters, code is wrong, but works, so how is it possible? >:

 p1 = myselectf16(p1, p2, atest);
        p2 = myselectf16(p2, p1, atest);

        atest =  isless(fabs(angle - pi4), pi8); //45
        p1 = myselectf16(p1, p2, atest);
        p2 = myselectf16(p2, p1, atest);

        atest =  islessequal(angle, pi8) || islessequal(fabs(angle - pi1), pi8); //0
        p1 = myselectf16(p1, p2, atest);
        p2 = myselectf16(p2, p1, atest);


        atest =  isless(fabs(angle - pi34), pi8); //135
        p1 = myselectf16(p1, p2, atest);
        p2 = myselectf16(p2, p1, atest);

In another kernel I had usage of those Z's in ariphmetics like float16 Gx = (Z7 + 2 * Z8 + Z9) - (Z1 + 2 * Z2 +Z3);

and it is working ok.

Also in this same example I had another line below vstore16(myselectf16(0, Z5, isless(p2, Z5) && isless(p1, Z5)), 0, ( __global float*)(N + dstPaddedIndex));

and it worked yet...how?

alexzk1 avatar Jul 14 '18 00:07 alexzk1

And maybe something else, replaced fmod by equivalent, but...on next run it was ok, and next-next run hanged again...

Did you reboot your Pi in between runs? Sometimes when the GPU hangs, it can only recovered by rebooting it.

No kernel panic, but that red dots around... maybe because connected monitor?

It is possible, that the graphics driver collides with VC4CL, at least if VC4CL is compiled with REGISTER_POKE_KERNELS on. If the flag is disabled, the VC4CL uses the kernel synchronization routines, the graphics driver uses too (but see #15, idk if this is still valid).

Okey, this seems working - function calls without macroses as parameters, code is wrong, but works, so how is it possible?

One theory: There is something wrong with vector shuffling (putting together vector from pieces of other vector), which you do in the macros.

doe300 avatar Jul 14 '18 11:07 doe300

One theory: There is something wrong with vector shuffling (putting together vector from pieces of other vector), which you do in the macros.

Yes, possibly. Assignign like float16 z1 = Z1; and using variable didnt work too. However as I said other kernels were working. But not sure if gave correct results actually.

Did you reboot your Pi in between runs? Sometimes when the GPU hangs, it can only recovered by rebooting it.

I power off it. It hangs completely, so have to remove cable.

alexzk1 avatar Jul 14 '18 17:07 alexzk1