pyopencl
graceful closing logic?
There is a lack of documentation about how to gracefully close:
- Memory buffer
- Program
- Command queue
- Context
Should I call a specific method, or just lose all references to the objects?
- OpenCL is reference counted, so even at the OpenCL API level, the only way to free resources is to lose all references.
- Since each Python object acting as a handle for an OpenCL object holds one such reference, you may want fine-grained control over when this reference is released. That's what the `.release()` methods throughout the API are for, e.g. here.
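For example, a minimal sketch of both approaches (the buffer name and size here are purely illustrative):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.arange(1024, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=a)

# ... use buf in kernels ...

# Option 1: drop the OpenCL reference held by this handle immediately.
buf.release()

# Option 2: just lose all Python references; the underlying OpenCL object
# is freed once its reference count reaches zero.
# del buf
```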
I'm happy to consider additions to the docs that make this clearer. Could you maybe open a PR proposing such a change?
Thanks for the explanation.

Sorry, I have no time for a doc PR.

I am trying to create a PyTorch-like ML framework based on pyopencl.

> Sorry, I have no time for a doc PR.

No worries.

> I am trying to create a PyTorch-like ML framework based on pyopencl.

Cool. Looking forward to seeing what you build.
@inducer first benchmarks show that a simple matmul is 3x slower on OpenCL than TensorFlow, but conv is only 25% slower in my implementation.
The bottleneck in the OpenCL kernel is the for-loop.
Some tips:
- Add a bunch of CL build options: `"-cl-denorms-are-zero"`, `"-cl-fast-relaxed-math"`, `"-cl-finite-math-only"`, `"-cl-mad-enable"`, `"-cl-no-signed-zeros"`
- `get_group_id()` and `get_local_id()` return 64-bit integers, which can lead to 64-bit index math, which can be slow. CUDA only does 32-bit index math.
- Ultimately, the PTX generated by CUDA and CL should be (mostly) the same. You can find the PTX from pyopencl as `prg.binaries[0]`.
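A hedged sketch of how those tips translate to pyopencl host code (the toy kernel is only for illustration):

```python
import pyopencl as cl

ctx = cl.create_some_context()

KERNEL_SRC = """
__kernel void impl(__global float *out)
{
    // Cast explicitly to 32-bit: get_global_id() returns size_t
    // (64-bit on most GPUs), and 64-bit index math can be slow.
    int gid = (int) get_global_id(0);
    out[gid] = 2.0f * gid;
}
"""

prg = cl.Program(ctx, KERNEL_SRC).build(options=[
    "-cl-denorms-are-zero",
    "-cl-fast-relaxed-math",
    "-cl-finite-math-only",
    "-cl-mad-enable",
    "-cl-no-signed-zeros",
])

# On NVIDIA, the program "binary" is the PTX text, which you can compare
# with what CUDA generates for an equivalent kernel.
print(prg.binaries[0].decode())
```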
Thanks. But nothing changed :) New tests show that OpenCL conv is actually 63% slower.
Actually, ML is not hard math. It's just multiplications and summations 99.9% of the time.
You can compare the conv or matmul operation of PyTorch or TensorFlow with your own OpenCL kernel.
With matmul I got to 2.75x slower than TF on OpenCL. Cannot get more speed :(
Matmul is very simple (the doubled braces are presumably Python `str.format` escapes; `O1`, `A1`, and the `*_idx` macros are filled in by the host code):

```c
__kernel void impl(__global float* O, __global const float* A, __global const float* B)
{{
    int gid = get_global_id(0);
    int o1 = gid % O1;
    int o0 = gid / O1;
    float v = 0.0f;
    for (int a1 = 0; a1 < A1; ++a1)
        v += A[A_idx(o0, a1)] * B[B_idx(a1, o1)];
    O[O_idx(o0, o1)] = v;
}}
```
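For context, a hedged sketch of what the host side might look like, assuming the kernel is a Python format-string template; the sizes and `#define`s below are illustrative stand-ins, not the framework's actual code:

```python
import numpy as np
import pyopencl as cl

M, K, N = 512, 512, 512  # A is MxK, B is KxN, O is MxN (example sizes)

KERNEL_TEMPLATE = """
#define O1 {N}
#define A1 {K}
#define A_idx(r, c) ((r) * {K} + (c))
#define B_idx(r, c) ((r) * {N} + (c))
#define O_idx(r, c) ((r) * {N} + (c))

__kernel void impl(__global float* O, __global const float* A, __global const float* B)
{{
    int gid = get_global_id(0);
    int o1 = gid % O1;
    int o0 = gid / O1;
    float v = 0.0f;
    for (int a1 = 0; a1 < A1; ++a1)
        v += A[A_idx(o0, a1)] * B[B_idx(a1, o1)];
    O[O_idx(o0, o1)] = v;
}}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)
o = np.empty((M, N), dtype=np.float32)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
o_buf = cl.Buffer(ctx, mf.WRITE_ONLY, o.nbytes)

# Substitute the sizes into the template, build, and launch one
# work-item per output element.
prg = cl.Program(ctx, KERNEL_TEMPLATE.format(K=K, N=N)).build()
prg.impl(queue, (M * N,), None, o_buf, a_buf, b_buf)

cl.enqueue_copy(queue, o, o_buf)
assert np.allclose(o, a @ b, rtol=1e-3, atol=1e-3)
```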
Also, I tried using work-group local memory, but it gives zero boost.
Found a solution for matmul: unroll the inner loop by hand.

```c
// Unrolled by 8; assumes A1 is a multiple of 8.
for (int a1 = 0; a1 < A1; a1 += 8)
{
    v += A[A_idx(o0, a1  )] * B[B_idx(a1,   o1)];
    v += A[A_idx(o0, a1+1)] * B[B_idx(a1+1, o1)];
    v += A[A_idx(o0, a1+2)] * B[B_idx(a1+2, o1)];
    v += A[A_idx(o0, a1+3)] * B[B_idx(a1+3, o1)];
    v += A[A_idx(o0, a1+4)] * B[B_idx(a1+4, o1)];
    v += A[A_idx(o0, a1+5)] * B[B_idx(a1+5, o1)];
    v += A[A_idx(o0, a1+6)] * B[B_idx(a1+6, o1)];
    v += A[A_idx(o0, a1+7)] * B[B_idx(a1+7, o1)];
}
```
With that, matmul is 1 ms faster than TF :D
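For timing comparisons like the ones in this thread, pyopencl's event profiling reports device-side kernel times; a minimal sketch, reusing the hypothetical `ctx`, `prg`, and buffers from the host-side sketch above:

```python
import pyopencl as cl

# A queue created with profiling enabled records device-side timestamps
# for every enqueued command.
profile_queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

evt = prg.impl(profile_queue, (M * N,), None, o_buf, a_buf, b_buf)
evt.wait()

# Profiling timestamps are in nanoseconds.
elapsed_ms = (evt.profile.end - evt.profile.start) * 1e-6
print(f"kernel time: {elapsed_ms:.3f} ms")
```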