
graceful closing logic?

iperov opened this issue 3 years ago • 11 comments

There is a lack of documentation about how to gracefully close a memory buffer, program, command queue, or context.

Should I call a specific method, or just lose all references to the objects?

iperov avatar Aug 24 '20 11:08 iperov

  • OpenCL is reference counted, so even at the OpenCL API level, the only way to free resources is to lose all references.
  • Since each Python object acting as a handle for an OpenCL object holds one such reference, you may want fine-grained control over when this reference is released. That's what the .release() methods throughout the API are for (e.g., pyopencl.MemoryObject.release()).
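For concreteness, a minimal sketch of both approaches (the buffer size and flags here are arbitrary):

import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=1024)

# Fine-grained: release this handle's OpenCL reference right now.
buf.release()

# Coarse-grained: just lose the Python references; each underlying
# OpenCL object is freed once its reference count drops to zero.
del queue
del ctx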

I'm happy to consider additions to the docs that make this clearer. Could you maybe open a PR proposing such a change?

inducer avatar Aug 24 '20 15:08 inducer

Thanks for the explanation.

iperov avatar Aug 24 '20 15:08 iperov

Sorry, I have no time for a doc PR.

I am trying to create a PyTorch-like ML framework based on pyopencl.

iperov avatar Aug 24 '20 15:08 iperov

> Sorry, I have no time for a doc PR.

No worries.

> I am trying to create a PyTorch-like ML framework based on pyopencl.

Cool. Looking forward to seeing what you build.

inducer avatar Aug 24 '20 16:08 inducer

@inducer First benchmarks show that a simple matmul is 3x slower on OpenCL than TensorFlow. But conv is only 25% slower in my implementation.

The bottleneck in OpenCL is the for-loop.

iperov avatar Sep 17 '20 05:09 iperov

Some tips:

  • Add a bunch of CL build options (see the sketch after this list):
         -cl-denorms-are-zero
         -cl-fast-relaxed-math
         -cl-finite-math-only
         -cl-mad-enable
         -cl-no-signed-zeros
  • get_group_id() and get_local_id() return 64-bit integers (size_t), which can lead to 64-bit index math, which can be slow. CUDA only does 32-bit index math.
  • Ultimately, the PTX generated by CUDA and CL should be (mostly) the same. You can find the PTX from pyopencl as prg.binaries[0].
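For reference, a minimal sketch of passing those options through pyopencl and dumping the resulting binary (the kernel here is just a placeholder; on NVIDIA the "binary" is PTX text):

import pyopencl as cl

ctx = cl.create_some_context()

options = [
    "-cl-denorms-are-zero",
    "-cl-fast-relaxed-math",
    "-cl-finite-math-only",
    "-cl-mad-enable",
    "-cl-no-signed-zeros",
]

src = "__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2.0f; }"

prg = cl.Program(ctx, src).build(options=options)

# On NVIDIA, the device binary is PTX text; inspect it for 64-bit index math.
print(prg.binaries[0].decode())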

inducer avatar Sep 17 '20 13:09 inducer

Thanks. But nothing changed :) New tests show that the OpenCL conv is actually 63% slower.

iperov avatar Sep 17 '20 14:09 iperov

Actually, ML is not hard math. It's just multiplications and summations 99.9% of the time.

You can compare the conv or matmul operation of PyTorch or TensorFlow with your own OpenCL kernel.

iperov avatar Sep 17 '20 14:09 iperov

With matmul I got to 2.75x slower than TF on OpenCL. Cannot get more speed :(

iperov avatar Sep 17 '20 14:09 iperov

Matmul is very simple:

// O1, A1 and the *_idx() index macros are substituted in at build time.
__kernel void impl(__global float* O, __global const float* A, __global const float* B)
{
    int gid = get_global_id(0);

    int o1 = gid % O1;
    int o0 = gid / O1;

    float v = 0.0f;
    for (int a1 = 0; a1 < A1; ++a1)
        v += A[A_idx(o0, a1)] * B[B_idx(a1, o1)];
    O[O_idx(o0, o1)] = v;
}
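The kernel above relies on O1, A1 and the *_idx() index macros being defined at build time (the doubled braces in the original suggest a Python format string). A self-contained sketch of one plausible host-side setup; the shapes, macro definitions, and row-major layout here are all assumptions, not iperov's actual code:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

O0, O1, A1 = 64, 64, 64  # hypothetical shapes: A is (O0, A1), B is (A1, O1), O is (O0, O1)

defines = f"""
#define O1 {O1}
#define A1 {A1}
#define A_idx(i, j) ((i) * A1 + (j))
#define B_idx(i, j) ((i) * O1 + (j))
#define O_idx(i, j) ((i) * O1 + (j))
"""

src = defines + """
__kernel void impl(__global float* O, __global const float* A, __global const float* B)
{
    int gid = (int) get_global_id(0);  // 32-bit index math, per the tip above
    int o1 = gid % O1;
    int o0 = gid / O1;
    float v = 0.0f;
    for (int a1 = 0; a1 < A1; ++a1)
        v += A[A_idx(o0, a1)] * B[B_idx(a1, o1)];
    O[O_idx(o0, o1)] = v;
}
"""

rng = np.random.default_rng(0)
a = rng.standard_normal((O0, A1), dtype=np.float32)
b = rng.standard_normal((A1, O1), dtype=np.float32)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
o_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=a.dtype.itemsize * O0 * O1)

prg = cl.Program(ctx, src).build()
prg.impl(queue, (O0 * O1,), None, o_buf, a_buf, b_buf)

out = np.empty((O0, O1), dtype=np.float32)
cl.enqueue_copy(queue, out, o_buf)
assert np.allclose(out, a @ b, rtol=1e-4, atol=1e-4)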

I also tried the work-group's local memory, but it gives zero boost.

iperov avatar Sep 17 '20 14:09 iperov

Found a solution for matmul:

// Unroll the inner loop by 8 (assumes A1 is a multiple of 8;
// otherwise a remainder loop is needed).
for (int a1 = 0; a1 < A1; a1 += 8)
{
    v += A[A_idx(o0, a1  )] * B[B_idx(a1,   o1)];
    v += A[A_idx(o0, a1+1)] * B[B_idx(a1+1, o1)];
    v += A[A_idx(o0, a1+2)] * B[B_idx(a1+2, o1)];
    v += A[A_idx(o0, a1+3)] * B[B_idx(a1+3, o1)];
    v += A[A_idx(o0, a1+4)] * B[B_idx(a1+4, o1)];
    v += A[A_idx(o0, a1+5)] * B[B_idx(a1+5, o1)];
    v += A[A_idx(o0, a1+6)] * B[B_idx(a1+6, o1)];
    v += A[A_idx(o0, a1+7)] * B[B_idx(a1+7, o1)];
}

With that, matmul is 1 ms faster than TF :D

iperov avatar Sep 17 '20 14:09 iperov