pyopencl
graceful closing logic?
There is a lack of documentation about how to gracefully close:
- Memory buffer
- Program
- Command queue
- Context
Should I call a specific method, or just lose all references to the objects?
- OpenCL is reference counted, so even at the OpenCL API level, the only way to free resources is to lose all references.
- Since each Python object acting as a handle for an OpenCL object holds one such reference, you may want fine-grained control over when this reference is released. That's what the `.release()` methods throughout the API are for, e.g. here.
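For example, a minimal sketch of both approaches (the buffer name and size here are purely illustrative):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.arange(1024, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=a)

# ... use buf in kernels ...

# Option 1: drop the OpenCL reference held by this handle immediately.
buf.release()

# Option 2: just lose all Python references; the underlying OpenCL object
# is freed once its reference count reaches zero.
# del buf
```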
I'm happy to consider additions to the docs that make this clearer. Could you maybe open a PR proposing such a change?
Thanks for the explanation.

Sorry, I have no time for a doc PR.

I am trying to create a PyTorch-like ML framework based on pyopencl.

> Sorry, I have no time for a doc PR.

No worries.

> I am trying to create a PyTorch-like ML framework based on pyopencl.

Cool. Looking forward to seeing what you build.
@inducer first benchmarks show that a simple matmul is 3x slower on OpenCL than TensorFlow, but conv is only 25% slower in my implementation.
The bottleneck in the OpenCL kernel is the for-loop.
Some tips:
- Add a bunch of CL build options: `"-cl-denorms-are-zero"`, `"-cl-fast-relaxed-math"`, `"-cl-finite-math-only"`, `"-cl-mad-enable"`, `"-cl-no-signed-zeros"`
- `get_group_id()` and `get_local_id()` return 64-bit integers, which can lead to 64-bit index math, which can be slow. CUDA only does 32-bit index math.
- Ultimately, the PTX generated by CUDA and CL should be (mostly) the same. You can find the PTX from pyopencl as `prg.binaries[0]`.
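A hedged sketch of how those tips translate to pyopencl host code (the toy kernel is only for illustration):

```python
import pyopencl as cl

ctx = cl.create_some_context()

KERNEL_SRC = """
__kernel void impl(__global float *out)
{
    // Cast explicitly to 32-bit: get_global_id() returns size_t
    // (64-bit on most GPUs), and 64-bit index math can be slow.
    int gid = (int) get_global_id(0);
    out[gid] = 2.0f * gid;
}
"""

prg = cl.Program(ctx, KERNEL_SRC).build(options=[
    "-cl-denorms-are-zero",
    "-cl-fast-relaxed-math",
    "-cl-finite-math-only",
    "-cl-mad-enable",
    "-cl-no-signed-zeros",
])

# On NVIDIA, the program "binary" is the PTX text, which you can compare
# with what CUDA generates for an equivalent kernel.
print(prg.binaries[0].decode())
```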
Thanks. But nothing changed :) New tests show that OpenCL conv is actually 63% slower.
Actually, ML is not hard math. It's just multiplications and summations 99.9% of the time.
You can compare the conv or matmul operation of PyTorch or TensorFlow with your own OpenCL kernel.
With matmul I got to 2.75x slower than TF on OpenCL. Cannot get more speed :(
Matmul is very simple (the doubled braces are presumably Python `str.format` escapes; `O1`, `A1`, and the `*_idx` macros are filled in by the host code):

```c
__kernel void impl(__global float* O, __global const float* A, __global const float* B)
{{
    int gid = get_global_id(0);
    int o1 = gid % O1;
    int o0 = gid / O1;
    float v = 0.0f;
    for (int a1 = 0; a1 < A1; ++a1)
        v += A[A_idx(o0, a1)] * B[B_idx(a1, o1)];
    O[O_idx(o0, o1)] = v;
}}
```
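For context, a hedged sketch of what the host side might look like, assuming the kernel is a Python format-string template; the sizes and `#define`s below are illustrative stand-ins, not the framework's actual code:

```python
import numpy as np
import pyopencl as cl

M, K, N = 512, 512, 512  # A is MxK, B is KxN, O is MxN (example sizes)

KERNEL_TEMPLATE = """
#define O1 {N}
#define A1 {K}
#define A_idx(r, c) ((r) * {K} + (c))
#define B_idx(r, c) ((r) * {N} + (c))
#define O_idx(r, c) ((r) * {N} + (c))

__kernel void impl(__global float* O, __global const float* A, __global const float* B)
{{
    int gid = get_global_id(0);
    int o1 = gid % O1;
    int o0 = gid / O1;
    float v = 0.0f;
    for (int a1 = 0; a1 < A1; ++a1)
        v += A[A_idx(o0, a1)] * B[B_idx(a1, o1)];
    O[O_idx(o0, o1)] = v;
}}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)
o = np.empty((M, N), dtype=np.float32)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
o_buf = cl.Buffer(ctx, mf.WRITE_ONLY, o.nbytes)

# Substitute the sizes into the template, build, and launch one
# work-item per output element.
prg = cl.Program(ctx, KERNEL_TEMPLATE.format(K=K, N=N)).build()
prg.impl(queue, (M * N,), None, o_buf, a_buf, b_buf)

cl.enqueue_copy(queue, o, o_buf)
assert np.allclose(o, a @ b, rtol=1e-3, atol=1e-3)
```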
Also, I tried using work-group local memory, but it gives zero boost.
Found a solution for matmul: unroll the inner loop by hand.

```c
// Unrolled by 8; assumes A1 is a multiple of 8.
for (int a1 = 0; a1 < A1; a1 += 8)
{
    v += A[A_idx(o0, a1  )] * B[B_idx(a1,   o1)];
    v += A[A_idx(o0, a1+1)] * B[B_idx(a1+1, o1)];
    v += A[A_idx(o0, a1+2)] * B[B_idx(a1+2, o1)];
    v += A[A_idx(o0, a1+3)] * B[B_idx(a1+3, o1)];
    v += A[A_idx(o0, a1+4)] * B[B_idx(a1+4, o1)];
    v += A[A_idx(o0, a1+5)] * B[B_idx(a1+5, o1)];
    v += A[A_idx(o0, a1+6)] * B[B_idx(a1+6, o1)];
    v += A[A_idx(o0, a1+7)] * B[B_idx(a1+7, o1)];
}
```
With that, matmul is 1 ms faster than TF :D
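For timing comparisons like the ones in this thread, pyopencl's event profiling reports device-side kernel times; a minimal sketch, reusing the hypothetical `ctx`, `prg`, and buffers from the host-side sketch above:

```python
import pyopencl as cl

# A queue created with profiling enabled records device-side timestamps
# for every enqueued command.
profile_queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

evt = prg.impl(profile_queue, (M * N,), None, o_buf, a_buf, b_buf)
evt.wait()

# Profiling timestamps are in nanoseconds.
elapsed_ms = (evt.profile.end - evt.profile.start) * 1e-6
print(f"kernel time: {elapsed_ms:.3f} ms")
```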