clpy Performance improvement of _launch (code block 2: packing CArray)

A subproblem of #153. This issue focuses on improving the code block mentioned here.

Mar 05 '20 07:03 ybsh

I'm trying to pin down the bottleneck lines by placing probes (time.perf_counter()) densely in the code block.
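A minimal sketch of what such dense probing could look like. The `probe` helper, the `totals` dictionary, and the `pack` function are hypothetical stand-ins written for illustration; they are not the actual clpy code or the probes added in the profiling branch.

```python
import time
from collections import defaultdict

# Hypothetical accumulator: total elapsed seconds per label across all calls.
totals = defaultdict(float)

def probe(label, start):
    """Add the time elapsed since `start` to `label`; return a fresh timestamp."""
    now = time.perf_counter()
    totals[label] += now - start
    return now

def pack(shape, strides, itemsize):
    """Stand-in for the profiled block (plain Python, not the real Cython code)."""
    t = time.perf_counter()
    ndim = len(strides)
    t = probe("ndim", t)
    out = [0] * (2 * ndim)
    t = probe("alloc", t)
    for d in range(ndim):
        if strides[d] % itemsize != 0:
            raise ValueError("Stride of dim {0} = {1}, but item size is {2}"
                             .format(d, strides[d], itemsize))
        t = probe("stride check", t)
        out[d] = shape[d]
        t = probe("shape", t)
        out[d + ndim] = strides[d]
        t = probe("strides", t)
    return out

for _ in range(1000):
    result = pack((28, 28), (112, 4), 4)

for label, seconds in totals.items():
    print("{0}: {1:.6f} s".format(label, seconds))
```

Each probe adds the cost of its own `time.perf_counter()` call to the measured line, so with probes this dense the absolute numbers are inflated; the relative ranking of the lines is the more reliable signal.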

Mar 05 '20 07:03 ybsh

I created a new branch 280-for-profile-launch-cb2 off 153-for-profile. I added probes as follows: https://github.com/fixstars/clpy/commit/ba580c301fca

I ran train_mnist.py (100 iterations). Total execution time of this code block (ndarray_time): 0.258964 s

            ndim = len(a.strides) # 0.0254 s
            for d in range(ndim):
                if a.strides[d] % a.itemsize != 0:   # if block:  0.028543 s
                    raise ValueError("Stride of dim {0} = {1},"
                                     " but item size is {2}"
                                     .format(d, a.strides[d], a.itemsize))
                arrayInfo.shape_and_index[d] = a.shape[d]      #  0.019907 s
                arrayInfo.shape_and_index[d + ndim] = a.strides[d] # 0.018830 s
            arrayInfo.offset = a.data.cl_mem_offset() # 0.033951 s
            arrayInfo.size = a.size # 0.011860 s

Mar 05 '20 09:03 ybsh

I ran it 3 more times, and the five per-line execution times do not differ much between runs (the differences are at most about +/-10%).

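Times are in seconds. The rows follow the order of the timed lines in the listing above.

| Line | 1st trial | 2nd trial | 3rd trial |
|---|---|---|---|
| if block (stride check) | 0.030271 | 0.027252 | 0.028729 |
| shape_and_index[d] | 0.021084 | 0.019101 | 0.020559 |
| shape_and_index[d + ndim] | 0.019596 | 0.018031 | 0.019892 |
| arrayInfo.offset | 0.032501 | 0.031892 | 0.032937 |
| arrayInfo.size | 0.012183 | 0.011812 | 0.012556 |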
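As a quick check of the "at most about +/-10%" statement, the spread of each row can be computed as the largest relative deviation from that row's mean. The row labels in this sketch are an assumption: they are matched to the timed lines by their order in the listing, since the table itself carries no labels.

```python
# Trial times in seconds, copied from the table above.
# Row labels are assumed from the order of the timed lines.
trials = {
    "stride check": (0.030271, 0.027252, 0.028729),
    "shape": (0.021084, 0.019101, 0.020559),
    "strides": (0.019596, 0.018031, 0.019892),
    "offset": (0.032501, 0.031892, 0.032937),
    "size": (0.012183, 0.011812, 0.012556),
}

spread = {}
for label, times in trials.items():
    mean = sum(times) / len(times)
    # Largest relative deviation of any trial from the row mean.
    spread[label] = max(abs(t - mean) for t in times) / mean

for label, value in spread.items():
    print("{0}: +/-{1:.1%}".format(label, value))
```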

Mar 05 '20 09:03 ybsh

arrayInfo.offset = a.data.cl_mem_offset() takes the longest. Its definition is here: https://github.com/fixstars/clpy/blob/clpy/clpy/backend/memory.pyx/#L457-L461

Mar 05 '20 09:03 ybsh

I've looked into this issue and found that reducing the overhead of this code block is difficult.

As @ybsh reported, the elapsed time of each line is about the same (11 ~ 34 ms), so there is no hotspot.

I tried some optimizations, but none of them helped:

- for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
  - It may reduce the reference count inc/dec operations on the Python object a
  - performance: no change
- Copy [*a.shape, *a.strides] into an array.array, expanding the length of the array.array instance.
  - performance: became slower
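The first attempt can be sketched in plain Python as follows. `FakeArray`, `pack_indexed`, and `pack_zipped` are hypothetical names for illustration only; the real code is Cython operating on clpy's ndarray and arrayInfo, so the timings printed here say nothing about the actual result in clpy.

```python
import timeit

class FakeArray:
    """Hypothetical stand-in exposing only the attributes the block reads."""
    def __init__(self, shape, itemsize):
        self.shape = tuple(shape)
        self.itemsize = itemsize
        # Build C-contiguous strides in bytes, innermost dimension first.
        strides = []
        step = itemsize
        for n in reversed(self.shape):
            strides.append(step)
            step *= n
        self.strides = tuple(reversed(strides))

def pack_indexed(a, out):
    """Original style: index a.shape and a.strides on every iteration."""
    ndim = len(a.strides)
    for d in range(ndim):
        if a.strides[d] % a.itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        out[d] = a.shape[d]
        out[d + ndim] = a.strides[d]

def pack_zipped(a, out):
    """Attempted variant: read each attribute once and iterate with zip."""
    ndim = len(a.strides)
    itemsize = a.itemsize
    for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
        if stride % itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        out[d] = shape
        out[d + ndim] = stride

a = FakeArray((100, 784), 4)
out_indexed = [0] * 4
out_zipped = [0] * 4
pack_indexed(a, out_indexed)
pack_zipped(a, out_zipped)

for fn in (pack_indexed, pack_zipped):
    seconds = timeit.timeit(lambda: fn(a, [0] * 4), number=100000)
    print("{0}: {1:.4f} s".format(fn.__name__, seconds))
```

Both variants fill the output identically, so any difference is purely in attribute access and iteration overhead.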

I suggest changing the arrayInfo structure (but I have no concrete idea how to do it).

Mar 11 '20 05:03 y1r

How does CuPy handle this? CuPy also stores ndarray in a CArray.

Mar 11 '20 05:03 LWisteria
