Performance improvement of _launch (code block 2: packing CArray)

Open ybsh opened this issue 4 years ago • 6 comments

A subproblem of #153. This issue focuses on improving the code block mentioned here.

ybsh, Mar 05 '20 07:03

I'm trying to pin down the bottleneck lines by placing probes (time.perf_counter()) densely in the code block.
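
A minimal sketch of this kind of dense probing, assuming a hypothetical stand-in function `pack` rather than the actual `_launch` code: each statement is bracketed by `time.perf_counter()` calls and the elapsed time is accumulated under an illustrative probe name.

```python
import time
from collections import defaultdict

# Accumulated elapsed seconds per probe (probe names are illustrative only).
elapsed = defaultdict(float)


def pack(shape, strides, itemsize):
    """Hypothetical stand-in for the code block being profiled."""
    t0 = time.perf_counter()
    ndim = len(strides)
    t1 = time.perf_counter()
    elapsed["ndim"] += t1 - t0

    packed = [0] * (2 * ndim)
    for d in range(ndim):
        t0 = time.perf_counter()
        if strides[d] % itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        t1 = time.perf_counter()
        elapsed["stride check"] += t1 - t0

        packed[d] = shape[d]
        t2 = time.perf_counter()
        elapsed["shape"] += t2 - t1

        packed[d + ndim] = strides[d]
        t3 = time.perf_counter()
        elapsed["strides"] += t3 - t2
    return packed


# Call the function many times so the per-statement totals become measurable.
for _ in range(1000):
    result = pack((4, 3), (24, 8), 8)

for name, seconds in elapsed.items():
    print(f"{name}: {seconds:.6f} s")
```

The probes themselves cost time, which can be comparable to the statements they bracket, so totals gathered this way are probably best compared with each other rather than read as absolute costs.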

ybsh, Mar 05 '20 07:03

I created a new branch 280-for-profile-launch-cb2 off 153-for-profile. I added probes as follows: https://github.com/fixstars/clpy/commit/ba580c301fca

I ran train_mnist.py (100 iterations). Total execution time of this code block (ndarray_time): 0.258964 s

            ndim = len(a.strides)  # 0.0254 s
            for d in range(ndim):
                if a.strides[d] % a.itemsize != 0:  # if block: 0.028543 s
                    raise ValueError("Stride of dim {0} = {1},"
                                     " but item size is {2}"
                                     .format(d, a.strides[d], a.itemsize))
                arrayInfo.shape_and_index[d] = a.shape[d]  # 0.019907 s
                arrayInfo.shape_and_index[d + ndim] = a.strides[d]  # 0.018830 s
            arrayInfo.offset = a.data.cl_mem_offset()  # 0.033951 s
            arrayInfo.size = a.size  # 0.011860 s

ybsh, Mar 05 '20 09:03

I ran it 3 more times, and the five per-line execution times do not differ much between runs (the differences are at most about +/-10%).

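Line (in listing order)      1st trial (s)  2nd trial (s)  3rd trial (s)
if block                     0.030271       0.027252       0.028729
shape_and_index[d]           0.021084       0.019101       0.020559
shape_and_index[d + ndim]    0.019596       0.018031       0.019892
offset                       0.032501       0.031892       0.032937
size                         0.012183       0.011812       0.012556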
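
The "+/-10%" claim can be checked with a short script. This is only a sketch: it assumes the five rows above correspond, in order, to the five annotated lines of the earlier listing (the `if` block, the two `shape_and_index` assignments, `offset`, and `size`), and the dictionary keys are labels chosen here for illustration.

```python
# First-run timings in seconds, taken from the annotated listing.
# Assumption: the trial rows are in the same order as these lines.
baseline = {
    "if block": 0.028543,
    "shape": 0.019907,
    "strides": 0.018830,
    "offset": 0.033951,
    "size": 0.011860,
}

# Timings in seconds from the three additional runs.
trials = {
    "if block": (0.030271, 0.027252, 0.028729),
    "shape": (0.021084, 0.019101, 0.020559),
    "strides": (0.019596, 0.018031, 0.019892),
    "offset": (0.032501, 0.031892, 0.032937),
    "size": (0.012183, 0.011812, 0.012556),
}


def max_relative_deviation(baseline, trials):
    """Return the largest |trial - first run| / first run over all rows."""
    worst = 0.0
    for name, ref in baseline.items():
        for t in trials[name]:
            worst = max(worst, abs(t - ref) / ref)
    return worst


worst = max_relative_deviation(baseline, trials)
print(f"largest deviation from the first run: {worst:.1%}")
```

Under that row mapping, every trial stays within 10% of the first run.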

ybsh, Mar 05 '20 09:03

arrayInfo.offset = a.data.cl_mem_offset() takes the longest. Its definition is here: https://github.com/fixstars/clpy/blob/clpy/clpy/backend/memory.pyx/#L457-L461

ybsh, Mar 05 '20 09:03

I've looked into this issue, and I found that reducing the overhead of this code block is difficult.

As @ybsh reported, the elapsed times of the individual lines are all of the same order (about 11 to 34 ms), so there is no single hotspot.

I tried some optimizations, but none of them helped:

  • for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
    • This may reduce reference-count increments/decrements on the Python object a
    • performance: no change
  • Copy [*a.shape, *a.strides] into an array.array, expanding the length of the array.array instance.
    • performance: became slower
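
For reference, the variants tried above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the Cython code in clpy: `FakeArray` is a hypothetical stand-in for an ndarray, the output is a plain list instead of the `arrayInfo` struct, and the function names are made up here. Timings of this sketch would not carry over to the compiled code.

```python
import array
from collections import namedtuple

# Hypothetical stand-in for an ndarray; only the fields used below.
FakeArray = namedtuple("FakeArray", ["shape", "strides", "itemsize"])


def pack_indexed(a):
    """Index-based loop, mirroring the structure of the original block."""
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d in range(ndim):
        if a.strides[d] % a.itemsize != 0:
            raise ValueError("Stride of dim {0} = {1}, but item size is {2}"
                             .format(d, a.strides[d], a.itemsize))
        out[d] = a.shape[d]
        out[d + ndim] = a.strides[d]
    return out


def pack_zipped(a):
    """Variant 1: read a.shape, a.strides and a.itemsize once, then iterate."""
    itemsize = a.itemsize
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
        if stride % itemsize != 0:
            raise ValueError("Stride of dim {0} = {1}, but item size is {2}"
                             .format(d, stride, itemsize))
        out[d] = shape
        out[d + ndim] = stride
    return out


def pack_array(a):
    """Variant 2: build one array.array of signed 64-bit integers.

    The stride validation is omitted here to keep the sketch short.
    """
    return array.array("q", [*a.shape, *a.strides])


a = FakeArray(shape=(2, 3, 4), strides=(96, 32, 8), itemsize=8)
indexed = pack_indexed(a)
zipped = pack_zipped(a)
packed = pack_array(a)
print(indexed, zipped, packed.tolist())
```

All three produce the same shape-then-strides layout, so any difference between them is purely one of overhead.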

I suggest changing the arrayInfo structure (though I don't yet have a concrete idea of how to do it).

y1r, Mar 11 '20 05:03

How does CuPy handle this? CuPy also stores an ndarray in a CArray.

LWisteria, Mar 11 '20 05:03