Performance improvement of _launch (code block 2: packing CArray)
A subproblem of #153. This issue focuses on improving the code block mentioned here.
I'm trying to pin down the bottleneck lines by placing probes (`time.perf_counter()`) densely in the code block.
I created a new branch `280-for-profile-launch-cb2` off `153-for-profile`.
I added probes as follows:
https://github.com/fixstars/clpy/commit/ba580c301fca
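
For readers who do not want to open the commit, here is a minimal pure-Python sketch of the probing idea. It is not the code from the commit: `probe`, `elapsed`, and `pack` are hypothetical names, and `pack` only imitates the shape/stride packing with a plain list instead of the real `arrayInfo` struct. Each probe adds the time since the previous probe to a per-label accumulator, so the totals can be read off after the run.

```python
import time
from collections import defaultdict

# Hypothetical accumulator: label -> total seconds spent in that span.
elapsed = defaultdict(float)


def probe(label, start):
    """Add the time since `start` to `label` and return a fresh timestamp."""
    now = time.perf_counter()
    elapsed[label] += now - start
    return now


def pack(shape, strides, itemsize):
    """Stand-in for the packing block; a list replaces arrayInfo.shape_and_index."""
    t = time.perf_counter()
    ndim = len(strides)
    t = probe("ndim", t)
    packed = [0] * (2 * ndim)
    t = probe("alloc", t)
    for d in range(ndim):
        if strides[d] % itemsize != 0:
            raise ValueError("Stride of dim {0} = {1}, but item size is {2}"
                             .format(d, strides[d], itemsize))
        t = probe("stride check", t)
        packed[d] = shape[d]
        t = probe("shape", t)
        packed[d + ndim] = strides[d]
        t = probe("strides", t)
    return packed


# Call the block many times so the accumulated totals are measurable.
for _ in range(1000):
    result = pack((32, 784), (3136, 4), 4)

for label, seconds in elapsed.items():
    print("{0:>12}: {1:.6f} s".format(label, seconds))
```

Note that each `probe` call has its own cost, so with probes this dense the per-line numbers include some measurement overhead.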
I ran `train_mnist.py` (100 iterations).
Total execution time of this code block (`ndarray_time`): 0.258964 s
ndim = len(a.strides)  # 0.0254 s
for d in range(ndim):
    if a.strides[d] % a.itemsize != 0:  # if block: 0.028543 s
        raise ValueError("Stride of dim {0} = {1},"
                         " but item size is {2}"
                         .format(d, a.strides[d], a.itemsize))
    arrayInfo.shape_and_index[d] = a.shape[d]  # 0.019907 s
    arrayInfo.shape_and_index[d + ndim] = a.strides[d]  # 0.018830 s
arrayInfo.offset = a.data.cl_mem_offset()  # 0.033951 s
arrayInfo.size = a.size  # 0.011860 s
I ran it 3 more times. The five per-line timings (from the `if` block through `arrayInfo.size`) do not differ much between runs; the differences are at most about ±10%. All values are in seconds.
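| Line | 1st trial | 2nd trial | 3rd trial |
|---|---|---|---|
| `if` block (stride check) | 0.030271 | 0.027252 | 0.028729 |
| `shape_and_index[d] = a.shape[d]` | 0.021084 | 0.019101 | 0.020559 |
| `shape_and_index[d + ndim] = a.strides[d]` | 0.019596 | 0.018031 | 0.019892 |
| `arrayInfo.offset = a.data.cl_mem_offset()` | 0.032501 | 0.031892 | 0.032937 |
| `arrayInfo.size = a.size` | 0.012183 | 0.011812 | 0.012556 |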
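
The "about ±10%" statement can be checked directly from the numbers reported in this issue. The sketch below takes the first measurement of each line plus the three extra trials and computes the largest relative deviation from the per-line mean. The row labels and the helper `max_relative_deviation` are mine, and assigning table rows to code lines assumes the rows follow the order of the timed lines.

```python
# Per-line timings in seconds: [first run, 1st trial, 2nd trial, 3rd trial].
# Row-to-line mapping is assumed from the order of the timed lines.
timings = {
    "if block": [0.028543, 0.030271, 0.027252, 0.028729],
    "shape": [0.019907, 0.021084, 0.019101, 0.020559],
    "strides": [0.018830, 0.019596, 0.018031, 0.019892],
    "offset": [0.033951, 0.032501, 0.031892, 0.032937],
    "size": [0.011860, 0.012183, 0.011812, 0.012556],
}


def max_relative_deviation(samples):
    """Largest |sample - mean| divided by the mean of the samples."""
    mean = sum(samples) / len(samples)
    return max(abs(s - mean) for s in samples) / mean


spread = {label: max_relative_deviation(v) for label, v in timings.items()}

for label, dev in spread.items():
    print("{0:>8}: {1:.1%}".format(label, dev))
```

With these four samples per line, every line stays inside the ±10% band, which agrees with the observation above.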
`arrayInfo.offset = a.data.cl_mem_offset()` takes the longest.
Its definition is here:
https://github.com/fixstars/clpy/blob/clpy/clpy/backend/memory.pyx/#L457-L461
I've looked into this issue and found that reducing the overhead of this code block is difficult.
As @ybsh reported, the elapsed time of each line is about the same (11 ~ 34 ms), so there is no single hotspot.
I tried some optimizations, but none of them helped:
- `for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):`
  - It may reduce the ref count increments/decrements on the Python object `a`
  - performance: no change
- Copy `[*a.shape, *a.strides]` into an `array.array`, extending the length of the `array.array` instance.
  - performance: became slower
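
To make the two attempts concrete, here is a pure-Python sketch of the original indexed loop next to both variants. It only shows that all three produce the same packed layout (shapes first, then strides). It says nothing about Cython-level performance, which is what was measured above. `FakeArray` and the `pack_*` functions are hypothetical stand-ins, not clpy code, and the `"q"` typecode is an arbitrary choice for the illustration.

```python
from array import array
from collections import namedtuple

# Hypothetical stand-in for an ndarray: only the attributes the block reads.
FakeArray = namedtuple("FakeArray", ["shape", "strides", "itemsize"])


def pack_indexed(a):
    """Original style: index a.shape / a.strides on every iteration."""
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d in range(ndim):
        if a.strides[d] % a.itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        out[d] = a.shape[d]
        out[d + ndim] = a.strides[d]
    return out


def pack_zip(a):
    """Variant 1: fetch a.shape and a.strides once and iterate with zip."""
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
        if stride % a.itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        out[d] = shape
        out[d + ndim] = stride
    return out


def pack_array(a):
    """Variant 2: build the packed buffer in one step with array.array."""
    if any(stride % a.itemsize != 0 for stride in a.strides):
        raise ValueError("stride is not a multiple of the item size")
    return array("q", [*a.shape, *a.strides])


a = FakeArray(shape=(32, 784), strides=(3136, 4), itemsize=4)
print(pack_indexed(a), pack_zip(a), list(pack_array(a)))
```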
I suggest changing the `arrayInfo` structure (but I have no idea how to go about it).
How does CuPy handle this? CuPy also stores the ndarray in a CArray.