arrayfire-python
Poor slicing performance compared to NumPy
Reported by @floopcz over here: https://github.com/arrayfire/arrayfire/issues/1428
ArrayFire slicing seems to suffer from a performance issue. Consider the following Python code, which:
- calculates the dot product of two matrices, first using NumPy, then ArrayFire
- calculates each column/row of the dot product separately by slicing a single column/row from one of the matrices
#!/usr/bin/env python3
from time import time
import arrayfire as af
import numpy as np

af.set_backend('cpu')
af.info()

iters = 1000
n = 512

af_A = af.randu(n, n)
af_B = af.randu(n, n)
np_A = np.random.rand(n, n).astype(np.float32)
np_B = np.random.rand(n, n).astype(np.float32)

start = time()
for t in range(iters):
    np_C = np.dot(np_A, np_B)
print('numpy - dot: {}'.format(time() - start))

af.sync()
start = time()
for t in range(iters):
    af_C = af.matmul(af_A, af_B)
af.sync()
print('arrayfire - matmul: {}'.format(time() - start))

start = time()
for t in range(iters):
    for i in range(np_B.shape[1]):
        np_C = np.dot(np_A, np_B[:, i])
print('numpy - sliced dot - column major: {}'.format(time() - start))

af.sync()
start = time()
for t in range(iters):
    for i in range(af_B.shape[1]):
        af_C = af.matmul(af_A, af_B[:, i])
af.sync()
print('arrayfire - sliced matmul - column major: {}'.format(time() - start))

start = time()
for t in range(iters):
    for i in range(np_B.shape[0]):
        np_C = np.dot(np_B[i, :], np_A)
print('numpy - sliced dot - row major: {}'.format(time() - start))

af.sync()
start = time()
for t in range(iters):
    for i in range(af_B.shape[0]):
        af_C = af.matmul(af_B[i, :], af_A)
af.sync()
print('arrayfire - sliced matmul - row major: {}'.format(time() - start))
The results are as follows:
ArrayFire v3.3.2 (CPU, 64-bit Linux, build f65dd97)
[0] Unknown: Unknown, 15880 MB, Max threads(1)
numpy - dot: 1.3848536014556885
arrayfire - matmul: 1.325775146484375
numpy - sliced dot - column major: 7.156768798828125
arrayfire - sliced matmul - column major: 38.87605834007263
numpy - sliced dot - row major: 7.6784679889678955
arrayfire - sliced matmul - row major: 41.27544379234314
The results suggest that with slicing, ArrayFire's performance is significantly degraded compared to NumPy. I have observed similarly distributed results with the GPU backend as well. Both NumPy and ArrayFire are linked against Intel MKL.
Am I doing something "illegal" or is it an inefficiency of the library? Thanks.
@floopcz Thanks for bringing this to our attention. It looks like the issue might be with the af.Index class. I will look into it.
@pavanky Thank you for such a swift response! I am sorry for reporting in the wrong repository. I have performed a similar experiment in C++ instead of Python, and the core library indeed seems to be unaffected:
ArrayFire v3.3.2 (CPU, 64-bit Linux, build f65dd97)
[0] Unknown: Unknown, 15880 MB, Max threads(1)
arrayfire - matmul: 1.27784
arrayfire - sliced matmul - column major: 8.0823
arrayfire - sliced matmul - row major: 5.24005
I have also profiled the NumPy and the ArrayFire parts of the script separately, in case it is of any help:
numpy - dot: 0.7875990867614746
numpy - sliced dot - column major: 4.465392827987671
numpy - sliced dot - row major: 7.043829441070557
Performance counter stats for 'python3 -u ./bench_blas_np.py':
49300.544661 task-clock (msec) # 3.978 CPUs utilized
204 context-switches # 0.004 K/sec
23 cpu-migrations # 0.000 K/sec
6,569 page-faults # 0.133 K/sec
182,023,933,777 cycles # 3.692 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
135,952,155,036 instructions # 0.75 insns per cycle
9,960,177,375 branches # 202.030 M/sec
54,945,481 branch-misses # 0.55% of all branches
40,006,920,413 cache-references # 811.490 M/sec
3,669,375 cache-misses # 0.009 % of all cache refs
6,569 faults # 0.133 K/sec
23 migrations # 0.000 K/sec
12.394036037 seconds time elapsed
ArrayFire v3.3.2 (CPU, 64-bit Linux, build f65dd97)
[0] Unknown: Unknown, 15880 MB, Max threads(1)
arrayfire - matmul: 0.6884534358978271
arrayfire - sliced matmul - column major: 42.06879711151123
arrayfire - sliced matmul - row major: 41.07923150062561
Performance counter stats for 'python3 -u ./bench_blas_af.py':
352124.665215 task-clock (msec) # 4.192 CPUs utilized
1,072,306 context-switches # 0.003 M/sec
24 cpu-migrations # 0.000 K/sec
128,239 page-faults # 0.364 K/sec
1,294,553,131,202 cycles # 3.676 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
536,611,177,149 instructions # 0.41 insns per cycle
98,101,407,999 branches # 278.599 M/sec
973,886,254 branch-misses # 0.99% of all branches
78,021,194,595 cache-references # 221.573 M/sec
54,132,824 cache-misses # 0.069 % of all cache refs
128,239 faults # 0.364 K/sec
24 migrations # 0.000 K/sec
83.992813939 seconds time elapsed
One more thing to consider. When I run the benchmark written in C++ on my GPU, the slicing performance is also very poor. C++ source: https://gist.github.com/FloopCZ/33d764d5f1c45505e2d8716d519e5e9c
ArrayFire v3.3.2 (CUDA, 64-bit Linux, build f65dd97)
Platform: CUDA Toolkit 7.5, Driver: 352.93
[0] Quadro 2000M, 2048 MB, CUDA Compute 2.1
arrayfire - matmul: 1.43566
arrayfire - sliced matmul - column major: 29.3293
arrayfire - sliced matmul - row major: 124.245
However, my GPU is kind of outdated, so this might be expected behavior.
@FloopCZ Slicing along the first dimension is going to be faster in ArrayFire because ArrayFire is column major. The column-major time comes out to about 0.07 ms per iteration, which is probably as low as you can go on a GPU. The time will scale better if you use a larger size along the first dimension.
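To isolate the layout effect, here is a minimal sketch (timings are machine-dependent; it uses only the calls already shown above) that measures the cost of the slicing alone, without the matmul:

#!/usr/bin/env python3
# Sketch: time column slices vs. row slices in isolation.
# In column-major storage, B[:, i] is a contiguous block of n floats,
# while B[i, :] gathers n elements that are n floats apart.
from time import time
import arrayfire as af

af.set_backend('cpu')
n = 512
B = af.randu(n, n)

af.sync()
start = time()
for i in range(n):
    col = B[:, i]  # contiguous column slice
af.sync()
print('column slices: {}'.format(time() - start))

start = time()
for i in range(n):
    row = B[i, :]  # strided row slice
af.sync()
print('row slices: {}'.format(time() - start))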
Thank you, I will prefer to keep the last dimensions fixed when slicing whenever possible.
Just a little more analysis. Consider the following code, which again calculates the dot product of two matrices column by column:
#!/usr/bin/env python3
from time import time
import arrayfire as af

af.set_backend('cpu')
af.info()

iters = 1000
n = 512

af_A = af.randu(n, n)
af_B = af.randu(n, n)

af.sync()
start = time()
for t in range(iters):
    for i in range(af_B.shape[1]):
        af_C = af.matmul(af_A, af_B[:, i])
af.sync()
print('arrayfire - sliced matmul - column major: {}'.format(time() - start))
Let us profile its function calls:
python -m cProfile -s tottime bench_blas_af_slice.py
The results are as follows:
arrayfire - sliced matmul - column major: 105.72026777267456
47183954 function calls (47181436 primitive calls) in 106.075 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1024002 9.978 0.000 23.331 0.000 array.py:55(_create_empty_array)
512000 8.989 0.000 27.807 0.000 blas.py:17(matmul)
1536000 8.860 0.000 21.603 0.000 index.py:189(__init__)
1023999 7.905 0.000 8.322 0.000 array.py:478(__del__)
512000 7.581 0.000 63.616 0.000 array.py:1005(__getitem__)
1024004 6.343 0.000 6.343 0.000 util.py:13(dim4)
1024002 6.240 0.000 31.222 0.000 array.py:382(__init__)
...
The largest amount of time was spent in the _create_empty_array function, which was called 1024002 times (which is correct: 2 initial random arrays + 512 columns * 1000 iterations * 2, once for the matmul result and once for the slice result). However, each of these calls to _create_empty_array only creates a handle to an array of dimensions [0 1 1 1] (the array is filled later by the respective function), so each single call should take almost no time at all. The af.Index initializer, the array destructor, and even dim4() also took a considerable amount of time, even though they do almost nothing.
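As a rough cross-check of the per-handle cost, one can create empty handles in a tight loop. This is a sketch only; it assumes that af.Array() with no arguments creates an empty handle, mirroring what _create_empty_array does internally:

#!/usr/bin/env python3
# Sketch: estimate the Python-side cost of creating empty handles.
# Assumes af.Array() with no arguments creates an empty [0 1 1 1]
# handle, as _create_empty_array does internally.
from time import time
import arrayfire as af

start = time()
for k in range(1024000):
    a = af.Array()  # empty handle, no data allocated
print('empty array loop: {}'.format(time() - start))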
So let us try calling af.util.dim4() 1024000 times:
#!/usr/bin/env python3
from time import time
import arrayfire as af

start = time()
for k in range(1024000):
    af.util.dim4()
print('dim4 loop: {}'.format(time() - start))
The results are:
dim4 loop: 4.396022319793701
What I wanted to demonstrate is that the degraded slicing performance is perhaps not caused by an inefficiency of the ArrayFire library itself. Maybe it is the overhead of the Python layer.
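As a practical aside (not from the original discussion): since column i of the full product equals the sliced product, batching the work into a single call sidesteps the per-slice Python overhead entirely:

# Sketch: one batched call amortizes the per-call Python overhead.
# Column i of af_C below equals af.matmul(af_A, af_B[:, i]).
af_C = af.matmul(af_A, af_B)  # all 512 columns at once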
@FloopCZ Thanks for digging into this! There are a few helper functions that I implemented in Python before they were implemented in the upstream library. Maybe some of this overhead can be reduced by using the upstream versions. I also need to look into improving the performance of the Python code itself if needed.
This one might be related: arrayfire/arrayfire#1437.
@FloopCZ can you test after pulling in https://github.com/arrayfire/arrayfire/pull/1438 ?
Unfortunately, even after arrayfire/arrayfire#1440, the results stay the same.