dpnp Game of life example: dpnp on CPU is 4 times slower than NumPy

Results for Game of life example (running on a laptop with 11th Gen processor and Iris Xe graphics):

example	numpy	dpnp CPU	dpnp GPU	size
game of life	1 s	4.8 s	1.8 s	8192 x 8192

This shows that dpnp execution time on CPU is more than 4 times that of NumPy.

May 13 '23 14:05 antonwolfy

The numbers with dpnp=0.12.0:

example	numpy	dpnp CPU	dpnp GPU	size
game of life	1.03 s	2.16 s	0.96 s	8192 x 8192 x 10

dpnp on CPU is now about 2 times faster than before, but still short of the target.

Jun 19 '23 11:06 antonwolfy

Shouldn't it be closed?

Jul 19 '23 20:07 AlexanderKalistratov

@antonwolfy hi, I'm not a contributor, but I hope my comment helps.

You can see the dpnp performance by following the script below. In my case (Xeon Skylake), I was able to see a significant performance difference.

docker run -it --cpus=4 --name=intelpython-ksr intelpython/intelpython3_full:2023.1.0-0 bash

# check ENV is valid in your guest OS
(base) root@xxxxxx:/# echo $LD_LIBRARY_PATH
/opt/conda/lib/libfabric:

(base) root@xxxxxxx:/# echo $OCL_ICD_FILENAMES $ OCL_ICD_FILENAMES_RESET
libintelocl.so $ OCL_ICD_FILENAMES_RESET

(base) root@xxxxxx:/# apt update && apt install vim -y
(base) root@xxxxxx:/# git clone https://github.com/IntelPython/dpnp.git
(base) root@xxxxxx:/# pip install pytest pytest-benchmark
(base) root@xxxxxx:/# cd dpnp
(base) root@xxxxxx:/# vi benchmarks/pytest_benchmark/test_random.py
# change the test array size: NNUMBERS = 2**26 -> 2**20 (2**26 is too heavy)

# run benchmark
(base) root@xxxxxx:/# pytest benchmarks --benchmark-json=results.json --benchmark-warmup-iterations=1000 --benchmark-sort=name
============================================================================================================= test session starts =============================================================================================================
platform linux -- Python 3.10.8, pytest-7.4.2, pluggy-1.0.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=1000)
rootdir: /dpnp
configfile: setup.cfg
plugins: benchmark-4.0.0
collected 10 items

benchmarks/pytest_benchmark/test_random.py ..........                                                                                                                                                                                   [100%]
...

1. Benchmark result (array size = 2**20)

dpnp is faster than NumPy in every test.

-------------------------------------------------------------------------------------- benchmark: 10 tests --------------------------------------------------------------------------------------
Name (time in ms)                Min                 Max                Mean             StdDev              Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_beta[dpnp]              21.8955 (5.17)      84.6292 (16.42)     24.2552 (5.47)     11.4057 (742.46)    22.1041 (5.00)     0.4180 (89.47)         1;2   41.2283 (0.18)         30           4
test_beta[numpy]            144.1274 (34.00)    145.8178 (28.29)    144.3936 (32.54)     0.3846 (25.04)    144.2299 (32.64)    0.3005 (64.34)         6;3    6.9255 (0.03)         30           4
test_exponential[dpnp]        7.5882 (1.79)       8.6807 (1.68)       7.9727 (1.80)      0.2289 (14.90)      8.0083 (1.81)     0.3177 (68.00)         7;1  125.4287 (0.56)         30           4
test_exponential[numpy]      27.3414 (6.45)      27.4286 (5.32)      27.3496 (6.16)      0.0154 (1.0)       27.3465 (6.19)     0.0057 (1.22)          1;1   36.5636 (0.16)         30           4
test_gamma[dpnp]             23.7672 (5.61)      24.7119 (4.79)      24.1695 (5.45)      0.2659 (17.31)     24.1067 (5.46)     0.4515 (96.65)        13;0   41.3745 (0.18)         30           4
test_gamma[numpy]            72.7834 (17.17)     73.3010 (14.22)     72.8419 (16.41)     0.1204 (7.84)      72.8039 (16.48)    0.0226 (4.83)          3;3   13.7284 (0.06)         30           4
test_normal[dpnp]             9.3821 (2.21)      10.6157 (2.06)       9.6447 (2.17)      0.2335 (15.20)      9.5778 (2.17)     0.2116 (45.29)         3;1  103.6835 (0.46)         30           4
test_normal[numpy]           41.1999 (9.72)      41.4049 (8.03)      41.2479 (9.29)      0.0379 (2.46)      41.2402 (9.33)     0.0175 (3.75)          3;3   24.2437 (0.11)         30           4
test_uniform[dpnp]            4.2386 (1.0)        5.1549 (1.0)        4.4380 (1.0)       0.1406 (9.15)       4.4188 (1.0)      0.0209 (4.48)          2;3  225.3261 (1.0)          30           4
test_uniform[numpy]          14.0905 (3.32)      14.2857 (2.77)      14.1043 (3.18)      0.0344 (2.24)      14.0981 (3.19)     0.0047 (1.0)           1;1   70.9004 (0.31)         30           4
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

But if the array is not large enough (NNUMBERS = 2**13 = 8192):

2. Benchmark result (array size = 2**13)

In this case dpnp is slower than NumPy on mean time in every test, although its median is still lower for beta and gamma.


---------------------------------------------------------------------------------------------- benchmark: 10 tests -----------------------------------------------------------------------------------------------
Name (time in us)                  Min                    Max                  Mean                 StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_beta[dpnp]               420.5331 (3.74)     61,992.8390 (277.19)   3,605.1025 (16.71)    11,695.6739 (>1000.0)    550.2328 (4.84)     255.2830 (433.72)        2;3    277.3846 (0.06)         30           4
test_beta[numpy]            1,123.2123 (9.98)      1,146.9647 (5.13)     1,126.9274 (5.22)          4.1275 (2.46)     1,126.0584 (9.90)       1.8142 (3.08)          2;2    887.3686 (0.19)         30           4
test_exponential[dpnp]        274.4943 (2.44)     20,916.4843 (93.52)    1,427.4976 (6.62)      4,114.2836 (>1000.0)    313.9023 (2.76)     179.3243 (304.66)        2;3    700.5266 (0.15)         30           4
test_exponential[numpy]       214.0552 (1.90)        223.6478 (1.0)        215.7025 (1.0)           1.6761 (1.0)        215.3441 (1.89)       0.5886 (1.0)           2;5  4,636.0148 (1.0)          30           4
test_gamma[dpnp]              437.3230 (3.89)     20,278.0776 (90.67)    2,266.6973 (10.51)     5,464.0116 (>1000.0)    462.3923 (4.06)      15.6760 (26.63)         3;7    441.1705 (0.10)         30           4
test_gamma[numpy]             566.4900 (5.03)        578.1837 (2.59)       569.7493 (2.64)          2.1289 (1.27)       569.5820 (5.01)       1.5460 (2.63)          8;1  1,755.1581 (0.38)         30           4
test_normal[dpnp]             324.0071 (2.88)     21,615.4084 (96.65)    2,640.5660 (12.24)     6,222.1435 (>1000.0)    353.8001 (3.11)     202.7377 (344.44)        3;5    378.7067 (0.08)         30           4
test_normal[numpy]            322.1631 (2.86)        340.1972 (1.52)       324.8747 (1.51)          3.9413 (2.35)       323.8600 (2.85)       1.5870 (2.70)          3;3  3,078.1094 (0.66)         30           4
test_uniform[dpnp]            299.9641 (2.67)     20,060.7888 (89.70)    1,449.2592 (6.72)      3,982.3085 (>1000.0)    486.8992 (4.28)      38.1283 (64.78)         2;7    690.0077 (0.15)         30           4
test_uniform[numpy]           112.5187 (1.0)      17,232.6937 (77.05)      688.6497 (3.19)      3,124.6789 (>1000.0)    113.7946 (1.0)       15.0241 (25.53)         1;1  1,452.1171 (0.31)         30           4
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
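In the table above, the dpnp rows have Max values far above their Median, so a few slow rounds dominate the Mean. A minimal sketch of that effect, using made-up timings (the numbers below are hypothetical and not taken from the benchmark):

```python
import statistics

# Hypothetical timings in microseconds: 29 steady rounds plus one slow round.
# These values are illustrative only, not measurements from the run above.
samples_us = [300.0] * 29 + [20000.0]

mean_us = statistics.mean(samples_us)      # pulled up by the single slow round
median_us = statistics.median(samples_us)  # unaffected by the single slow round

print(f"mean={mean_us:.1f} us, median={median_us:.1f} us")
```

Comparing both columns, rather than the mean alone, gives a fairer picture for small arrays.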

~~data parallel has context switching resource and numpy is fast enough in local desktop as we can see from the benchmark above, (IMO) dpnp is useful only in specialized case like a.. large amounts of data batch process (ex: Server which has lots of CPU core )~~

As for the Game of life performance:

It depends on which implementation you used, but in most cases (a Game of life implementation written with NumPy) there does not seem to be any performance gain from dpnp's parallelization. (IMO)

The main operations in the Game of life implementation are slicing and sum, which are not operations that benefit from internal parallelism.
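To make "slicing and sum" concrete, here is a minimal NumPy-only sketch of one Game of life step. It is an assumption about what such an implementation typically looks like, not the code from the dpnp example; the function body, the dead-border handling, and the blinker test board are all illustrative.

```python
import numpy as np

def update(board):
    """One Game of life step on a 2-D 0/1 array (cells outside the board are dead)."""
    rows, cols = board.shape
    # Pad with a ring of dead cells so every cell has eight neighbours.
    padded = np.pad(board, 1, mode="constant")
    # Neighbour count = sum of the eight shifted slices of the padded board.
    neighbours = sum(
        padded[1 + dy : 1 + dy + rows, 1 + dx : 1 + dx + cols]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Alive next step: exactly 3 neighbours, or alive now with exactly 2.
    alive = (neighbours == 3) | ((board == 1) & (neighbours == 2))
    return alive.astype(board.dtype)

# Illustrative check: a horizontal "blinker" flips to vertical and back.
blinker = np.zeros((5, 5), dtype=np.uint8)
blinker[2, 1:4] = 1
after_one = update(blinker)
after_two = update(after_one)
```

Each step here is a handful of whole-array slices, additions, and comparisons, which is the kind of workload the comment refers to.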

If you want higher performance in Game of life, you should probably parallelize the code at a higher level rather than rely on dpnp (for example, execute def update(board) for each cell in parallel).

In other words, Game of life is not a good benchmark for measuring the performance of dpnp.

thanks

Sep 08 '23 05:09 KimSoungRyoul
