Update oneMKL `gemm_batch` call inside `dpnp.matmul` and add `column_major` version of `gemm`
In this PR, `oneapi::mkl::blas::column_major::gemm` and `oneapi::mkl::blas::column_major::gemm_batch` are added and used in `dpnp.matmul` when the base of the input arrays is F-contiguous.
In addition, `gemm_batch` is updated to improve performance in some cases by keeping the base of intermediate arrays the same as that of the input arrays (working with array views instead of making copies).
Also, for large values of `batch_size` in `gemm_batch`, the batch is split into smaller chunks to avoid a oneMKL error; a sketch of the idea is shown below.
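A minimal sketch of the batch-splitting idea only; the chunk limit and helper name below are illustrative, not the actual threshold or code used inside the dpnp binding:

```python
# Hypothetical sketch: each oneMKL gemm_batch call sees at most MAX_CHUNK
# matrices. The limit below is illustrative and is not the threshold used
# inside dpnp.
MAX_CHUNK = 4096  # hypothetical per-call batch limit


def chunk_ranges(batch_size, chunk=MAX_CHUNK):
    """Yield (start, size) pairs that cover the full batch."""
    for start in range(0, batch_size, chunk):
        yield start, min(chunk, batch_size - start)


# e.g. a batch of 10_000 matmuls would be dispatched as three gemm_batch calls
print(list(chunk_ranges(10_000)))
# [(0, 4096), (4096, 4096), (8192, 1808)]
```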
- [x] Have you provided a meaningful PR description?
- [x] Have you added a test, reproducer or referred to issue with a reproducer?
- [x] Have you tested your changes locally for CPU and GPU devices?
- [x] Have you made sure that new changes do not introduce compiler warnings?
- [x] Have you checked performance impact of proposed changes?
- [ ] If this PR is a work in progress, are you filing the PR as a draft?
View rendered docs @ https://intelpython.github.io/dpnp/pull//index.html
Timing for calling `gemm` in different scenarios
- dpnp timing on Intel Xeon is slower than NumPy, but this is independent of this PR and should be addressed separately.
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 1.74 s ± 443 ms | 118 ms ± 22.5 ms | 110 ms ± 6.12 ms | 1.71 s ± 439 ms | 1.65 s ± 438 ms |
| 2 | 2 s ± 81.7 ms | 123 ms ± 9.9 ms | 145 ms ± 26.3 ms | 1.88 s ± 427 ms | 1.84 s ± 359 ms |
| 3 | 1.82 s ± 472 ms | 125 ms ± 7.88 ms | 149 ms ± 14.2 ms | 2.01 s ± 106 ms | 1.76 s ± 372 ms |
| 4 | 2.31 s ± 363 ms | 135 ms ± 28.6 ms | 121 ms ± 18 ms | 1.99 s ± 393 ms | 1.79 s ± 375 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 81.8 ms ± 1.32 ms | 8.35 ms ± 20.6 µs | 8.47 ms ± 73.1 µs | 159 ms ± 7.38 ms | 155 ms ± 6.39 ms |
| 2 | 81.4 ms ± 1.14 ms | 8.8 ms ± 62.2 µs | 8.73 ms ± 43.2 µs | 157 ms ± 8.78 ms | 158 ms ± 5.76 ms |
| 3 | 83.4 ms ± 1.13 ms | 8.42 ms ± 67.1 µs | 8.46 ms ± 64 µs | 156 ms ± 3.87 ms | 157 ms ± 5.03 ms |
| 4 | 83.4 ms ± 1.04 ms | 8.64 ms ± 2.76 µs | 8.45 ms ± 78.6 µs | 153 ms ± 8.33 ms | 152 ms ± 3.45 ms |
```python
import dpnp

size = 4096
device = "gpu"

a = dpnp.ones((size, size), order="C", device=device)
b = dpnp.ones((size, size), order="C", device=device)
%timeit dpnp.matmul(a, b)  # Case 1: both C-ordered

a = dpnp.ones((size, size), order="C", device=device)
b = dpnp.ones((size, size), order="F", device=device)
%timeit dpnp.matmul(a, b)  # Case 2: a is C, b is F

a = dpnp.ones((size, size), order="F", device=device)
b = dpnp.ones((size, size), order="C", device=device)
%timeit dpnp.matmul(a, b)  # Case 3: a is F, b is C

a = dpnp.ones((size, size), order="F", device=device)
b = dpnp.ones((size, size), order="F", device=device)
%timeit dpnp.matmul(a, b)  # Case 4: both F-ordered
```
Timing for calling `gemm_batch` in different scenarios
- dpnp timing for cases where one input is C-contiguous and the other is F-contiguous is not better than the old version on PVC.
- Regardless of the new changes, CPU timing is mostly better than GPU (Iris Xe) timing. This is an issue with oneMKL itself.
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 3.66 s ± 426 ms | 1.91 s ± 234 ms | 1.89 s ± 224 ms | 903 ms ± 109 ms | 1.02 s ± 75.6 ms |
| 2 | 3.31 s ± 392 ms | 5.65 s ± 3.89 s | 1.74 s ± 268 ms | 2.19 s ± 129 ms | 899 ms ± 336 ms |
| 3 | 3.4 s ± 481 ms | 2.81 s ± 341 ms | 443 ms ± 36.9 ms | 1.91 s ± 45.1 ms | 825 ms ± 296 ms |
| 4 | 3.71 s ± 588 ms | 4.71 s ± 2.27 s | 1.97 s ± 264 ms | 2.94 s ± 356 ms | 871 ms ± 193 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 3.67 s ± 2.25 ms | 57.1 ms ± 862 µs | 57.6 ms ± 579 µs | 586 ms ± 47.6 ms | 497 ms ± 30.9 ms |
| 2 | 3.78 s ± 1.69 ms | 68.6 ms ± 471 µs | 116 ms ± 588 µs | 850 ms ± 36.1 ms | 488 ms ± 16.1 ms |
| 3 | 3.56 s ± 329 µs | 68.6 ms ± 477 µs | 75.3 ms ± 566 µs | 869 ms ± 42.7 ms | 510 ms ± 30.2 ms |
| 4 | 3.63 s ± 1.42 ms | 77 ms ± 38.8 µs | 58.2 ms ± 589 µs | 1.06 s ± 28.1 ms | 472 ms ± 25.2 ms |
```python
import dpnp

size = 4095
device = "gpu"

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
%timeit dpnp.matmul(a, b)  # Case 1: both C

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
b = b.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 2: a is C, b is F

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
a = a.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 3: a is F, b is C

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
a = a.transpose(0, 1, 3, 2)
b = b.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 4: both F
```
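For reference, `transpose(0, 1, 3, 2)` above returns a view rather than a copy, so the 4x4 inner matrices become F-ordered over the same memory; a quick check using the standard dpnp contiguity flags:

```python
import dpnp

a = dpnp.ones((2, 2, 4, 4))  # C-contiguous base
b = a.transpose(0, 1, 3, 2)  # view: swaps the last two axes, no copy
# b as a whole is neither C- nor F-contiguous, but each 4x4 inner matrix
# is now laid out in column-major (F) order
print(a.flags.c_contiguous)                        # True
print(b.flags.c_contiguous, b.flags.f_contiguous)  # False False
```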
Timing for calling `gemm_batch` in different scenarios with strided views and the `out` keyword
- Array size on Iris Xe and Intel Core: (2000, 2000, 4, 4)
- Array size on PVC and Intel Xeon: (4095, 4095, 4, 4)
- Timing with the new changes is mostly comparable or better (on PVC some cases become slightly slower).
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 609 ms ± 21.3 ms | 377 ms ± 9.01 ms | 373 ms ± 2.43 ms | 248 ms ± 24.2 ms | 262 ms ± 25.6 ms |
| 2 | 934 ms ± 87.7 ms | 325 ms ± 1.43 ms | 326 ms ± 1.26 ms | 197 ms ± 24.9 ms | 174 ms ± 10.1 ms |
| 3 | 219 ms ± 46.3 ms | 126 ms ± 2.49 ms | 132 ms ± 2.37 ms | 259 ms ± 43.1 ms | 258 ms ± 13.4 ms |
| 4 | 201 ms ± 33.5 ms | 179 ms ± 36.8 ms | 144 ms ± 6.26 ms | 292 ms ± 21.6 ms | 318 ms ± 50.5 ms |
| 5 | 293 ms ± 10.4 ms | 258 ms ± 17.9 ms | 198 ms ± 20.1 ms | 264 ms ± 17.3 ms | 174 ms ± 9.41 ms |
| 6 | 281 ms ± 13.5 ms | 287 ms ± 7.02 ms | 193 ms ± 20.3 ms | 329 ms ± 36.9 ms | 144 ms ± 15.5 ms |
| 7 | 611 ms ± 60.3 ms | 550 ms ± 26.5 ms | 373 ms ± 3.17 ms | 547 ms ± 150 ms | 260 ms ± 12 ms |
| 8 | 833 ms ± 154 ms | 626 ms ± 87.5 ms | 465 ms ± 48.9 ms | 625 ms ± 40.5 ms | 466 ms ± 44.6 ms |
| 9 | 221 ms ± 27.1 ms | 127 ms ± 1.85 ms | 131 ms ± 2.14 ms | 258 ms ± 10 ms | 269 ms ± 28.1 ms |
| 10 | 185 ms ± 31.4 ms | 171 ms ± 32.4 ms | 142 ms ± 4.01 ms | 302 ms ± 19.2 ms | 238 ms ± 44.2 ms |
| 11 | 293 ms ± 8.25 ms | 265 ms ± 23.7 ms | 189 ms ± 1.24 ms | 198 ms ± 13.7 ms | 140 ms ± 25.3 ms |
| 12 | 284 ms ± 14.7 ms | 302 ms ± 32.7 ms | 207 ms ± 5.14 ms | 397 ms ± 15.8 ms | 315 ms ± 17.1 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 3.67 s ± 1.61 ms | 57 ms ± 898 µs | 57.7 ms ± 602 µs | 610 ms ± 36 ms | 503 ms ± 41.8 ms |
| 2 | 3.41 s ± 322 µs | 54.7 ms ± 8.02 µs | 55.9 ms ± 37.8 µs | 370 ms ± 3.84 ms | 275 ms ± 36.9 ms |
| 3 | 935 ms ± 24.8 ms | 20.2 ms ± 147 µs | 22.9 ms ± 141 µs | 328 ms ± 34 ms | 257 ms ± 28.1 ms |
| 4 | 881 ms ± 59.7 ms | 23.2 ms ± 133 µs | 28.9 ms ± 157 µs | 333 ms ± 33.6 ms | 326 ms ± 26.3 ms |
| 5 | 1.87 s ± 253 µs | 39.2 ms ± 469 µs | 42.7 ms ± 430 µs | 577 ms ± 27.5 ms | 471 ms ± 14.1 ms |
| 6 | 1.73 s ± 183 µs | 44.4 ms ± 526 µs | 51.9 ms ± 143 µs | 624 ms ± 31.9 ms | 580 ms ± 13.3 ms |
| 7 | 3.65 s ± 1.94 ms | 72.7 ms ± 213 µs | 58.1 ms ± 630 µs | 1.11 s ± 36.6 ms | 494 ms ± 29.9 ms |
| 8 | 3.38 s ± 176 µs | 74.1 ms ± 322 µs | 69.9 ms ± 84.9 µs | 914 ms ± 35 ms | 531 ms ± 11.5 ms |
| 9 | 922 ms ± 22.4 ms | 21.5 ms ± 76.1 µs | 23.3 ms ± 61.7 µs | 335 ms ± 31.3 ms | 319 ms ± 30.6 ms |
| 10 | 850 ms ± 129 µs | 24.3 ms ± 272 µs | 29.2 ms ± 128 µs | 362 ms ± 33.8 ms | 354 ms ± 10.8 ms |
| 11 | 1.83 s ± 623 µs | 42 ms ± 439 µs | 42.9 ms ± 401 µs | 606 ms ± 27.3 ms | 534 ms ± 13 ms |
| 12 | 1.7 s ± 200 µs | 47 ms ± 468 µs | 52.2 ms ± 103 µs | 637 ms ± 28.3 ms | 625 ms ± 13.6 ms |
```python
import dpnp

# the base is C-contiguous for Cases 1-6
size = 4095
a = dpnp.ones((size, size, 4, 4), device="gpu")
OUT = dpnp.empty((size, size, 4, 4), device="gpu")
%timeit dpnp.matmul(a, a)  # Case 1
out = OUT
%timeit dpnp.matmul(a, a, out=out)  # Case 2
b = a[::2, ::2]  # strided view over both batch axes
%timeit dpnp.matmul(b, b)  # Case 3
out = OUT[::2, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 4
b = a[:, ::2]  # strided view over one batch axis
%timeit dpnp.matmul(b, b)  # Case 5
out = OUT[:, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 6

# the base is F-contiguous for Cases 7-12
a = a.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, a)  # Case 7
out = OUT
%timeit dpnp.matmul(a, a, out=out)  # Case 8
b = a[::2, ::2]
%timeit dpnp.matmul(b, b)  # Case 9
out = OUT[::2, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 10
b = a[:, ::2]
%timeit dpnp.matmul(b, b)  # Case 11
out = OUT[:, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 12
```
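As a side note on the sliced cases above, slices such as `a[::2, ::2]` are views over the original allocation; the updated `gemm_batch` path works with such views directly instead of materializing copies. A quick illustration using standard dpnp slicing semantics:

```python
import dpnp

a = dpnp.ones((8, 8, 4, 4))
b = a[::2, ::2]  # strided view: no data is copied
print(b.shape)               # (4, 4, 4, 4)
print(b.flags.c_contiguous)  # False: b itself is strided...
print(a.flags.c_contiguous)  # True: ...but the underlying allocation is C-contiguous
```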