MDRangePolicy: Nested loops w/o tiles; Host backends (begin with Serial)
Issue #8652 reported a performance issue with MDRangePolicy in the Serial backend.
Two approaches to iterating over the elements of a View via MDRangePolicy were tried. The approach based on direct nested for loops over the elements (without tiles) shows a consistent speedup in the benchmarks run so far (see below).
Please share your thoughts on how to choose between the various approaches: looping over tiles, nested for loops without tiles, a flat loop with index reconstruction, etc. Currently, the approach in this PR is used if the tile dimensions are all set to 1; otherwise, the existing approach (a single loop over the tiles) is used.
@crtrott also mentioned that this approach could be useful for multi-core CPU backends.
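For reference, here is a minimal standalone sketch of the three iteration strategies mentioned above for a rank-2 range. The extents, tile sizes, and `body` callable are made up purely for illustration; this is not the actual Kokkos implementation.

```c++
#include <algorithm>
#include <cstdio>

int main() {
  // Hypothetical extents and tile sizes, chosen only for illustration.
  const int n0 = 8, n1 = 8;
  const int t0 = 2, t1 = 4;
  auto body = [](int i, int j) { std::printf("(%d,%d) ", i, j); };

  // (a) Existing approach: a single loop over the tiles, with nested loops
  //     over the elements of each tile.
  const int ntiles0 = (n0 + t0 - 1) / t0;
  const int ntiles1 = (n1 + t1 - 1) / t1;
  for (int tile = 0; tile < ntiles0 * ntiles1; ++tile) {
    const int ti = tile / ntiles1;
    const int tj = tile % ntiles1;
    for (int i = ti * t0; i < std::min(n0, (ti + 1) * t0); ++i)
      for (int j = tj * t1; j < std::min(n1, (tj + 1) * t1); ++j)
        body(i, j);
  }

  // (b) This PR: direct nested loops over the elements, no tiles.
  for (int i = 0; i < n0; ++i)
    for (int j = 0; j < n1; ++j)
      body(i, j);

  // (c) Flat loop with index reconstruction.
  for (int k = 0; k < n0 * n1; ++k)
    body(k / n1, k % n1);

  return 0;
}
```

For LayoutRight data, variant (b) keeps the innermost loop contiguous in memory without the tile bookkeeping of (a), which is presumably where the speedup comes from.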
Stencil-based example shared in https://github.com/kokkos/kokkos/issues/8652. In total there are three stencil computations, all involving 2D Views. Multiple Views are involved, with only a few of them shared across the stencils.
Only timings regarding MDRange are expected to change.
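For context, a hedged sketch of what a 2D stencil kernel of this kind might look like when expressed with MDRangePolicy. The view names, coefficients, and kernel body here are illustrative and not the exact code from the issue; the sketch assumes the Serial backend is enabled.

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 6144;  // matches the first benchmark size below
    Kokkos::View<double**> in("in", n, n), out("out", n, n);

    // Rank-2 MDRangePolicy on the Serial backend with 8 x 256 tiles.
    using policy_t = Kokkos::MDRangePolicy<Kokkos::Serial, Kokkos::Rank<2>>;
    Kokkos::parallel_for(
        "stencil2d", policy_t({1, 1}, {n - 1, n - 1}, {8, 256}),
        KOKKOS_LAMBDA(const int i, const int j) {
          // Illustrative 5-point stencil over the interior of the 2D View.
          out(i, j) = 0.25 * (in(i - 1, j) + in(i + 1, j) +
                              in(i, j - 1) + in(i, j + 1));
        });
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```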
Size of each dimension of the 2D Views: 6144. Tile dimensions: 8 × 256
| | Current tiles-based approach | Directly iterating over the elements without tiles |
|---|---|---|
| Time: Range 3 Kernels | 1.255342 s | 1.271328 s |
| Time: MDRange 3 Kernels | 2.196737 s | 1.853119 s |
| Time: Range 1 Kernel | 1.417772 s | 1.425144 s |
| Time: MDRange 1 Kernel | 2.901179 s | 2.291974 s |
| Time: MDRange Flat 1 Kernel | 2.705818 s | 2.732882 s |
| Time: Team Policy | 1.407209 s | 1.417905 s |
Size of each dimension of the 2D Views: 8193. Tile dimensions: 8 × 256
| | Current tiles-based approach | Directly iterating over the elements without tiles |
|---|---|---|
| Time: Range 3 Kernels | 2.221814 s | 2.213865 s |
| Time: MDRange 3 Kernels | 3.976642 s | 3.359611 s |
| Time: Range 1 Kernel | 2.511044 s | 2.535955 s |
| Time: MDRange 1 Kernel | 4.690899 s | 4.040167 s |
| Time: MDRange Flat 1 Kernel | 4.793878 s | 4.884660 s |
| Time: Team Policy | 2.488391 s | 2.491935 s |
Stream benchmark: based on PR #8462 (2c3ba82ab1e87705b5a420a23f14e2d1d59f3771). In all cases, the total number of elements in the View is $16^6$. Note: unlike in that PR, the `parallel_for` was launched 100 times to get run times on the order of seconds.
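For clarity, a rough standalone sketch of the timing methodology described above (the rank-2 Copy operation, launched 100 times and timed as a whole). The actual benchmark code lives in the referenced PR; the data type and default tile sizes here are assumptions.

```c++
#include <Kokkos_Core.hpp>
#include <chrono>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Rank-2 case with 16^6 = 4096 * 4096 total elements; other ranks are analogous.
    const int n = 4096;
    Kokkos::View<double**> a("a", n, n), b("b", n, n);

    // Tile sizes are left at their defaults here.
    Kokkos::MDRangePolicy<Kokkos::Serial, Kokkos::Rank<2>> policy({0, 0}, {n, n});

    const auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep) {  // 100 launches, as in the note above
      // Copy<2>: b(i, j) = a(i, j)
      Kokkos::parallel_for(
          "Copy<2>", policy,
          KOKKOS_LAMBDA(const int i, const int j) { b(i, j) = a(i, j); });
      Kokkos::fence();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    std::printf("Copy<2>: %.0f ms\n", elapsed.count());
  }
  Kokkos::finalize();
}
```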
| Operation<Rank> | Tiles | No Tiles | FOM |
|---|---|---|---|
| Copy<1> | 1055 ms | 968 ms | FOM: GB/s=0.277449/s MB=268.435 |
| Copy<2> | 1065 ms | 1036 ms | FOM: GB/s=0.259217/s MB=268.435 |
| Copy<3> | 1015 ms | 1043 ms | FOM: GB/s=0.257328/s MB=268.435 |
| Copy<4> | 1185 ms | 1039 ms | FOM: GB/s=0.258273/s MB=268.435 |
| Copy<5> | 1167 ms | 1038 ms | FOM: GB/s=0.258493/s MB=268.435 |
| Copy<6> | 1214 ms | 1068 ms | FOM: GB/s=0.251298/s MB=268.435 |
| Set<1> | 801 ms | 746 ms | FOM: GB/s=0.179817/s MB=134.218 |
| Set<2> | 799 ms | 741 ms | FOM: GB/s=0.181102/s MB=134.218 |
| Set<3> | 715 ms | 742 ms | FOM: GB/s=0.180931/s MB=134.218 |
| Set<4> | 771 ms | 739 ms | FOM: GB/s=0.181647/s MB=134.218 |
| Set<5> | 691 ms | 743 ms | FOM: GB/s=0.180738/s MB=134.218 |
| Set<6> | 671 ms | 749 ms | FOM: GB/s=0.179249/s MB=134.218 |
| Add<1> | 1440 ms | 1206 ms | FOM: GB/s=0.333874/s MB=402.653 |
| Add<2> | 1442 ms | 1305 ms | FOM: GB/s=0.308443/s MB=402.653 |
| Add<3> | 1387 ms | 1299 ms | FOM: GB/s=0.309954/s MB=402.653 |
| Add<4> | 1892 ms | 1302 ms | FOM: GB/s=0.309343/s MB=402.653 |
| Add<5> | 1909 ms | 1302 ms | FOM: GB/s=0.309343/s MB=402.653 |
| Add<6> | 1696 ms | 1413 ms | FOM: GB/s=0.284918/s MB=402.653 |
| Scale<1> | 1067 ms | 956 ms | FOM: GB/s=0.280883/s MB=268.435 |
| Scale<2> | 1047 ms | 1046 ms | FOM: GB/s=0.25656/s MB=268.435 |
| Scale<3> | 1021 ms | 1043 ms | FOM: GB/s=0.257302/s MB=268.435 |
| Scale<4> | 1272 ms | 1046 ms | FOM: GB/s=0.256544/s MB=268.435 |
| Scale<5> | 1212 ms | 1047 ms | FOM: GB/s=0.256409/s MB=268.435 |
| Scale<6> | 1218 ms | 1073 ms | FOM: GB/s=0.25028/s MB=268.435 |
| Triad<1> | 1383 ms | 1219 ms | FOM: GB/s=0.330189/s MB=402.653 |
| Triad<2> | 1396 ms | 1314 ms | FOM: GB/s=0.306464/s MB=402.653 |
| Triad<3> | 1373 ms | 1321 ms | FOM: GB/s=0.304741/s MB=402.653 |
| Triad<4> | 1808 ms | 1360 ms | FOM: GB/s=0.296073/s MB=402.653 |
| Triad<5> | 1889 ms | 1404 ms | FOM: GB/s=0.286864/s MB=402.653 |
| Triad<6> | 1653 ms | 1397 ms | FOM: GB/s=0.288239/s MB=402.653 |
Stencil benchmark: based on PR #8458 (ac90be74dd56b18d71f6181684975a83331fccf6). Default tiling was used in all cases. Note: unlike in that PR, the `parallel_for` was launched 100 times to get longer total run times.
| Configuration | Tiles | No Tiles | Default tile dimensions |
|---|---|---|---|
| 2D, LayoutRight, each dim. = 512 | 29.3 ms | 21.1 ms | tile_0=2 tile_1=512 |
| 2D, LayoutRight, each dim. = 1024 | 118 ms | 85.9 ms | tile_0=2 tile_1=1.024k |
| 2D, LayoutRight, each dim. = 2048 | 482 ms | 372 ms | tile_0=2 tile_1=2.048k |
| 2D, LayoutRight, each dim. = 4096 | 1937 ms | 1509 ms | tile_0=2 tile_1=4.096k |
| 2D, LayoutLeft, each dim. = 512 | 29.5 ms | 21.7 ms | tile_0=512 tile_1=2 |
| 2D, LayoutLeft, each dim. = 1024 | 118 ms | 85.3 ms | tile_0=1.024k tile_1=2 |
| 2D, LayoutLeft, each dim. = 2048 | 480 ms | 371 ms | tile_0=2.048k tile_1=2 |
| 2D, LayoutLeft, each dim. = 4096 | 1931 ms | 1504 ms | tile_0=4.096k tile_1=2 |
| 3D, LayoutRight, each dim. = 128 | 328 ms | 233 ms | tile_0=2 tile_1=2 tile_2=128 |
| 3D, LayoutRight, each dim. = 192 | 1048 ms | 795 ms | tile_0=2 tile_1=2 tile_2=192 |
| 3D, LayoutRight, each dim. = 256 | 2505 ms | 1937 ms | tile_0=2 tile_1=2 tile_2=256 |
| 3D, LayoutLeft, each dim. = 128 | 316 ms | 230 ms | tile_0=128 tile_1=2 tile_2=2 |
| 3D, LayoutLeft, each dim. = 192 | 1077 ms | 788 ms | tile_0=192 tile_1=2 tile_2=2 |
| 3D, LayoutLeft, each dim. = 256 | 2557 ms | 1902 ms | tile_0=256 tile_1=2 tile_2=2 |
| 4D, LayoutRight, each dim. = 32 | 229 ms | 151 ms | tile_0=2 tile_1=2 tile_2=2 tile_3=32 |
| 4D, LayoutRight, each dim. = 64 | 3348 ms | 2386 ms | tile_0=2 tile_1=2 tile_2=2 tile_3=64 |
| 4D, LayoutLeft, each dim. = 32 | 216 ms | 161 ms | tile_0=32 tile_1=2 tile_2=2 tile_3=2 |
| 4D, LayoutLeft, each dim. = 64 | 3368 ms | 2514 ms | tile_0=64 tile_1=2 tile_2=2 tile_3=2 |
The title should mention that this only concerns the serial backend.
This work could be useful for multi-core CPUs also.
I'd be curious how you would distribute the execution range for host parallel backends with this implementation.
There are no dependencies between the loop iterations; the main concern is fetching data in a cache-friendly manner. Parallelizing over batches of the "higher level" (outer) iterations could help.
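As a minimal sketch of that idea for a rank-2 range, a hypothetical per-thread helper could hand each thread a contiguous batch of the outermost iterations while keeping the untiled inner loop (this is not an actual Kokkos backend implementation; all names are made up):

```c++
#include <algorithm>

// Hypothetical helper: thread `thread_rank` of `nthreads` handles a contiguous
// batch of the outermost iterations; within the batch the untiled nested loop
// is kept, so the innermost traversal stays cache friendly.
template <class Functor>
void run_batch(int thread_rank, int nthreads, int n0, int n1, Functor body) {
  const int chunk = (n0 + nthreads - 1) / nthreads;
  const int begin = thread_rank * chunk;
  const int end   = std::min(n0, begin + chunk);
  for (int i = begin; i < end; ++i)  // this thread's batch of outer iterations
    for (int j = 0; j < n1; ++j)     // full inner loop (contiguous for LayoutRight)
      body(i, j);
}
```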
Also, the results in #8652 are quite good for the RangePolicy with a for loop inside the functor. One reason could be better vectorization at the -O3 optimization level. I will add `_Pragma("ivdep")` to the innermost loop, as is done in `core/src/impl/KokkosExp_Host_IterateTile.hpp`. I will also re-run that test with the `Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION` and `Kokkos_ENABLE_PRAGMA_IVDEP` flags to see whether the performance of the current MDRangePolicy can be improved for the Serial backend.
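As a rough sketch of where that pragma would go in the untiled nested loops (the guard macro name here is my assumption, mirroring the build option named above; the exact placement in the PR may differ):

```c++
// Hedged sketch: only the innermost loop gets the ivdep hint, and only when
// the corresponding Kokkos option is enabled (guard macro assumed to be
// KOKKOS_ENABLE_PRAGMA_IVDEP).
template <class Functor>
void iterate_no_tiles_2d(int n0, int n1, Functor body) {
  for (int i = 0; i < n0; ++i) {
#if defined(KOKKOS_ENABLE_PRAGMA_IVDEP)
    _Pragma("ivdep")
#endif
    for (int j = 0; j < n1; ++j) body(i, j);
  }
}
```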
Do you have a test for the no-tile version?
I will add test(s).