
MDRangePolicy: Nested loops w/o tiles; Host backends (begin with Serial)

science-enthusiast opened this issue 1 month ago • 8 comments

Issue 8652 reported a performance problem with MDRangePolicy in the Serial backend.

Two approaches were tried for iterating over the elements of a View via MDRangePolicy. The approach based on direct nested for loops over the elements (without tiles) shows a consistent speedup in the benchmarks tested so far (see below).

Please share your thoughts on how to choose between the various approaches: looping over tiles, nested for loops without tiles, a flat loop with index reconstruction, etc. Currently, the approach in this PR is used if the tile dimensions are all set to 1; otherwise, the existing approach (a single loop over the tiles) is used.
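For reference, here is a minimal sketch of the three iteration patterns being discussed, for a 2D range. The names (`N0`, `N1`, `T0`, `T1`, `f`) are placeholders and this is illustrative pseudocode, not the actual Kokkos implementation:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch of the iteration patterns under discussion for a 2D range.
// N0, N1 are extents, T0, T1 tile sizes, f(i, j) the user functor; all names are
// placeholders. Each block below is one alternative; a real implementation would
// pick exactly one of them.
template <class Functor>
void iterate_2d(std::int64_t N0, std::int64_t N1, std::int64_t T0, std::int64_t T1,
                Functor f) {
  // (a) Existing approach: a single loop over the tiles, then over each tile's elements.
  const std::int64_t tiles0 = (N0 + T0 - 1) / T0;
  const std::int64_t tiles1 = (N1 + T1 - 1) / T1;
  for (std::int64_t t = 0; t < tiles0 * tiles1; ++t) {
    const std::int64_t i0 = (t / tiles1) * T0;
    const std::int64_t j0 = (t % tiles1) * T1;
    for (std::int64_t i = i0; i < std::min(i0 + T0, N0); ++i)
      for (std::int64_t j = j0; j < std::min(j0 + T1, N1); ++j) f(i, j);
  }

  // (b) This PR's approach (selected when all tile dimensions are 1):
  //     direct nested loops over the elements, no tiles.
  for (std::int64_t i = 0; i < N0; ++i)
    for (std::int64_t j = 0; j < N1; ++j) f(i, j);

  // (c) Flat loop with index reconstruction (division/modulo per element).
  for (std::int64_t k = 0; k < N0 * N1; ++k) f(k / N1, k % N1);
}
```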

@crtrott also mentioned that this approach could be useful for multi-core CPU backends.

science-enthusiast · Dec 03 '25 14:12

Stencil-based example shared in https://github.com/kokkos/kokkos/issues/8652. In total there are three stencil computations, all involving 2D Views. Multiple Views are used, with only a few of them shared across the stencils.

Only the MDRange timings are expected to change.
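As a rough illustration only (the actual kernels and Views are in issue 8652; the names and coefficients below are made up), a 2D stencil launched via MDRangePolicy with the explicit 8 x 256 tiles used in these runs might look like:

```cpp
#include <Kokkos_Core.hpp>

// Rough illustration of a 2D stencil launched via MDRangePolicy; run with the
// Serial backend as the default execution space. View names and coefficients
// are made up, not the code from issue 8652.
void run_stencil(Kokkos::View<double**> out, Kokkos::View<const double**> in) {
  const int n0 = static_cast<int>(in.extent(0));
  const int n1 = static_cast<int>(in.extent(1));
  // Lower/upper bounds plus explicit 8 x 256 tile dimensions.
  Kokkos::MDRangePolicy<Kokkos::Rank<2>> policy({1, 1}, {n0 - 1, n1 - 1}, {8, 256});
  Kokkos::parallel_for(
      "stencil2d", policy, KOKKOS_LAMBDA(int i, int j) {
        out(i, j) = 0.25 * (in(i - 1, j) + in(i + 1, j) + in(i, j - 1) + in(i, j + 1));
      });
}
```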

Size of each dimension of the 2D Views: 6144. Tile dimensions: 8 X 256

| | Current tiles-based approach | Directly iterating over the elements without tiles |
|---|---|---|
| Time: Range 3 Kernels | 1.255342 s | 1.271328 s |
| Time: MDRange 3 Kernels | 2.196737 s | 1.853119 s |
| Time: Range 1 Kernel | 1.417772 s | 1.425144 s |
| Time: MDRange 1 Kernel | 2.901179 s | 2.291974 s |
| Time: MDRange Flat 1 Kernel | 2.705818 s | 2.732882 s |
| Time: Team Policy | 1.407209 s | 1.417905 s |

Size of each dimension of the 2D Views: 8193. Tile dimensions: 8 X 256

| | Current tiles-based approach | Directly iterating over the elements without tiles |
|---|---|---|
| Time: Range 3 Kernels | 2.221814 s | 2.213865 s |
| Time: MDRange 3 Kernels | 3.976642 s | 3.359611 s |
| Time: Range 1 Kernel | 2.511044 s | 2.535955 s |
| Time: MDRange 1 Kernel | 4.690899 s | 4.040167 s |
| Time: MDRange Flat 1 Kernel | 4.793878 s | 4.884660 s |
| Time: Team Policy | 2.488391 s | 2.491935 s |

science-enthusiast · Dec 03 '25 14:12

Stream benchmark: based on PR 8462 (2c3ba82ab1e87705b5a420a23f14e2d1d59f3771). In all cases, the total number of elements in the View is $16^6$. Note: unlike the PR, the parallel_for was launched 100 times so that the timings are on the order of seconds.
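A rough sketch of how such a repeated launch might be timed for the rank-2 Copy kernel (the function name and structure are assumptions, not the code in PR 8462):

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_Timer.hpp>

// Illustrative sketch: time a rank-2 "Copy" stream kernel launched 100 times.
// Names and structure are assumptions, not the benchmark code in PR 8462.
double time_copy_2d(Kokkos::View<double**> dst, Kokkos::View<const double**> src) {
  const int n0 = static_cast<int>(dst.extent(0));
  const int n1 = static_cast<int>(dst.extent(1));
  // No tile dimensions passed, so the policy's default tiling is used.
  Kokkos::MDRangePolicy<Kokkos::Rank<2>> policy({0, 0}, {n0, n1});
  Kokkos::Timer timer;
  for (int rep = 0; rep < 100; ++rep) {
    Kokkos::parallel_for(
        "copy2d", policy, KOKKOS_LAMBDA(int i, int j) { dst(i, j) = src(i, j); });
  }
  Kokkos::fence();
  return timer.seconds();  // total time for the 100 launches
}
```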

| Operation<Rank> | Tiles | No Tiles | FOM |
|---|---|---|---|
| Copy<1> | 1055 ms | 968 ms | GB/s=0.277449/s MB=268.435 |
| Copy<2> | 1065 ms | 1036 ms | GB/s=0.259217/s MB=268.435 |
| Copy<3> | 1015 ms | 1043 ms | GB/s=0.257328/s MB=268.435 |
| Copy<4> | 1185 ms | 1039 ms | GB/s=0.258273/s MB=268.435 |
| Copy<5> | 1167 ms | 1038 ms | GB/s=0.258493/s MB=268.435 |
| Copy<6> | 1214 ms | 1068 ms | GB/s=0.251298/s MB=268.435 |
| Set<1> | 801 ms | 746 ms | GB/s=0.179817/s MB=134.218 |
| Set<2> | 799 ms | 741 ms | GB/s=0.181102/s MB=134.218 |
| Set<3> | 715 ms | 742 ms | GB/s=0.180931/s MB=134.218 |
| Set<4> | 771 ms | 739 ms | GB/s=0.181647/s MB=134.218 |
| Set<5> | 691 ms | 743 ms | GB/s=0.180738/s MB=134.218 |
| Set<6> | 671 ms | 749 ms | GB/s=0.179249/s MB=134.218 |
| Add<1> | 1440 ms | 1206 ms | GB/s=0.333874/s MB=402.653 |
| Add<2> | 1442 ms | 1305 ms | GB/s=0.308443/s MB=402.653 |
| Add<3> | 1387 ms | 1299 ms | GB/s=0.309954/s MB=402.653 |
| Add<4> | 1892 ms | 1302 ms | GB/s=0.309343/s MB=402.653 |
| Add<5> | 1909 ms | 1302 ms | GB/s=0.309343/s MB=402.653 |
| Add<6> | 1696 ms | 1413 ms | GB/s=0.284918/s MB=402.653 |
| Scale<1> | 1067 ms | 956 ms | GB/s=0.280883/s MB=268.435 |
| Scale<2> | 1047 ms | 1046 ms | GB/s=0.25656/s MB=268.435 |
| Scale<3> | 1021 ms | 1043 ms | GB/s=0.257302/s MB=268.435 |
| Scale<4> | 1272 ms | 1046 ms | GB/s=0.256544/s MB=268.435 |
| Scale<5> | 1212 ms | 1047 ms | GB/s=0.256409/s MB=268.435 |
| Scale<6> | 1218 ms | 1073 ms | GB/s=0.25028/s MB=268.435 |
| Triad<1> | 1383 ms | 1219 ms | GB/s=0.330189/s MB=402.653 |
| Triad<2> | 1396 ms | 1314 ms | GB/s=0.306464/s MB=402.653 |
| Triad<3> | 1373 ms | 1321 ms | GB/s=0.304741/s MB=402.653 |
| Triad<4> | 1808 ms | 1360 ms | GB/s=0.296073/s MB=402.653 |
| Triad<5> | 1889 ms | 1404 ms | GB/s=0.286864/s MB=402.653 |
| Triad<6> | 1653 ms | 1397 ms | GB/s=0.288239/s MB=402.653 |

science-enthusiast · Dec 03 '25 14:12

Stencil benchmark: based on PR 8458 (ac90be74dd56b18d71f6181684975a83331fccf6). Default tiling was used in all cases. Note: unlike the PR, the parallel_for was launched 100 times to obtain larger total durations.
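For clarity, "default tiling" here means no tile extents are passed to the policy, so it chooses them itself; those chosen values are the tile_0/tile_1/... entries in the table below. A minimal sketch (variable names are placeholders):

```cpp
#include <Kokkos_Core.hpp>

// Default tiling: no tile extents passed, so the policy picks them itself
// (producing the tile_0/tile_1/... values reported in the table below).
void make_policies(int n0, int n1) {
  Kokkos::MDRangePolicy<Kokkos::Rank<2>> default_tiles({0, 0}, {n0, n1});
  // For comparison, explicit tile extents (e.g. the 8 x 256 used in the runs above):
  Kokkos::MDRangePolicy<Kokkos::Rank<2>> explicit_tiles({0, 0}, {n0, n1}, {8, 256});
  (void)default_tiles;
  (void)explicit_tiles;
}
```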

| | Tiles | No Tiles | Default tile dimensions |
|---|---|---|---|
| 2D, LayoutRight, each dim. = 512 | 29.3 ms | 21.1 ms | tile_0=2 tile_1=512 |
| 2D, LayoutRight, each dim. = 1024 | 118 ms | 85.9 ms | tile_0=2 tile_1=1.024k |
| 2D, LayoutRight, each dim. = 2048 | 482 ms | 372 ms | tile_0=2 tile_1=2.048k |
| 2D, LayoutRight, each dim. = 4096 | 1937 ms | 1509 ms | tile_0=2 tile_1=4.096k |
| 2D, LayoutLeft, each dim. = 512 | 29.5 ms | 21.7 ms | tile_0=512 tile_1=2 |
| 2D, LayoutLeft, each dim. = 1024 | 118 ms | 85.3 ms | tile_0=1.024k tile_1=2 |
| 2D, LayoutLeft, each dim. = 2048 | 480 ms | 371 ms | tile_0=2.048k tile_1=2 |
| 2D, LayoutLeft, each dim. = 4096 | 1931 ms | 1504 ms | tile_0=4.096k tile_1=2 |
| 3D, LayoutRight, each dim. = 128 | 328 ms | 233 ms | tile_0=2 tile_1=2 tile_2=128 |
| 3D, LayoutRight, each dim. = 192 | 1048 ms | 795 ms | tile_0=2 tile_1=2 tile_2=192 |
| 3D, LayoutRight, each dim. = 256 | 2505 ms | 1937 ms | tile_0=2 tile_1=2 tile_2=256 |
| 3D, LayoutLeft, each dim. = 128 | 316 ms | 230 ms | tile_0=128 tile_1=2 tile_2=2 |
| 3D, LayoutLeft, each dim. = 192 | 1077 ms | 788 ms | tile_0=192 tile_1=2 tile_2=2 |
| 3D, LayoutLeft, each dim. = 256 | 2557 ms | 1902 ms | tile_0=256 tile_1=2 tile_2=2 |
| 4D, LayoutRight, each dim. = 32 | 229 ms | 151 ms | tile_0=2 tile_1=2 tile_2=2 tile_3=32 |
| 4D, LayoutRight, each dim. = 64 | 3348 ms | 2386 ms | tile_0=2 tile_1=2 tile_2=2 tile_3=64 |
| 4D, LayoutLeft, each dim. = 32 | 216 ms | 161 ms | tile_0=32 tile_1=2 tile_2=2 tile_3=2 |
| 4D, LayoutLeft, each dim. = 64 | 3368 ms | 2514 ms | tile_0=64 tile_1=2 tile_2=2 tile_3=2 |

science-enthusiast · Dec 03 '25 14:12

The title should mention that this only concerns the serial backend.

AdRi1t · Dec 03 '25 14:12

> The title should mention that this only concerns the serial backend.

This work could be useful for multi-core CPUs also.

science-enthusiast · Dec 03 '25 14:12

> This work could be useful for multi-core CPUs also.

I'd be curious how you would distribute the execution range for host parallel backends with this implementation.

masterleinad · Dec 03 '25 14:12

> I'd be curious how you would distribute the execution range for host parallel backends with this implementation.

There are no dependencies between the loop iterations; the only concern is fetching data in a cache-friendly manner. Parallelizing over batches of the "higher level" (outer) iterations could help.
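One possible distribution scheme (an illustration only, not something the PR implements): give each thread a contiguous batch of the outermost iterations and run the untiled inner loop within that batch. The function and parameter names below are placeholders:

```cpp
#include <algorithm>
#include <cstdint>

// Illustration of distributing the outermost ("higher level") iterations among
// threads: each thread handles a contiguous batch of outer indices and runs the
// full, untiled inner loop, so the cache-friendly access order is preserved.
template <class Functor>
void run_my_batch(std::int64_t N0, std::int64_t N1, int num_threads, int thread_rank,
                  Functor f) {
  const std::int64_t chunk = (N0 + num_threads - 1) / num_threads;
  const std::int64_t begin = thread_rank * chunk;
  const std::int64_t end   = std::min(begin + chunk, N0);
  for (std::int64_t i = begin; i < end; ++i)  // this thread's batch of outer iterations
    for (std::int64_t j = 0; j < N1; ++j)     // full, untiled inner loop
      f(i, j);
}
```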

Also, the results in #8652 are quite good for the RangePolicy with a for loop inside the functor. One reason could be better vectorization at the -O3 optimization level. I will add _Pragma("ivdep") to the innermost loop, as is done in core/src/impl/KokkosExp_Host_IterateTile.hpp. I will also re-run that test with the flags Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION and Kokkos_ENABLE_PRAGMA_IVDEP to see whether the performance of the current MDRangePolicy can be improved for the Serial backend.
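The intended annotation on the innermost loop would look roughly like the sketch below, following the pattern in KokkosExp_Host_IterateTile.hpp (the exact guards used in that header may differ; names here are placeholders):

```cpp
#include <Kokkos_Macros.hpp>  // provides the KOKKOS_ENABLE_* macros when configured
#include <cstdint>

// Sketch of annotating the innermost loop with an ivdep hint, guarded so it only
// applies when the corresponding Kokkos options are enabled. This approximates the
// pattern in core/src/impl/KokkosExp_Host_IterateTile.hpp.
template <class Functor>
void inner_loop(std::int64_t i, std::int64_t N1, Functor f) {
#if defined(KOKKOS_ENABLE_AGGRESSIVE_VECTORIZATION) && \
    defined(KOKKOS_ENABLE_PRAGMA_IVDEP)
#pragma ivdep
#endif
  for (std::int64_t j = 0; j < N1; ++j) f(i, j);
}
```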

science-enthusiast · Dec 03 '25 15:12

> Do you have a test for the no-tile version?

I will add test(s).

science-enthusiast · Dec 10 '25 08:12