MDRangePolicy: Nested loops w/o tiles; Host backends (begin with Serial)
Issue #8652 reported a performance issue with MDRangePolicy in the Serial backend.
Two approaches to iterating over the elements of a View via MDRangePolicy were tried. The approach based on direct nested for loops over the elements (without tiles) shows a consistent speedup in the benchmarks run so far (see below).
Please share your thoughts on how to choose between the various approaches: looping over tiles, nested for loops without tiles, a flat loop with index reconstruction, etc. Currently, the approach in this PR is used if the tile dimensions are all set to 1; otherwise, the existing approach (a single loop over the tiles) is used.
@crtrott also mentioned that this approach could be useful for multi-core CPU backends.
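For reference, here is a minimal standalone sketch of the three iteration strategies mentioned above for a rank-2 range. The extents, tile sizes, and `body` callable are made up purely for illustration; this is not the actual Kokkos implementation.

```c++
#include <algorithm>
#include <cstdio>

int main() {
  // Hypothetical extents and tile sizes, chosen only for illustration.
  const int n0 = 8, n1 = 8;
  const int t0 = 2, t1 = 4;
  auto body = [](int i, int j) { std::printf("(%d,%d) ", i, j); };

  // (a) Existing approach: a single loop over the tiles, with nested loops
  //     over the elements of each tile.
  const int ntiles0 = (n0 + t0 - 1) / t0;
  const int ntiles1 = (n1 + t1 - 1) / t1;
  for (int tile = 0; tile < ntiles0 * ntiles1; ++tile) {
    const int ti = tile / ntiles1;
    const int tj = tile % ntiles1;
    for (int i = ti * t0; i < std::min(n0, (ti + 1) * t0); ++i)
      for (int j = tj * t1; j < std::min(n1, (tj + 1) * t1); ++j)
        body(i, j);
  }

  // (b) This PR: direct nested loops over the elements, no tiles.
  for (int i = 0; i < n0; ++i)
    for (int j = 0; j < n1; ++j)
      body(i, j);

  // (c) Flat loop with index reconstruction.
  for (int k = 0; k < n0 * n1; ++k)
    body(k / n1, k % n1);

  return 0;
}
```

For LayoutRight data, variant (b) keeps the innermost loop contiguous in memory without the tile bookkeeping of (a), which is presumably where the speedup comes from.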
Stencil-based example shared in https://github.com/kokkos/kokkos/issues/8652. In total there are three stencil computations, all involving 2D Views. Multiple Views are involved, with only a few of them shared across the stencils.
Only timings regarding MDRange are expected to change.
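For context, a hedged sketch of what a 2D stencil kernel of this kind might look like when expressed with MDRangePolicy. The view names, coefficients, and kernel body here are illustrative and not the exact code from the issue; the sketch assumes the Serial backend is enabled.

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 6144;  // matches the first benchmark size below
    Kokkos::View<double**> in("in", n, n), out("out", n, n);

    // Rank-2 MDRangePolicy on the Serial backend with 8 x 256 tiles.
    using policy_t = Kokkos::MDRangePolicy<Kokkos::Serial, Kokkos::Rank<2>>;
    Kokkos::parallel_for(
        "stencil2d", policy_t({1, 1}, {n - 1, n - 1}, {8, 256}),
        KOKKOS_LAMBDA(const int i, const int j) {
          // Illustrative 5-point stencil over the interior of the 2D View.
          out(i, j) = 0.25 * (in(i - 1, j) + in(i + 1, j) +
                              in(i, j - 1) + in(i, j + 1));
        });
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```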
Size of each dimension of the 2D Views: 6144. Tile dimensions: 8 × 256
| | Current tiles-based approach | Directly iterating over the elements without tiles |
|---|---|---|
| Time: Range 3 Kernels | 1.255342 s | 1.271328 s |
| Time: MDRange 3 Kernels | 2.196737 s | 1.853119 s |
| Time: Range 1 Kernel | 1.417772 s | 1.425144 s |
| Time: MDRange 1 Kernel | 2.901179 s | 2.291974 s |
| Time: MDRange Flat 1 Kernel | 2.705818 s | 2.732882 s |
| Time: Team Policy | 1.407209 s | 1.417905 s |
Size of each dimension of the 2D Views: 8193. Tile dimensions: 8 × 256
| | Current tiles-based approach | Directly iterating over the elements without tiles |
|---|---|---|
| Time: Range 3 Kernels | 2.221814 s | 2.213865 s |
| Time: MDRange 3 Kernels | 3.976642 s | 3.359611 s |
| Time: Range 1 Kernel | 2.511044 s | 2.535955 s |
| Time: MDRange 1 Kernel | 4.690899 s | 4.040167 s |
| Time: MDRange Flat 1 Kernel | 4.793878 s | 4.884660 s |
| Time: Team Policy | 2.488391 s | 2.491935 s |
Stream benchmark: based on PR #8462 (2c3ba82ab1e87705b5a420a23f14e2d1d59f3771). In all cases, the total number of elements in the View is $16^6$. Note: unlike in that PR, the `parallel_for` was launched 100 times to get run times on the order of seconds.
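For clarity, a rough standalone sketch of the timing methodology described above (the rank-2 Copy operation, launched 100 times and timed as a whole). The actual benchmark code lives in the referenced PR; the data type and default tile sizes here are assumptions.

```c++
#include <Kokkos_Core.hpp>
#include <chrono>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Rank-2 case with 16^6 = 4096 * 4096 total elements; other ranks are analogous.
    const int n = 4096;
    Kokkos::View<double**> a("a", n, n), b("b", n, n);

    // Tile sizes are left at their defaults here.
    Kokkos::MDRangePolicy<Kokkos::Serial, Kokkos::Rank<2>> policy({0, 0}, {n, n});

    const auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep) {  // 100 launches, as in the note above
      // Copy<2>: b(i, j) = a(i, j)
      Kokkos::parallel_for(
          "Copy<2>", policy,
          KOKKOS_LAMBDA(const int i, const int j) { b(i, j) = a(i, j); });
      Kokkos::fence();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    std::printf("Copy<2>: %.0f ms\n", elapsed.count());
  }
  Kokkos::finalize();
}
```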
| Operation<Rank> | Tiles | No Tiles | FOM |
|---|---|---|---|
| Copy<1> | 1055 ms | 968 ms | FOM: GB/s=0.277449/s MB=268.435 |
| Copy<2> | 1065 ms | 1036 ms | FOM: GB/s=0.259217/s MB=268.435 |
| Copy<3> | 1015 ms | 1043 ms | FOM: GB/s=0.257328/s MB=268.435 |
| Copy<4> | 1185 ms | 1039 ms | FOM: GB/s=0.258273/s MB=268.435 |
| Copy<5> | 1167 ms | 1038 ms | FOM: GB/s=0.258493/s MB=268.435 |
| Copy<6> | 1214 ms | 1068 ms | FOM: GB/s=0.251298/s MB=268.435 |
| Set<1> | 801 ms | 746 ms | FOM: GB/s=0.179817/s MB=134.218 |
| Set<2> | 799 ms | 741 ms | FOM: GB/s=0.181102/s MB=134.218 |
| Set<3> | 715 ms | 742 ms | FOM: GB/s=0.180931/s MB=134.218 |
| Set<4> | 771 ms | 739 ms | FOM: GB/s=0.181647/s MB=134.218 |
| Set<5> | 691 ms | 743 ms | FOM: GB/s=0.180738/s MB=134.218 |
| Set<6> | 671 ms | 749 ms | FOM: GB/s=0.179249/s MB=134.218 |
| Add<1> | 1440 ms | 1206 ms | FOM: GB/s=0.333874/s MB=402.653 |
| Add<2> | 1442 ms | 1305 ms | FOM: GB/s=0.308443/s MB=402.653 |
| Add<3> | 1387 ms | 1299 ms | FOM: GB/s=0.309954/s MB=402.653 |
| Add<4> | 1892 ms | 1302 ms | FOM: GB/s=0.309343/s MB=402.653 |
| Add<5> | 1909 ms | 1302 ms | FOM: GB/s=0.309343/s MB=402.653 |
| Add<6> | 1696 ms | 1413 ms | FOM: GB/s=0.284918/s MB=402.653 |
| Scale<1> | 1067 ms | 956 ms | FOM: GB/s=0.280883/s MB=268.435 |
| Scale<2> | 1047 ms | 1046 ms | FOM: GB/s=0.25656/s MB=268.435 |
| Scale<3> | 1021 ms | 1043 ms | FOM: GB/s=0.257302/s MB=268.435 |
| Scale<4> | 1272 ms | 1046 ms | FOM: GB/s=0.256544/s MB=268.435 |
| Scale<5> | 1212 ms | 1047 ms | FOM: GB/s=0.256409/s MB=268.435 |
| Scale<6> | 1218 ms | 1073 ms | FOM: GB/s=0.25028/s MB=268.435 |
| Triad<1> | 1383 ms | 1219 ms | FOM: GB/s=0.330189/s MB=402.653 |
| Triad<2> | 1396 ms | 1314 ms | FOM: GB/s=0.306464/s MB=402.653 |
| Triad<3> | 1373 ms | 1321 ms | FOM: GB/s=0.304741/s MB=402.653 |
| Triad<4> | 1808 ms | 1360 ms | FOM: GB/s=0.296073/s MB=402.653 |
| Triad<5> | 1889 ms | 1404 ms | FOM: GB/s=0.286864/s MB=402.653 |
| Triad<6> | 1653 ms | 1397 ms | FOM: GB/s=0.288239/s MB=402.653 |
Stencil benchmark: based on PR #8458 (ac90be74dd56b18d71f6181684975a83331fccf6). Default tiling was used in all cases. Note: unlike in that PR, the `parallel_for` was launched 100 times to get longer total run times.
| Configuration | Tiles | No Tiles | Default tile dimensions |
|---|---|---|---|
| 2D, LayoutRight, each dim. = 512 | 29.3 ms | 21.1 ms | tile_0=2 tile_1=512 |
| 2D, LayoutRight, each dim. = 1024 | 118 ms | 85.9 ms | tile_0=2 tile_1=1.024k |
| 2D, LayoutRight, each dim. = 2048 | 482 ms | 372 ms | tile_0=2 tile_1=2.048k |
| 2D, LayoutRight, each dim. = 4096 | 1937 ms | 1509 ms | tile_0=2 tile_1=4.096k |
| 2D, LayoutLeft, each dim. = 512 | 29.5 ms | 21.7 ms | tile_0=512 tile_1=2 |
| 2D, LayoutLeft, each dim. = 1024 | 118 ms | 85.3 ms | tile_0=1.024k tile_1=2 |
| 2D, LayoutLeft, each dim. = 2048 | 480 ms | 371 ms | tile_0=2.048k tile_1=2 |
| 2D, LayoutLeft, each dim. = 4096 | 1931 ms | 1504 ms | tile_0=4.096k tile_1=2 |
| 3D, LayoutRight, each dim. = 128 | 328 ms | 233 ms | tile_0=2 tile_1=2 tile_2=128 |
| 3D, LayoutRight, each dim. = 192 | 1048 ms | 795 ms | tile_0=2 tile_1=2 tile_2=192 |
| 3D, LayoutRight, each dim. = 256 | 2505 ms | 1937 ms | tile_0=2 tile_1=2 tile_2=256 |
| 3D, LayoutLeft, each dim. = 128 | 316 ms | 230 ms | tile_0=128 tile_1=2 tile_2=2 |
| 3D, LayoutLeft, each dim. = 192 | 1077 ms | 788 ms | tile_0=192 tile_1=2 tile_2=2 |
| 3D, LayoutLeft, each dim. = 256 | 2557 ms | 1902 ms | tile_0=256 tile_1=2 tile_2=2 |
| 4D, LayoutRight, each dim. = 32 | 229 ms | 151 ms | tile_0=2 tile_1=2 tile_2=2 tile_3=32 |
| 4D, LayoutRight, each dim. = 64 | 3348 ms | 2386 ms | tile_0=2 tile_1=2 tile_2=2 tile_3=64 |
| 4D, LayoutLeft, each dim. = 32 | 216 ms | 161 ms | tile_0=32 tile_1=2 tile_2=2 tile_3=2 |
| 4D, LayoutLeft, each dim. = 64 | 3368 ms | 2514 ms | tile_0=64 tile_1=2 tile_2=2 tile_3=2 |
The title should mention that this only concerns the serial backend.
This work could be useful for multi-core CPUs also.
I'd be curious how you would distribute the execution range for host parallel backends with this implementation.
There are no dependencies between the loop iterations; the main concern is fetching data in a cache-friendly manner. Parallelizing over batches of the "higher level" (outer) iterations could help.
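As a minimal sketch of that idea for a rank-2 range, a hypothetical per-thread helper could hand each thread a contiguous batch of the outermost iterations while keeping the untiled inner loop (this is not an actual Kokkos backend implementation; all names are made up):

```c++
#include <algorithm>

// Hypothetical helper: thread `thread_rank` of `nthreads` handles a contiguous
// batch of the outermost iterations; within the batch the untiled nested loop
// is kept, so the innermost traversal stays cache friendly.
template <class Functor>
void run_batch(int thread_rank, int nthreads, int n0, int n1, Functor body) {
  const int chunk = (n0 + nthreads - 1) / nthreads;
  const int begin = thread_rank * chunk;
  const int end   = std::min(n0, begin + chunk);
  for (int i = begin; i < end; ++i)  // this thread's batch of outer iterations
    for (int j = 0; j < n1; ++j)     // full inner loop (contiguous for LayoutRight)
      body(i, j);
}
```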
Also, the results in #8652 are quite good for the RangePolicy with a for loop inside the functor. One reason could be better vectorization at the -O3 optimization level. I will add `_Pragma("ivdep")` to the innermost loop, as is done in `core/src/impl/KokkosExp_Host_IterateTile.hpp`. I will also re-run that test with the `Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION` and `Kokkos_ENABLE_PRAGMA_IVDEP` flags to see whether the performance of the current MDRangePolicy can be improved for the Serial backend.
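As a rough sketch of where that pragma would go in the untiled nested loops (the guard macro name here is my assumption, mirroring the build option named above; the exact placement in the PR may differ):

```c++
// Hedged sketch: only the innermost loop gets the ivdep hint, and only when
// the corresponding Kokkos option is enabled (guard macro assumed to be
// KOKKOS_ENABLE_PRAGMA_IVDEP).
template <class Functor>
void iterate_no_tiles_2d(int n0, int n1, Functor body) {
  for (int i = 0; i < n0; ++i) {
#if defined(KOKKOS_ENABLE_PRAGMA_IVDEP)
    _Pragma("ivdep")
#endif
    for (int j = 0; j < n1; ++j) body(i, j);
  }
}
```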
Do you have a test for the no-tile version?
I will add test(s).