stencil blocking may have foobarred performance...
I need to investigate, but the recent commits show a massive performance regression in some cases.
That's strange. I used to have tiling for all three stencil implementations, in the good old days of SERIAL, MPI, and OpenMP. But I hardly ever saw a benefit, and it complicated the code, so I eliminated it for all but the serial implementation. In principle it should allow better data reuse, but it takes a LARGE grid to see that happen. If performance drops precipitously because of it, there's a pathology (bug).
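For reference, a minimal sketch of what tiling a stencil sweep looks like, assuming a row-major n x n grid and a radius-1 star with unit weights. The function name and the weights are made up for illustration and are not the kernel's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical example, not the kernel's real implementation.
// Adds the 4-neighbour (radius-1 star) sum of `in` into `out` on the
// interior of an n x n row-major grid, visiting it in tile x tile blocks.
// A tile size >= n degenerates to one block, i.e. the untiled sweep.
void star1_tiled(std::size_t n, std::size_t tile,
                 const std::vector<double>& in, std::vector<double>& out)
{
    if (n < 3 || tile == 0) return;          // no interior, or no valid tile
    const std::size_t t = std::min(tile, n); // clamp to avoid index overflow
    for (std::size_t it = 1; it < n - 1; it += t) {
        for (std::size_t jt = 1; jt < n - 1; jt += t) {
            // Clip the block at the interior boundary.
            const std::size_t iend = std::min(it + t, n - 1);
            const std::size_t jend = std::min(jt + t, n - 1);
            for (std::size_t i = it; i < iend; ++i) {
                for (std::size_t j = jt; j < jend; ++j) {
                    out[i * n + j] += in[(i - 1) * n + j] + in[(i + 1) * n + j]
                                    + in[i * n + j - 1]   + in[i * n + j + 1];
                }
            }
        }
    }
}
```

The result is the same for any tile size; only the traversal order changes, which is where the reuse would come from once the rows no longer fit in cache.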
The regression appeared because omitting the blocking argument meant that the measurements used star radius 2 instead of star radius 4. But we still have to deal with the fact that huge tile sizes lead to inadequate parallelism. We should branch on (grid_size/tile_size)^2 < num_threads and not bother tiling in that case.
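A rough sketch of that check, with hypothetical helper names. One assumption that differs slightly from the formula above: it rounds the tiles per dimension up, so a partial tile at the edge still counts as a unit of work.

```cpp
#include <cstddef>

// Hypothetical helper: number of square tiles covering a
// grid_size x grid_size grid. Returns 0 if either argument is 0.
std::size_t tile_count(std::size_t grid_size, std::size_t tile_size)
{
    if (grid_size == 0 || tile_size == 0) return 0;
    // Ceiling division without risking overflow in grid_size + tile_size.
    const std::size_t per_dim =
        grid_size / tile_size + (grid_size % tile_size != 0 ? 1 : 0);
    return per_dim * per_dim;
}

// Hypothetical predicate: true when there are fewer tiles than threads,
// i.e. when tiling would leave some threads without work.
bool tiling_starves_threads(std::size_t grid_size, std::size_t tile_size,
                            std::size_t num_threads)
{
    return tile_count(grid_size, tile_size) < num_threads;
}
```

For example, a grid of 1000 with tile size 500 gives 4 tiles, which is too few for 8 threads, while tile size 100 gives 100 tiles.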
Yes, we saw the same with transpose, as you may recall. But I wouldn't do anything automatic. Users should always be allowed to shoot themselves in the foot.
But maybe we can warn them of the bullet holes.
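Something along these lines would warn without overriding the user's choice. This is only a sketch: the function name, the message text, and the convention that a tile size of 0 means "no tiling" are all assumptions.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: returns an empty string when the tile count is
// adequate, otherwise a warning message. It never changes the tile size.
std::string tiling_warning(std::size_t grid_size, std::size_t tile_size,
                           std::size_t num_threads)
{
    if (tile_size == 0) return "";  // assumed convention: 0 disables tiling
    // Tiles per dimension, rounded up so partial edge tiles are counted.
    const std::size_t per_dim =
        grid_size / tile_size + (grid_size % tile_size != 0 ? 1 : 0);
    const std::size_t tiles = per_dim * per_dim;
    if (tiles >= num_threads) return "";
    std::ostringstream msg;
    msg << "WARNING: tile size " << tile_size << " gives only " << tiles
        << " tile(s) for " << num_threads
        << " threads; some threads will be idle";
    return msg.str();
}
```

The caller could print a non-empty result to stderr at startup and then run with whatever tile size was requested.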
I meant to ask you if you ever get requests for box-shaped stencils (instead of star stencils). For the AMR code I effectively had to support that in MPI (too complicated to explain why, and not worth it), and it was actually very easy. I'd like to add that to our MPI variants.
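To make the star/box distinction concrete, here is a small sketch that enumerates the neighbour offsets for each shape. The function name is hypothetical, and excluding the centre point is an assumption.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: (di, dj) offsets touched by a stencil of the given
// radius, centre excluded. A star keeps only points on the two axes; a box
// keeps the whole (2r+1) x (2r+1) square, including the corners.
std::vector<std::pair<int, int>> stencil_offsets(int radius, bool box)
{
    std::vector<std::pair<int, int>> offsets;
    for (int di = -radius; di <= radius; ++di) {
        for (int dj = -radius; dj <= radius; ++dj) {
            if (di == 0 && dj == 0) continue;  // skip the centre
            if (box || di == 0 || dj == 0) offsets.emplace_back(di, dj);
        }
    }
    return offsets;
}
```

A star of radius r touches 4r neighbours and a box touches (2r+1)^2 - 1. The extra corner offsets are the part a distributed version has to obtain from diagonal neighbours in some way.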
TBB wins big time on KNL because of tiling. Tiling helps for grid dimensions from 2000 to 16000 with star radius 4.
I am the user, and I am protecting my feet by making the code disable tiling when it is going to serialize.
My code generator is supposed to support the square pattern, but there's a bug in it.