Kernels stencil blocking may have foobarred performance...

Need to investigate, but recent commits show a massive regression in some cases.

Aug 16 '17 00:08 jeffhammond

That's strange. I used to have tiling for all three stencil implementations, in the good old days of SERIAL, MPI, and OpenMP. But I hardly ever saw a benefit, and it complicated the code, so I eliminated it for all but the serial implementation. In principle it should allow better reuse, but it takes a LARGE grid to see that happen. If performance drops precipitously because of it, there's a pathology (bug).
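For reference, a minimal sketch of what tiling means here, assuming a row-major n x n grid and a radius-1 star with made-up weights (not the PRK coefficients). The function names are hypothetical and not taken from the PRK sources. Both versions visit the same interior points and differ only in traversal order, which is where any reuse benefit would come from.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Radius-1 star stencil at interior point (i, j) of a row-major n x n grid.
// The 0.5 weights are illustrative only.
inline void apply_point(const std::vector<double>& in, std::vector<double>& out,
                        std::size_t n, std::size_t i, std::size_t j)
{
    out[i * n + j] += 0.5 * (in[i * n + j + 1] - in[i * n + j - 1])
                    + 0.5 * (in[(i + 1) * n + j] - in[(i - 1) * n + j]);
}

// Plain row-by-row sweep over the interior.
void stencil_untiled(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t n)
{
    if (n < 3) return;  // no interior points
    for (std::size_t i = 1; i < n - 1; ++i)
        for (std::size_t j = 1; j < n - 1; ++j)
            apply_point(in, out, n, i, j);
}

// Same sweep, visiting the interior in tile x tile blocks.
void stencil_tiled(const std::vector<double>& in, std::vector<double>& out,
                   std::size_t n, std::size_t tile)
{
    if (n < 3) return;                         // no interior points
    tile = std::max<std::size_t>(tile, 1);     // guard against a zero tile size
    for (std::size_t it = 1; it < n - 1; it += tile)
        for (std::size_t jt = 1; jt < n - 1; jt += tile)
            for (std::size_t i = it; i < std::min(n - 1, it + tile); ++i)
                for (std::size_t j = jt; j < std::min(n - 1, jt + tile); ++j)
                    apply_point(in, out, n, i, j);
}
```

Each output point is written exactly once in both versions, so the results should match bit for bit. Any difference between the two is a bug, not a rounding effect.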

Aug 16 '17 05:08 rfvander

This was because omitting the blocking argument meant that measurements used star 2 instead of star 4, but we still have to deal with the fact that huge tile sizes lead to inadequate parallelism. We should branch on (grid_size/tile_size)^2 < num_threads and not bother tiling in that case.
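A minimal sketch of that check, assuming a square grid and square tiles. The helper name is hypothetical and not part of the PRK sources, and rounding the tile count up (so partial edge tiles count) is an assumption of this sketch, not something settled above.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when splitting an n x n grid into
// tile x tile blocks yields at least one tile per thread.
// Returns false when tile == 0 or tile >= n, since tiling is then
// effectively off (at most one tile).
bool tiling_keeps_threads_busy(std::size_t n, std::size_t tile,
                               std::size_t num_threads)
{
    if (tile == 0 || tile >= n) return false;
    const std::size_t per_dim = (n + tile - 1) / tile;  // ceil(n / tile)
    return per_dim * per_dim >= num_threads;
}
```

For example, a 1000 x 1000 grid with 500 x 500 tiles gives only 4 tiles, so with 64 threads this returns false and the caller could fall back to the untiled loop.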

Aug 16 '17 05:08 jeffhammond

Yes, we saw the same with transpose, as you may recall. But I wouldn't do anything automatic. Users should always be allowed to shoot themselves in the foot.

Aug 16 '17 05:08 rfvander

But maybe we can warn them of the bullet holes.

Aug 16 '17 05:08 rfvander

I meant to ask you if you ever get requests for box-shaped stencils (instead of star stencils). For the AMR code I effectively had to support that in MPI (too complicated to explain why, and not worth it), and it was actually very easy. I'd like to add that to our MPI variants.

Aug 16 '17 05:08 rfvander

TBB wins big time on KNL because of tiling. Tiling helps for dimension 2000-16000 with star radius 4.

Aug 16 '17 05:08 jeffhammond

I am the user and I am protecting my feet by making the code disable tiling when it is going to serialize.

Aug 16 '17 05:08 jeffhammond

My code generator is supposed to support the square pattern, but there's a bug in it.

Aug 16 '17 05:08 jeffhammond