Tests/Vectorization Scaling
When I run the case in Tests/Vectorization, the computation time doubles or more as the number of MPI ranks increases from 1 to 10. This is a little confusing, since there is no data communication in this case and the work on each core is the same.
This test isn't set up to do any domain decomposition or parallelization - it's based on FArrayBox, not MultiFab. If you run it on more MPI tasks, every task duplicates all the work and contends for the same resources. I'd expect a slowdown in that case.
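To make the duplication concrete, here is a rough sketch in Python of how much work the node does in total in each case. The function names and the 100^3 box are hypothetical, chosen only for illustration, and the even split is an idealization rather than how a real box decomposition would divide the cells.

```python
def cells_per_rank(n, nranks, decomposed):
    """Cells each rank processes for an n^3 box (illustrative only)."""
    total = n ** 3
    if decomposed:
        # Idealized even split; a real decomposition works on boxes.
        return total // nranks
    # No decomposition: every rank processes the whole box.
    return total


def aggregate_cells(n, nranks, decomposed):
    """Total cells processed across all ranks on the node."""
    return nranks * cells_per_rank(n, nranks, decomposed)


if __name__ == "__main__":
    n, nranks = 100, 10  # hypothetical sizes
    print("replicated:", aggregate_cells(n, nranks, decomposed=False))
    print("decomposed:", aggregate_cells(n, nranks, decomposed=True))
```

With replicated work, the total number of cells touched on the node grows linearly with the rank count, so the load on shared resources such as memory bandwidth grows with it.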
Since all the MPI tasks are doing the same work, I expected the same time, or slightly more because of a lower cache hit rate. However, I find that the computation time doubles or more. This is what confused me.
The cache effects won't necessarily be small, though, if these kernels are spending a significant amount of time streaming data from main memory. In fact, if I reduce the size of the problem to 15^3, then (on my particular processor) I get the same runtime for (1, 2, 4) MPI tasks, as you would expect based on the compute work alone.
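A back-of-envelope estimate of the working set shows why the small problem behaves differently. This is only a sketch: it assumes double-precision data, and the component count, number of arrays, and 32 MiB shared cache are hypothetical placeholders, not values taken from the test or from any particular processor. Treating the cache as a single pool shared by all ranks is also a simplification.

```python
BYTES_PER_DOUBLE = 8  # assumes double-precision data


def working_set_bytes(n, ncomp=1, narrays=1):
    """Bytes one rank touches for narrays arrays of n^3 cells, ncomp components each."""
    return n ** 3 * ncomp * narrays * BYTES_PER_DOUBLE


def fits_in_cache(n, nranks, cache_bytes, ncomp=1, narrays=1):
    """True if the combined working set of all ranks fits in a shared cache."""
    return nranks * working_set_bytes(n, ncomp, narrays) <= cache_bytes


if __name__ == "__main__":
    cache = 32 * 1024 * 1024  # hypothetical 32 MiB shared cache
    for n in (15, 128):  # 128 is a hypothetical "large" size
        for nranks in (1, 2, 4):
            print(n, nranks, working_set_bytes(n), fits_in_cache(n, nranks, cache))
```

Under these assumptions a 15^3 array is only about 27 kB per rank, so several ranks stay cache-resident, while a 128^3 array is 16 MiB per rank and four ranks together no longer fit, which pushes the kernels back to streaming from main memory.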