
Better Vector/Tensor Fields in MultiFabs

Open • ax3l opened this issue 4 weeks ago • 4 comments

I was just reminded that for vector and tensor fields, it makes total sense to store them in an AoS format. That way, reading and writing them does not touch multiple far-apart locations in memory, and one can read longer contiguous chunks at once. E.g., a blast from a 14-year-younger project.
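For concreteness, a minimal plain-C++ sketch of the two layouts being discussed (sizes and names are made up; this is not AMReX API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical box size, for illustration only.
constexpr int nx = 64, ny = 64, nz = 64;

// Component-major layout, as an AMReX FArrayBox stores its components:
// all Ex values first, then all Ey, then all Ez.  Accessing E(i,j,k)
// touches three locations that are nx*ny*nz doubles apart.
inline std::size_t soa_index (int i, int j, int k, int c)
{
    return (static_cast<std::size_t>(c)*nz + k)*ny*nx
         + static_cast<std::size_t>(j)*nx + i;
}

// Array-of-structs layout: the three components of one point are adjacent,
// so a read-modify-write of E(i,j,k) is one contiguous 24-byte access.
struct Vec3 { double x, y, z; };
inline std::size_t aos_index (int i, int j, int k)
{
    return (static_cast<std::size_t>(k)*ny + j)*nx + i;
}

int main ()
{
    std::vector<double> soa(std::size_t(3)*nx*ny*nz, 0.0);
    std::vector<Vec3>   aos(std::size_t(nx)*ny*nz, Vec3{0.0, 0.0, 0.0});

    // The same vector update in both layouts:
    soa[soa_index(1,2,3,0)] += 1.0;   // Ex: three far-apart writes in total
    soa[soa_index(1,2,3,1)] += 1.0;   // Ey
    soa[soa_index(1,2,3,2)] += 1.0;   // Ez

    Vec3& e = aos[aos_index(1,2,3)];  // one contiguous 24-byte location
    e.x += 1.0; e.y += 1.0; e.z += 1.0;
    return 0;
}
```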

How would one do this with AMReX MultiFabs for a Yee cell?

As far as I understand it, AMReX MultiFabs:

  • spread their components out in memory and
  • do not support staggering that differs between components.

This is an anti-pattern for efficient memory access to vector/tensor fields on any platform 😅
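For the second point, the usual workaround today is one single-component MultiFab per staggered component. A minimal sketch for the Yee E field in 3D (the index types follow the common Yee convention; a given code may stagger differently):

```cpp
#include <AMReX_MultiFab.H>

// Sketch of today's workaround: one single-component MultiFab per staggered
// component, because a MultiFab has a single index type for all components.
void make_yee_E (amrex::BoxArray const& cell_ba,
                 amrex::DistributionMapping const& dm,
                 int ngrow)
{
    using amrex::IntVect;
    // Ex lives at (i+1/2, j, k): cell-centered in x, nodal in y and z, etc.
    amrex::MultiFab Ex(amrex::convert(cell_ba, IntVect(0,1,1)), dm, 1, ngrow);
    amrex::MultiFab Ey(amrex::convert(cell_ba, IntVect(1,0,1)), dm, 1, ngrow);
    amrex::MultiFab Ez(amrex::convert(cell_ba, IntVect(1,1,0)), dm, 1, ngrow);

    // Each component lives in its own container; ghost exchange, I/O, etc.
    // have to be repeated per component.
    Ex.setVal(0.0);
    Ey.setVal(0.0);
    Ez.setVal(0.0);
}
```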

We should provide a better MultiFab (maybe built on top of it) that overcomes the limitations above -- and, while we are at it, ideally make it level-aware like our particle containers.

ax3l • Dec 01 '25 23:12

For CUDA GPUs, memory is best accessed in 4-, 8-, or 16-byte chunks per thread, with the next thread reading the next chunk of memory. Everything else is slower. If you were to store the E and B fields as a six-component AoS, every read and write would need a shuffle or a shared-memory transpose to reach optimal performance on GPUs.
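A minimal CUDA C++ sketch of that staging step, assuming a hypothetical six-double AoS struct and a made-up kernel (none of this is AMReX or WarpX API):

```cpp
// Illustrative only.  Launch with blockDim.x == TILE.
struct EB6 { double ex, ey, ez, bx, by, bz; };   // 48 bytes per grid point

__global__ void read_aos_staged (EB6 const* __restrict__ field,
                                 double* __restrict__ out, int n)
{
    constexpr int TILE = 128;                    // grid points per block
    __shared__ double tile[6 * TILE];

    const int block_start = blockIdx.x * TILE;
    double const* src = reinterpret_cast<double const*>(field + block_start);

    // Cooperative, coalesced load: consecutive threads read consecutive
    // 8-byte words, regardless of which point/component a word belongs to.
    for (int w = threadIdx.x; w < 6*TILE && block_start + w/6 < n; w += blockDim.x) {
        tile[w] = src[w];
    }
    __syncthreads();

    // Each thread now works on "its" point from shared memory instead of
    // issuing six strided global reads.  (A real kernel would do the actual
    // field update here.)
    const int i = block_start + threadIdx.x;
    if (i < n) {
        double const* p = &tile[6*threadIdx.x];
        out[i] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5];
    }
}
```

With the component-major layout a MultiFab uses, no staging is needed: consecutive threads already read consecutive 8-byte words within each component.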

AlexanderSinn • Dec 01 '25 23:12

Providing a MultiFab that allows different staggering between components would be extremely useful.

For that specifically, I think the current component layout will work if we allow extra buffer elements to be allocated that go unused for some staggerings. Even with the extra buffer overhead, this would simplify the code so much that we would probably switch our code over to it.
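A minimal sketch of that idea, assuming the padding comes from allocating every component on the all-nodal BoxArray (the helper is made up; this is not an existing AMReX facility):

```cpp
#include <AMReX_MultiFab.H>

// Every component is allocated on the all-nodal BoxArray, which is large
// enough for any staggering; components that are cell-centered in a
// direction simply never use the last element in that direction.
amrex::MultiFab make_padded_field (amrex::BoxArray const& cell_ba,
                                   amrex::DistributionMapping const& dm,
                                   int ncomp, int ngrow)
{
    amrex::BoxArray nodal_ba =
        amrex::convert(cell_ba, amrex::IntVect::TheNodeVector());
    return amrex::MultiFab(nodal_ba, dm, ncomp, ngrow);  // e.g. ncomp = 6 for E and B
}
```

Whether FillBoundary etc. still do the right thing for the logically staggered components is exactly the open question raised below.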

BenWibking • Dec 01 '25 23:12

Yes. Basically, treat node-centered data the same as cell-centered data, just with one extra ghost cell in the hi direction and a 0.5 dx offset when converting to and from positions. I'm not entirely sure, however, if this would be treated correctly by FillBoundary, etc.
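A plain-C++ sketch of that half-cell offset (1D, made-up names), with the node-centered data stored like cell-centered data plus one extra element on the hi side:

```cpp
#include <cmath>

struct Grid { double xlo; double dx; };

// Cell centers sit at xlo + (i + 0.5)*dx, nodes at xlo + i*dx.
inline double index_to_pos (Grid const& g, int i, bool nodal)
{
    return g.xlo + (i + (nodal ? 0.0 : 0.5)) * g.dx;
}

// Containing cell for cell-centered data, nearest node for nodal data.
inline int pos_to_index (Grid const& g, double x, bool nodal)
{
    return static_cast<int>(std::floor((x - g.xlo)/g.dx + (nodal ? 0.5 : 0.0)));
}
```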

AlexanderSinn • Dec 01 '25 23:12

> For CUDA GPUs, memory is best accessed in 4-, 8-, or 16-byte chunks per thread, with the next thread reading the next chunk of memory. Everything else is slower. If you were to store the E and B fields as a six-component AoS, every read and write would need a shuffle or a shared-memory transpose to reach optimal performance on GPUs.

That's true. I remember we experimented with float4, and the alignment benefit did not outweigh the extra 25% memory transfer from the padding. Another way to get back to N x 16-byte chunks is to not map threads onto field positions 1:1 but to let each CUDA thread process two adjacent field points... (2 points x 3 doubles = 3 x 16-byte chunks; for float you need to go to 4 points to reach the next common denominator.)
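A CUDA C++ sketch of that two-points-per-thread mapping, assuming a hypothetical 3-double AoS struct and a made-up kernel:

```cpp
// Each thread owns two adjacent AoS points, i.e. 48 bytes, which line up
// as three aligned 16-byte (double2) chunks.
struct Vec3d { double x, y, z; };   // 24 bytes per grid point

__global__ void scale_pairs (Vec3d* __restrict__ field, double s, int npairs)
{
    const int pair = blockIdx.x * blockDim.x + threadIdx.x;
    if (pair >= npairs) { return; }

    // 48*pair is a multiple of 16, so the double2 accesses stay aligned as
    // long as 'field' comes from cudaMalloc (256-byte aligned).
    double2* p = reinterpret_cast<double2*>(field + 2*pair);
    double2 a = p[0];   // (x0, y0)
    double2 b = p[1];   // (z0, x1)
    double2 c = p[2];   // (y1, z1)

    a.x *= s; a.y *= s;
    b.x *= s; b.y *= s;
    c.x *= s; c.y *= s;

    p[0] = a; p[1] = b; p[2] = c;
}
```

Each load instruction is then a full 16-byte access, and every cache line the warp fetches is fully consumed across the three loads, even though any single instruction is strided by 48 bytes.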

That said, FDTD stencils and PIC particle gather and scatter operations are very amenable to shared-memory implementations. For WarpX, we would also need to check again how the spectral solvers access memory; they might operate per component...
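A minimal CUDA C++ sketch of the tile-plus-halo staging such stencils map onto (1D and hypothetical, with a trivial difference standing in for the Yee curl):

```cpp
// Launch with shared memory of (blockDim.x + 2)*sizeof(double).
__global__ void stencil_1d (double const* __restrict__ in,
                            double* __restrict__ out, int n)
{
    extern __shared__ double tile[];              // blockDim.x points + 2 halo
    const int i   = blockIdx.x * blockDim.x + threadIdx.x;
    const int lid = threadIdx.x + 1;              // local index, 1 halo each side

    tile[lid] = (i < n) ? in[i] : 0.0;
    if (threadIdx.x == 0) {
        tile[0] = (i > 0) ? in[i-1] : 0.0;
    }
    if (threadIdx.x == blockDim.x - 1) {
        tile[blockDim.x + 1] = (i + 1 < n) ? in[i+1] : 0.0;
    }
    __syncthreads();

    if (i < n) {
        out[i] = tile[lid + 1] - tile[lid - 1];   // neighbors come from shared memory
    }
}
```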

ax3l • Dec 02 '25 00:12