
Enable support for multiple right hand sides

Open maddyscientist opened this issue 14 years ago • 3 comments

A major optimization worth exploring is whether amortizing the gauge field loads by acting on multiple spinors simultaneously leads to a large speedup. This matters most for the staggered dslash kernel, since it is the kernel most bound by gauge field loads.
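To put rough numbers on the claim above, here is a back-of-the-envelope traffic model. It is only a sketch under stated assumptions, none of which come from the QUDA code: a nearest-neighbour stencil with 8 links per site, uncompressed SU(3) links (18 reals each), perfect gauge reuse across right hand sides, and no spinor reuse. The function names are hypothetical.

```cpp
#include <cassert>

// Idealized memory traffic per lattice site and per right hand side,
// counted in real numbers, for one dslash application.
// Assumptions (illustrative only): 8 links per site, 18 reals per link,
// each link loaded once and shared by all n_rhs right hand sides.
inline double reals_per_site_per_rhs(int spinor_reals, int n_rhs) {
  assert(spinor_reals > 0 && n_rhs > 0);
  const int links = 8;        // forward + backward in 4 dimensions
  const int link_reals = 18;  // 3x3 complex matrix
  const double gauge = static_cast<double>(links * link_reals) / n_rhs;
  const double spinor = 8.0 * spinor_reals  // 8 neighbour spinor loads
                        + spinor_reals;     // 1 result store
  return gauge + spinor;
}

// Ideal speedup of a purely bandwidth-bound kernel when gauge loads
// are amortized over n_rhs right hand sides.
inline double ideal_speedup(int spinor_reals, int n_rhs) {
  return reals_per_site_per_rhs(spinor_reals, 1) /
         reals_per_site_per_rhs(spinor_reals, n_rhs);
}
```

With 6 reals per staggered spinor, this model gives 198 reals per site for one right hand side and 90 for four, an ideal speedup of 2.2x. With 24 reals per Wilson spinor the same step goes from 360 to 252, about 1.43x, which is why the staggered kernel stands to gain the most. Gauge compression and imperfect cache reuse would change these figures.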

A simple way to explore this is to introduce a y dimension to the thread blocks, corresponding to the number of right hand sides.

The easiest way to implement this is probably to create a 5-dimensional spinor field, where the length of the 5th dimension corresponds to the number of right hand sides. I believe the only change required to the dslash kernel indexing would be adding to the spinor field index an offset equal to the y thread index multiplied by the length of the 4-dimensional spinor field.

spinor_index += threadIdx.y * length_4d;  // length_4d: length of the 4-d spinor field

By definition, the gauge field indexing is independent of threadIdx.y.
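The indexing described above can be sketched on the host as follows. This is an emulation in plain C++, not kernel code: `ThreadIdx` stands in for CUDA's built-in `threadIdx`, and the direction-major gauge layout in `gauge_index` is an assumption made for illustration, not QUDA's actual field order.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for CUDA's threadIdx (hypothetical; on the device this is built in).
struct ThreadIdx {
  unsigned x;  // used to derive the 4-d site index in a real kernel
  unsigned y;  // selects the right hand side
};

// Index into the 5-d spinor field: the 4-d site index plus one full
// 4-d field length per right hand side.
inline std::size_t spinor_index_5d(std::size_t site_4d, ThreadIdx t,
                                   std::size_t length_4d) {
  assert(site_4d < length_4d);
  return site_4d + static_cast<std::size_t>(t.y) * length_4d;
}

// Gauge index: depends only on the site and link direction, never on t.y,
// so all right hand sides at a site read the same links.
// The direction-major layout here is an assumed, simplified ordering.
inline std::size_t gauge_index(std::size_t site_4d, int dir,
                               std::size_t length_4d) {
  assert(site_4d < length_4d && dir >= 0);
  return static_cast<std::size_t>(dir) * length_4d + site_4d;
}
```

Because `gauge_index` takes no `ThreadIdx` argument, threads that differ only in `y` request identical gauge addresses, which is what gives the caches a chance to serve the repeated loads.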

Even with this simple change, kernel performance will likely improve significantly, since gauge field loads should get reuse through the L1 / texture cache / L2. Further improvement is likely possible through explicit shared memory management.

Beyond the kernel, there are multiple changes and additions required:

  • Modify the interface to deal with multiple right hand sides
  • Set the texture binding appropriately for the 5-d spinor length
  • Deal with the packing/unpacking of an array of cpu spinor fields
  • Handle multi-GPU packing / unpacking
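As a rough illustration of the cpu-side packing item above, the sketch below copies an array of host 4-d fields into one contiguous buffer, one full 4-d field per right hand side, and back again. The names `Field4d`, `pack_rhs` and `unpack_rhs` are hypothetical, and the sketch ignores the reordering between cpu and gpu field layouts that a real implementation would also need.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Field4d = std::vector<float>;  // hypothetical flat host 4-d spinor field

// Pack equal-length host fields into one contiguous 5-d buffer,
// with the right hand side as the slowest-running index.
inline std::vector<float> pack_rhs(const std::vector<Field4d>& rhs) {
  assert(!rhs.empty());
  const std::size_t length_4d = rhs[0].size();
  std::vector<float> packed;
  packed.reserve(rhs.size() * length_4d);
  for (const Field4d& f : rhs) {
    assert(f.size() == length_4d);  // all fields must have the same length
    packed.insert(packed.end(), f.begin(), f.end());
  }
  return packed;
}

// Split the contiguous 5-d buffer back into n_rhs separate 4-d fields.
inline std::vector<Field4d> unpack_rhs(const std::vector<float>& packed,
                                       std::size_t n_rhs) {
  assert(n_rhs > 0 && packed.size() % n_rhs == 0);
  const std::size_t length_4d = packed.size() / n_rhs;
  std::vector<Field4d> rhs;
  rhs.reserve(n_rhs);
  for (std::size_t i = 0; i < n_rhs; ++i) {
    const float* first = packed.data() + i * length_4d;
    rhs.emplace_back(first, first + length_4d);
  }
  return rhs;
}
```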

Moving to a 5-d spinor field complicates the communications packing and unpacking logic, which currently expects a 4-d spinor field. How to solve this requires some thought. I think a simple solution would be to create the contiguous 5-dimensional spinor field, but also create an array of 4-dimensional spinor fields which are actually pointers to the corresponding parts of the 5-dimensional spinor field. The communications routines can then be called on the 4-dimensional reference spinor fields, while the dslash kernel deals with the 5-dimensional field (aside: this may be a very quick and dirty way to get multi-GPU domain wall).
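A minimal sketch of the reference-field idea, assuming a flat `float` buffer: one contiguous 5-d allocation plus an array of non-owning 4-d views into it. The types `SpinorView4d` and `SpinorField5d` and the routine `set_4d` are hypothetical stand-ins, not QUDA classes; `set_4d` plays the role of any routine that only understands 4-d fields.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-owning 4-d view into one right hand side of the 5-d field.
struct SpinorView4d {
  float* data;
  std::size_t length;
};

// Contiguous 5-d field plus one 4-d reference view per right hand side.
struct SpinorField5d {
  std::vector<float> storage;        // n_rhs * length_4d elements
  std::vector<SpinorView4d> slices;  // slices[i] points into storage

  SpinorField5d(std::size_t n_rhs, std::size_t length_4d)
      : storage(n_rhs * length_4d, 0.0f) {
    slices.reserve(n_rhs);
    for (std::size_t i = 0; i < n_rhs; ++i)
      slices.push_back({storage.data() + i * length_4d, length_4d});
  }

  // Copying would leave the views pointing at the source's storage.
  SpinorField5d(const SpinorField5d&) = delete;
  SpinorField5d& operator=(const SpinorField5d&) = delete;
};

// Stand-in for a routine written for 4-d fields (e.g. communications):
// it only ever sees a single 4-d view.
inline void set_4d(SpinorView4d v, float value) {
  for (std::size_t i = 0; i < v.length; ++i) v.data[i] = value;
}
```

Calling `set_4d(field.slices[1], 2.0f)` writes only the second right hand side's part of `field.storage`, so existing 4-d routines could be reused one slice at a time while a kernel treats `storage` as a single 5-d field.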

maddyscientist, Sep 28 '11 02:09

One thing that we can also do: the mixed precision multi-shift inverter requires refinement for all higher shifts. If we support using different masses for different sources (rhs), we could also use a multi-source solver for the refinement step...

mathiaswagner, Jun 04 '15 18:06

Yes, I've thought of that before. That would be very easy once we have a multi-src solver implemented, and would give a nice little performance boost.

maddyscientist, Jun 04 '15 21:06

Yes, nothing spectacular. Just easier if we think of different masses right away. For HISQ one could even think about different Naik eps. But before doing that we need support for passing that from MILC to QUDA.

mathiaswagner, Jun 04 '15 22:06