Feature/DD (WIP)
This is a first PR towards enabling domain decomposition (DD) features in QUDA. The goal of this PR is to enable a red-black decomposition for the Dirac operator.
Remarks
- For now we require the blocks to fit exactly within the local lattice. Generalization is left to a future PR.
- We focus only on the application of the Dirac operator. Optimization of BLAS functions is left to a future PR.
Design
Under domain decomposition, the Dirac operator assumes a block structure, e.g.
$$ D = \begin{bmatrix} D_{rr} & D_{rb} \\ D_{br} & D_{bb} \end{bmatrix} $$
Thus we need a simple strategy for implementing all possible operators and their application to a field. Our strategy is to attach the domain-decomposition specification directly to the vector (spinor) field. For example, the application of $D_{rb}$, i.e. $y = D_{rb} x = P_r D P_b x$, can be expressed by marking the input vector as "black", i.e. $x_b = P_b x$, and the output vector as "red", i.e. $y_r = P_r y$. In pseudocode:
```cpp
x.dd_black_active(); // input restricted to the black blocks: x_b = P_b x
y.dd_red_active();   // output restricted to the red blocks:  y_r = P_r y
applyD(y, x);        // computes y_r = D_rb x_b
```
Then we make the application of $D$ DD-aware, so that it acts only on the active points of the input and output fields.
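To make this concrete, the kernel-side activity test could look something like the sketch below. This is illustrative only: the names (`DDColor`, `dd_is_active`) and the exact parity convention are assumptions, not the actual implementation; the coloring is computed from global coordinates so that it is rank-independent (see the TODO list below).

```cpp
// Illustrative sketch, not the actual QUDA code: assign each site to a
// "red" or "black" block based on the parity of its block coordinates,
// computed from global coordinates so the result is rank-independent.
enum class DDColor { red, black };

inline bool dd_is_active(const int global_x[4], const int block_size[4], DDColor active)
{
  int block_sum = 0;
  for (int d = 0; d < 4; d++) block_sum += global_x[d] / block_size[d];
  DDColor color = (block_sum % 2 == 0) ? DDColor::red : DDColor::black;
  return color == active;
}
```

In the DD-aware dslash, sites where the output field is inactive are then skipped, and neighbor contributions from inactive input sites are dropped, reproducing the projector pattern $y_r = P_r D P_b x$ above.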
Summary of changes:
- `DDParam` is added as a property of lattice fields (used only for `ColorSpinorField`s at the moment).
- ... TODO
TODO list:
- [ ] Add more comments and function documentation
- [x] Add checks for parameters, e.g. the block size should divide the local lattice size (a sketch of such a check follows this list)
- [x] ~Add calculation of first block parity~ Use global coordinates
- [x] Test application of individual pieces, i.e. D_rr, D_rb, D_br, D_bb
- [ ] Test all operators
- [ ] Test that performance without DD is not affected (i.e. compared to current develop)
- [ ] Test MR solver and usage in MG as smoother
- [x] Properly disable comms when not needed (e.g. block fits local lattice)
- [ ] Improve performance by unrolling threads block-wise
- [ ] Disable usage of DD for all PC Mat
- [ ] ...
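For reference, the parameter check mentioned above could be of the following flavor. This is a sketch: the helper name and its arguments are hypothetical, while `errorQuda` is QUDA's standard error macro.

```cpp
// Sketch of the block-size validation referenced in the TODO list above
// (hypothetical helper): each block dimension must divide the local lattice
// extent exactly, and odd block sizes in the x-direction are disallowed
// (see the issues below).
void check_dd_block_size(const int block_size[4], const int local_dim[4])
{
  for (int d = 0; d < 4; d++) {
    if (local_dim[d] % block_size[d] != 0)
      errorQuda("Block size %d does not divide the local lattice size %d in direction %d",
                block_size[d], local_dim[d], d);
  }
  if (block_size[0] % 2 != 0) errorQuda("Odd block sizes in the x-direction are not supported");
}
```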
Current issues:
- [x] ~Tests are failing if the block size is odd in the x direction for the PC operator, e.g. `dslash_test --xdim 6 --ydim 4 --zdim 4 --tdim 4 --dd-block-size 3 2 2 2 --dd-red-black=true --test 0`. It works if not PC, e.g. `--test 2`, or if the block size is odd in any other direction.~ Now it is not allowed to have an odd block size in the x-direction.
- [x] Tests are failing if `export QUDA_REORDER_LOCATION=CPU` is set (#1466).
- [x] ~Tests are failing for MatPC with `Caught signal 8 (Floating point exception: integer divide by zero)`. See e.g. `dslash_test --xdim 8 --ydim 8 --zdim 8 --tdim 8 --dd-red-black=true --test 1`.~ Not testing for PC operators.
Hi @maddyscientist, I worked a bit more on this PR.
I have been running tests for all operators. I'm facing one issue beyond my reach, and it would be good to have your input: `dslash_test` fails for the Mat test of the clover-hasenbusch-twist operator, also on the develop branch.
Here is a reproducer:
```
dslash_test --dslash-type clover-hasenbusch-twist --test Mat
```
It fails with a segmentation fault as follows.
Output:
```
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Rank order is column major (t running fastest)
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from DslashTest
QUDA 1.1.0 (git 1.1.0-1029d62d5-sm_80)
CUDA Driver version = 12010
CUDA Runtime version = 11080
Graphic driver version = 530.30.02
Found device 0: NVIDIA A100-SXM-64GB
Using device 0: NVIDIA A100-SXM-64GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
cublasCreated successfully
[ RUN      ] DslashTest.benchmark
Sending clover field to GPU
Randomizing fields...
Sending gauge field to GPU
Creating cudaSpinor with nParity = 2
Creating cudaSpinorOut with nParity = 2
Sending spinor field to GPU
Source 0: CPU = 2.653920e+06, CUDA = 2.653920e+06
running the following test:
prec   recon  dtest_type  matpc_type  dagger  S_dim       T_dimension  Ls_dimension  dslash_type              niter
single 18     Mat         even_even   0       24/ 24/ 24  24           16            clover_hasenbusch_twist  100
Grid partition info:  X  Y  Z  T
                      0  0  0  0
Tuning...
[lrdn3448:783349:0:783349] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xf0)
```
@sbacchio yep, don't worry about that failing test. I don't think that operator has a non-preconditioned form properly implemented.
This is looking very promising, with the full file parallelization of precision, color, etc. through CMake. As evidenced by the failing builds in the CI, it is still buggy and needs to be fixed.
Some minor comments:
- For file naming, keep to lower case (`double`, not `DOUBLE`)
- For the color filename encoding, I would prefer `_nc_3` for 3 colors
I'm starting to do some testing on this PR, and I see that not all Dirac operators have the file parallelization. Is this something you can do? See the trace below, for a full build of QUDA, we can see that these files dominate compilation time, and result in a much longer compilation time. At the same time, it's clear how much faster the split Dirac operator files compile on a multi-core system 😄
*[attached image: compilation-time trace]*
Hi, yes, sorry, I will do the remaining ones.
I have fixed the failing staggered dslash tests (the failure was due to the long-link field being created erroneously when using regular, unimproved staggered fermions).
Hi Kate, that's an impressive trace :) Ferenc will work on the others soon. About the tests, I still see some staggered tests failing (invert and eigensolve), I guess for similar reasons. Maybe it's better if you have a look at those too :)
Looking great with the latest pushes. Just `dslash_twisted_mass_preconditioned.cu` left, I think.
Great! And I see now that all 7 checks passed :) Thanks @pittlerf! @maddyscientist, do you also have a trace of the compilation time of the current develop? Just to appreciate the overall improvement :)
Checking where we are with this PR: I believe we still need the DD tests added to the ctest list so that DD is tested if enabled?
Hi, apologies for the delay here. We have added tests for all the DD-supported operators to the ctest list; additionally, we wrote a small assert function to prevent using DD with unsupported operators.
Hi @maddyscientist, the last commits by Ferenc a few weeks ago solved all the pending issues and put guards for the non-supported cases. Would be nice to get a final review when you have time :)
Thanks @sbacchio. I think we are good to go.
At the moment the CSCS CI is broken. I want to get that fixed in a separate PR, and then have that merged into this branch before we merge. So if you can hang on another few days.
Thanks @pittlerf for all the work getting this over the line.
@pittlerf and/or @sbacchio , can you re-merge develop and resolve the conflicts? Once you can take care of that and (re-)verify it passes, I'll do one more quick review and get this merged in promptly.
@maddyscientist the staggered tests are failing, independently of DD. Is it something we did or a known issue?
The failure was caused by this change in `tests/staggered_dslash_test_util.h`, where `cpuLong` is created conditionally. Was this change intentional or required? The host CPU staggered code always assumes the presence of the long-link field, even when we're doing naive staggered.
```cpp
if (dslash_type == QUDA_ASQTAD_DSLASH) cpuLong = GaugeField(cpuLongParam);
```
I've noticed another issue that needs to be taken care of, beyond the failure above. It seems that conditional compilation is broken now, e.g. all Dirac operators are presently built, despite the settings of `QUDA_DIRAC_WILSON`, etc. This needs to be fixed before we can merge.
Hi @maddyscientist, I have resolved the compilation issue and attempted to fix the staggered issue. I tried removing the if condition from commit a7fe5bd57249d2fc0fc8533daac131af296cdd38; however, the staggered tests are still failing with the following error:
```
ERROR: Halo depth 3 is greater than local lattice x = {4 2 6 8} (rank 0, host jwb0085.juwels, gauge_field.cpp:85 in void quda::GaugeField::create(const quda::GaugeFieldParam&)())
```
Is this a known issue?
We found this fix. Is it OK, @maddyscientist?
Looks like we have passing staggered tests again, so that's good. The issue remains, however, that all Dirac operators are still being built, regardless of the `DIRAC` CMake settings.
Hi, thanks for the comment. For the `DIRAC` CMake settings, as I understand it, for example for the domain wall operator in `lib/dslash_domain_wall_4d_m5inv.hpp` there is an if statement, `if constexpr (is_enabled<QUDA_DOMAIN_WALL_4D_DSLASH>())`, and if it is not true the code returns an error: `errorQuda("Domain-wall operator has not been built");`. Am I understanding this correctly?
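To illustrate what I mean, here is a stripped-down, self-contained version of that pattern; everything except `is_enabled` and the `errorQuda` message is a stand-in, not the actual QUDA code:

```cpp
#include <cstdio>
#include <cstdlib>

// Stand-ins for the QUDA machinery (illustrative only).
enum DslashType { QUDA_DOMAIN_WALL_4D_DSLASH };

// In QUDA this reflects the QUDA_DIRAC_* CMake settings; here it is
// hard-coded to mimic a build with the operator disabled.
template <DslashType type> constexpr bool is_enabled() { return false; }

// Simplified stand-in for QUDA's variadic errorQuda macro.
#define errorQuda(msg) (fprintf(stderr, "ERROR: %s\n", msg), exit(1))

void ApplyDomainWall4DM5inv()
{
  if constexpr (is_enabled<QUDA_DOMAIN_WALL_4D_DSLASH>()) {
    // kernel instantiation and launch would go here, compiled only in
    // builds where the operator is enabled
  } else {
    errorQuda("Domain-wall operator has not been built");
  }
}
```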
Yes, that's correct.
I have addressed the issue with the Dirac operators being compiled when they shouldn't be, and also fixed issues related to NVSHMEM linking. Things are looking good. I might clean up the dslash code-generation bit in the future, but that's minor work I can do after the merge.
@weinbe2 can you review this PR please?
@sbacchio @pittlerf I'm beginning to go through this PR and I'm very impressed with all of the work done here. When you have the chance, can you please make a wiki page that describes the DD work and the current state of it?
The description can just be what you included in the PR description (remarks/design), and then an example command with an explanation would be appreciated (just something from one of the ctests, for example).
I'd like to better understand what's going on with the `#ifdef SIGNATURE_ONLY` pattern across many of the dslash files; at a high level I can sort of see what it's accomplishing (reducing code duplication), but it's messy from a software-engineering perspective, and I feel like there must be a more C++-y way to accomplish it. Thoughts @maddyscientist?
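My (possibly imprecise) reading of the pattern, with made-up names, is sketched below: the same source is consumed either for its declarations or for its definitions, depending on whether `SIGNATURE_ONLY` is defined.

```cpp
// my_dslash.hpp -- sketch with made-up names, not the actual QUDA files:
// the same file yields either a declaration or a definition depending on
// whether SIGNATURE_ONLY is defined by the including translation unit.
void applyMyDslash(float *out, const float *in)
#ifdef SIGNATURE_ONLY
  ; // consumers that only need the signature see a bare declaration
#else
{
  // heavy kernel definition, compiled once in the designated .cu file
  out[0] = in[0];
}
#endif
```

The textbook C++ alternative would be a plain declaration-only header with the definitions kept in a single translation unit (or `extern template` declarations where the duplication comes from template instantiations), at the cost of maintaining the declarations separately.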
I just completed a visual review, and needless to say this is very impressive work. I left some QUDA style-convention comments and a few questions. I think the most important piece of feedback I have is about `SIGNATURE_ONLY`; I'd really like something neater than that.
My thoughts are all orthogonal to doing actual code testing, so I'll carry on with that next.
FYI @pittlerf re: `SIGNATURE_ONLY`, @maddyscientist is going to look into it briefly and take care of it if there's an easy fix; otherwise we'll merge this and deal with it later. I don't want to postpone merging this PR much longer :)
I just got done with the boolean-op replacement, done by hand instead of with clang-tidy because I quickly figured out that it was going to be much more complicated than I initially thought. I also elected to rename `include/kernels/color_spinor_project_dd.cuh` to `*domain_decomp.cuh` because I kept confusing it with the `_dd`, `_ds`, etc. files for copying between precisions, and I wanted to disambiguate that.