DDalphaAMG_nd branch convergence issues
@sbacchio @Finkenrath
Over the last few days I've had some time to look into an issue which has been bugging me: I would like to run with the TM2p1p1 branch of sbacchio/DDalphaAMG and the corresponding head commit of the DDalphaAMG_nd branch of Finkenrath/tmLQCD, to help with convergence in the heavy sector. However, I'm finding severe convergence problems as well as further issues. First, a comparison to a working setup:
When I set up the head commit of the master branch of sbacchio/DDalphaAMG together with the head commit of the master branch of Finkenrath/tmLQCD, I get great convergence in the light sector and the expected iteration counts for the given aggregation and scale parameters.
Doing the same with the aforementioned 2+1+1 branches results in solves which do not converge, together with output which I have not seen before:
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 12 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 5 |
| test vectors: 20 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 4 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 2 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +0.007827+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008543+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008483+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008835+0.000000i
[...]
depth: 1, iter: 1, p->H(1,0) = +0.009761+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.072938 seconds
depth: 1, time spent for setting up next coarser operator: 0.057122 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.063018 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.057935 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.082971 seconds
performed 5 iterative setup steps
elapsed time: 13.714705 seconds (2.121091 seconds on coarse grid)
DDalphaAMG setup ran, time 15.61 sec (13.59 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
+----------------------------------------------------------+
depth: 1, iter: 1, p->H(1,0) = +0.008605+0.000000i
| approx. rel. res. after 1 iterations: 2.686549e-02 |
| approx. rel. res. after 2 iterations: 9.386865e-03 |
| approx. rel. res. after 3 iterations: 3.141994e-03 |
| approx. rel. res. after 4 iterations: 1.246548e-03 |
| approx. rel. res. after 5 iterations: 4.854671e-04 |
| approx. rel. res. after 6 iterations: 1.898306e-04 |
| approx. rel. res. after 7 iterations: 7.727864e-05 |
| approx. rel. res. after 8 iterations: 3.056149e-05 |
| approx. rel. res. after 9 iterations: 1.221386e-05 |
| approx. rel. res. after 10 iterations: 4.911786e-06 |
| approx. rel. res. after 11 iterations: 1.944398e-06 |
| approx. rel. res. after 12 iterations: 7.717114e-07 |
| approx. rel. res. after 13 iterations: 3.055015e-07 |
| approx. rel. res. after 14 iterations: 1.214677e-07 |
| approx. rel. res. after 15 iterations: 4.836682e-08 |
| approx. rel. res. after 16 iterations: 1.907075e-08 |
| approx. rel. res. after 17 iterations: 7.568452e-09 |
| approx. rel. res. after 18 iterations: 3.016249e-09 |
| approx. rel. res. after 19 iterations: 1.199059e-09 |
| approx. rel. res. after 20 iterations: 4.778359e-10 |
| approx. rel. res. after 21 iterations: 1.885605e-10 |
| approx. rel. res. after 22 iterations: 7.484878e-11 |
| approx. rel. res. after 23 iterations: 2.994289e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 23 coarse average: 3.96 |
| exact relative residual: ||r||/||b|| = 2.994289e-11 |
| elapsed wall clock time: 14.0737 seconds |
| coarse grid time: 6.6641 seconds (47.4%) |
| consumed core minutes*: 6.00e+01 (solve only) |
| max used mem/MPIproc: 1.93e-01 GB |
+----------------------------------------------------------+
To compare, the working setup looks like this:
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 12 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 5 |
| test vectors: 20 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 4 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 2 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 0.554985 seconds
depth: 1, time spent for setting up next coarser operator: 0.043112 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.044045 seconds
depth: 0, bootstrap step number 2...
depth: 0, time spent for setting up next coarser operator: 0.558093 seconds
depth: 1, time spent for setting up next coarser operator: 0.045157 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.031808 seconds
depth: 0, bootstrap step number 3...
depth: 0, time spent for setting up next coarser operator: 0.556642 seconds
[...]
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.029956 seconds
depth: 0, bootstrap step number 5...
depth: 0, time spent for setting up next coarser operator: 0.556980 seconds
depth: 1, time spent for setting up next coarser operator: 0.059933 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.028399 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.028356 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.033057 seconds
performed 5 iterative setup steps
elapsed time: 25.091544 seconds (12.091816 seconds on coarse grid)
DDalphaAMG setup ran, time 27.47 sec (44.02 % on coarse grid)
depth: 0, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 1, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 2, updating mu to 0.000000 on even sites and 0.000000 on odd sites
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 2.979074e-02 |
| approx. rel. res. after 2 iterations: 8.042268e-03 |
| approx. rel. res. after 3 iterations: 1.584980e-03 |
| approx. rel. res. after 4 iterations: 3.340151e-04 |
| approx. rel. res. after 5 iterations: 7.525576e-05 |
| approx. rel. res. after 6 iterations: 1.551435e-05 |
| approx. rel. res. after 7 iterations: 3.158749e-06 |
| approx. rel. res. after 8 iterations: 7.007767e-07 |
| approx. rel. res. after 9 iterations: 1.494747e-07 |
| approx. rel. res. after 10 iterations: 3.354428e-08 |
| approx. rel. res. after 11 iterations: 7.172643e-09 |
| approx. rel. res. after 12 iterations: 1.493532e-09 |
| approx. rel. res. after 13 iterations: 3.296716e-10 |
| approx. rel. res. after 14 iterations: 7.064493e-11 |
| approx. rel. res. after 15 iterations: 1.588326e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 15 coarse average: 15.67 |
| exact relative residual: ||r||/||b|| = 1.588326e-11 |
| elapsed wall clock time: 1.5327 seconds |
| coarse grid time: 0.6300 seconds (41.1%) |
| consumed core minutes*: 6.54e+00 (solve only) |
| max used mem/MPIproc: 1.29e-01 GB |
+----------------------------------------------------------+
and is significantly faster, as you can see.
Have you seen this behaviour?
I think the problem is here:
DDalphaAMG setup ran, time 15.61 sec (13.59 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
There is a big change in mu on the odd sites; somehow a wrong g_mu3 is used in the setup phase. What is your input file? Which executable are you using?
This is in the HMC, so I would actually expect the problematic output to be correct. It comes from the following setup:
BeginDDalphaAMG
MGBlockX = 3
MGBlockY = 3
MGBlockZ = 3
MGBlockT = 3
MGSetupIter = 5
MGCoarseSetupIter = 3
MGNumberOfVectors = 20
MGNumberOfLevels = 3
MGCoarseMuFactor = 3
MGdtauUpdate = 0.0624
MGUpdateSetupIter = 1
MGOMPNumThreads = 1
EndDDalphaAMG
and the following monomial triggers the first solve with ddalphaamg:
BeginMonomial CLOVERDETRATIO
Timescale = 2
kappa = 0.1400645
2KappaMu = 0.001120516
# numerator shift
rho = 0.02016936
# denominator shift, should match CLOVERDET shift
rho2 = 0.10420836
CSW = 1.74
MaxSolverIterations = 60000
AcceptancePrecision = 1.e-21
ForcePrecision = 1.e-18
Name = cloverdetratio1light
solver = ddalphaamg
EndMonomial
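For what it's worth, the updated mu values printed above are exactly what I would expect from this monomial, assuming the interface passes mu = 2KappaMu/(2*kappa), adds the denominator shift rho2 on the odd sites, and multiplies by MGCoarseMuFactor at the coarsest level (that reading of the conventions is my assumption; the numbers are from the input and logs above):

```c
/* hedged sanity check of the mu values printed after the DDalphaAMG setup;
   the even/odd convention assumed here is NOT confirmed from the code */
#include <stdio.h>

int main(void) {
  const double kappa     = 0.1400645;   /* from the monomial above */
  const double twoKmu    = 0.001120516; /* 2KappaMu                */
  const double rho2      = 0.10420836;  /* denominator shift       */
  const double cmufactor = 3.0;         /* MGCoarseMuFactor        */

  printf("mu even (depth 0/1): %f\n", twoKmu / (2.0 * kappa));          /* 0.004000 */
  printf("mu odd  (depth 0/1): %f\n", (twoKmu + rho2) / (2.0 * kappa)); /* 0.376001 */
  printf("mu even (depth 2):   %f\n", cmufactor * twoKmu / (2.0 * kappa));          /* 0.012000 */
  printf("mu odd  (depth 2):   %f\n", cmufactor * (twoKmu + rho2) / (2.0 * kappa)); /* 1.128004 */
  return 0;
}
```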
When I use the problematic version in invert to find optimal parameters, I get the same problems:
DDalphaAMG cnfg set, plaquette 5.432070e-01
DDalphaAMG running setup
initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 0.919600 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.021723 seconds
initial coarse grid correction is defined
elapsed time: 8.644193 seconds
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 12 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 4 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 4 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 2 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.028000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +0.007809+0.000000i
[...]
depth: 1, iter: 1, p->H(1,0) = +0.009952+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009985+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009946+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.105289 seconds
depth: 1, time spent for setting up next coarser operator: 0.918843 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.019513 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.019073 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.021341 seconds
performed 4 iterative setup steps
elapsed time: 21.283853 seconds (2.872039 seconds on coarse grid)
DDalphaAMG setup ran, time 29.94 sec (9.59 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 6.601027e-02 |
| approx. rel. res. after 2 iterations: 3.342283e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009991+0.000000i
| approx. rel. res. after 3 iterations: 2.425828e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009872+0.000000i
| approx. rel. res. after 4 iterations: 1.956784e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009959+0.000000i
| approx. rel. res. after 5 iterations: 1.715145e-02 |
[...] -> no convergence before 600 iterations
while the master branch works rather better:
initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 1.172197 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.116010 seconds
initial coarse grid correction is defined
elapsed time: 4.875110 seconds
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 6 6 12 12 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 4 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 2 2 4 4 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 1 1 2 2 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.028000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 1.151630 seconds
depth: 1, time spent for setting up next coarser operator: 0.109204 seconds
depth: 1, bootstrap step number 1...
[...]
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.104408 seconds
performed 4 iterative setup steps
elapsed time: 62.497544 seconds (38.668907 seconds on coarse grid)
DDalphaAMG setup ran, time 67.38 sec (57.39 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 5.388873e-02 |
| approx. rel. res. after 2 iterations: 1.388262e-02 |
| approx. rel. res. after 3 iterations: 3.364761e-03 |
| approx. rel. res. after 4 iterations: 8.359057e-04 |
| approx. rel. res. after 5 iterations: 1.990664e-04 |
| approx. rel. res. after 6 iterations: 4.952127e-05 |
| approx. rel. res. after 7 iterations: 1.263903e-05 |
| approx. rel. res. after 8 iterations: 3.351799e-06 |
| approx. rel. res. after 9 iterations: 8.567047e-07 |
| approx. rel. res. after 10 iterations: 2.091744e-07 |
| approx. rel. res. after 11 iterations: 5.094827e-08 |
| approx. rel. res. after 12 iterations: 1.216494e-08 |
| approx. rel. res. after 13 iterations: 2.904565e-09 |
| approx. rel. res. after 14 iterations: 6.856662e-10 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 14 coarse average: 292.79 |
| exact relative residual: ||r||/||b|| = 6.856662e-10 |
| elapsed wall clock time: 6.2996 seconds |
| coarse grid time: 4.8121 seconds (76.4%) |
| consumed core minutes*: 1.34e+01 (solve only) |
| max used mem/MPIproc: 2.78e-01 GB |
+----------------------------------------------------------+
Note that the two runs above differ by a factor of two in the number of processes. However, I see the same problems with matching process counts; I just don't have results for this particular, exemplary set of parameters.
Hmm, I don't like this. It's something we didn't notice in the runs for the Nf=2+1+1 ensemble, and we use the same package setup.
Something you could try, but I don't know if it will work, is to link the master branch of tmLQCD against the DDalphaAMG_nd branch of DDalphaAMG. That way we can check whether the problem is in the interface or in the solver.
I will check the changes I made and try to come up with some ideas.
Is it because I haven't specified MGNumberOfShifts = 4?
Something you could try, but I don't know if it will work, is to link the master branch of tmLQCD against the DDalphaAMG_nd branch of DDalphaAMG. That way we can check whether the problem is in the interface or in the solver.
Will test this out.
It seems that the problem is in DDalphaAMG, rather than the interface. Using the master branch of Finkenrath/tmLQCD together with the TM2p1p1 branch of sbacchio/DDalphaAMG has the same problems as described above:
Problematic:
depth: 1, iter: 1, p->H(1,0) = +0.009670+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009650+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009739+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.073741 seconds
depth: 1, time spent for setting up next coarser operator: 0.042813 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.048123 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036359 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.130586 seconds
performed 5 iterative setup steps
elapsed time: 13.709875 seconds (2.341967 seconds on coarse grid)
DDalphaAMG setup ran, time 15.94 sec (14.69 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
+----------------------------------------------------------+
depth: 1, iter: 1, p->H(1,0) = +0.008553+0.000000i
| approx. rel. res. after 1 iterations: 2.693876e-02 |
| approx. rel. res. after 2 iterations: 9.422674e-03 |
| approx. rel. res. after 3 iterations: 3.136621e-03 |
| approx. rel. res. after 4 iterations: 1.244779e-03 |
| approx. rel. res. after 5 iterations: 4.886695e-04 |
| approx. rel. res. after 6 iterations: 1.909823e-04 |
| approx. rel. res. after 7 iterations: 7.708101e-05 |
| approx. rel. res. after 8 iterations: 3.028029e-05 |
| approx. rel. res. after 9 iterations: 1.209484e-05 |
| approx. rel. res. after 10 iterations: 4.876731e-06 |
| approx. rel. res. after 11 iterations: 1.936528e-06 |
| approx. rel. res. after 12 iterations: 7.807262e-07 |
| approx. rel. res. after 13 iterations: 3.124696e-07 |
| approx. rel. res. after 14 iterations: 1.238244e-07 |
| approx. rel. res. after 15 iterations: 4.957753e-08 |
| approx. rel. res. after 16 iterations: 1.986782e-08 |
| approx. rel. res. after 17 iterations: 7.987017e-09 |
| approx. rel. res. after 18 iterations: 3.190570e-09 |
| approx. rel. res. after 19 iterations: 1.264548e-09 |
| approx. rel. res. after 20 iterations: 5.055527e-10 |
| approx. rel. res. after 21 iterations: 2.021383e-10 |
| approx. rel. res. after 22 iterations: 8.120851e-11 |
| approx. rel. res. after 23 iterations: 3.276034e-11 |
| approx. rel. res. after 24 iterations: 1.314241e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 24 coarse average: 3.96 |
| exact relative residual: ||r||/||b|| = 1.314241e-11 |
| elapsed wall clock time: 10.9579 seconds |
| coarse grid time: 6.8740 seconds (62.7%) |
| consumed core minutes*: 4.68e+01 (solve only) |
| max used mem/MPIproc: 1.93e-01 GB |
+----------------------------------------------------------+
Unproblematic (master + master):
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.039350 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036769 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.034938 seconds
performed 5 iterative setup steps
elapsed time: 26.075408 seconds (12.331877 seconds on coarse grid)
DDalphaAMG setup ran, time 28.19 sec (43.75 % on coarse grid)
depth: 0, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 1, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 2, updating mu to 0.000000 on even sites and 0.000000 on odd sites
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 2.981504e-02 |
| approx. rel. res. after 2 iterations: 8.115737e-03 |
| approx. rel. res. after 3 iterations: 1.613163e-03 |
| approx. rel. res. after 4 iterations: 3.403916e-04 |
| approx. rel. res. after 5 iterations: 6.901793e-05 |
| approx. rel. res. after 6 iterations: 1.509629e-05 |
| approx. rel. res. after 7 iterations: 3.174391e-06 |
| approx. rel. res. after 8 iterations: 6.519720e-07 |
| approx. rel. res. after 9 iterations: 1.452323e-07 |
| approx. rel. res. after 10 iterations: 3.097001e-08 |
| approx. rel. res. after 11 iterations: 6.925372e-09 |
| approx. rel. res. after 12 iterations: 1.462020e-09 |
| approx. rel. res. after 13 iterations: 3.030030e-10 |
| approx. rel. res. after 14 iterations: 6.678557e-11 |
| approx. rel. res. after 15 iterations: 1.420444e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 15 coarse average: 16.67 |
| exact relative residual: ||r||/||b|| = 1.420444e-11 |
| elapsed wall clock time: 1.6075 seconds |
| coarse grid time: 0.5843 seconds (36.3%) |
| consumed core minutes*: 6.86e+00 (solve only) |
| max used mem/MPIproc: 1.29e-01 GB |
+----------------------------------------------------------+
@sunpho84 This could be the reason why your test simulation on Marconi A2 was even slower than expected and why inversions were not converging if done outside of the HMC. If I remember correctly, we set up the TM2p1p1 branch of DDalphaAMG as well as the DDalphaAMG_nd branch of tmLQCD, correct?
Yes, I was using your suggestion, that is:
https://github.com/Finkenrath/tmLQCD/tree/DDalphaAMG_nd
linked against
https://github.com/sbacchio/DDalphaAMG/commits/TM2p1p1
OK, I will start working on this today. My guess is that I broke the e/o preconditioning for the smoother when an odd-sized block is used. The point is that everything works fine in our runs and I've never noticed convergence issues, so the problem should be in some "special" case that I didn't check.
@kostrzewa To confirm this, could you please try to run with an even-sized block, like 4 3 3 3?
Thanks!
Would 6x4x4x4 be okay too?
sorry, I meant 6x3x3x3
Yes, that should be fine! And then maybe we should try to turn off the e/o and then the SSE.
You turn off the e/o by changing line 989 of init.c in DDalphaAMG; the SSE you turn off from the Makefile.
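Roughly, the edit amounts to this sketch (the exact code at that line may differ between versions; the flag is the g.odd_even global that shows up in the assertion further below):

```c
/* init.c (DDalphaAMG), around the line mentioned above: hard-disable the
   odd-even preconditioning instead of taking it from the input parameters */
g.odd_even = 0;
```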
So with 6x3x3x3 I get the same p->H(1,0) messages which I had not seen before. Turning off only the e/o preconditioning then fails at startup:
warning: The SSE implementation is based on the odd-even preconditioned code.
Switch on odd-even preconditioning in the input file.
error: assertion "g.odd_even" failed (build/gsrc/init.c:1092)
bad choice of input parameters (please read the user manual in /doc).
So I need to disable both SSE and e/o.
And that fails to compile:
build/gsrc/coarse_operator_float.c(47): error: identifier "SIMD_LENGTH_float" is undefined
int column_offset = 2*SIMD_LENGTH_float*((l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
^
build/gsrc/coarse_operator_float.c(55): error: identifier "SIMD_LENGTH_float" is undefined
int column_offset = SIMD_LENGTH_float*((2*l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
^
Trying a clean build.
Nope.
@sunpho84 if you're still interested in the A40.40 run (or was it A30.40?), you can try the master branch of sbacchio/DDalphaAMG together with the master branch of Finkenrath/tmLQCD. It might be that this works better. (we also had an odd kind of blocking, correct?)
Ah right, of course! I forgot about that: the SSE is based on the e/o. Removing both should work: e/o = 0 and a Makefile without -DSSE in OPT_VERSION_FLAGS. Since you are editing the Makefile anyway, could you please also enable -DDEBUG in OPT_VERSION_FLAGS?
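A sketch of the Makefile edit (the surrounding flags are elided here, since they depend on the build setup):

```diff
 # DDalphaAMG Makefile: build without the SSE kernels, with debug output
-OPT_VERSION_FLAGS = ... -DSSE ...
+OPT_VERSION_FLAGS = ... -DDEBUG ...
```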
I'm really sorry to make you try things, but I've not been able to replicate your problem.
I tried to disable both e/o and SSE; the result is that SIMD_LENGTH_float is undefined...
@sunpho84 if you're still interested in the A40.40 run (or was it A30.40?), you can try the master branch of sbacchio/DDalphaAMG together with the master branch of Finkenrath/tmLQCD. It might be that this works better. (we also had an odd kind of blocking, correct?)
I thought that TM2p1p1 was the correct one for Nf=2+1+1?
Well, yes, but if you don't run with DDalphaAMG in the heavy sector, then you don't need the extra stuff.
@sbacchio Okay, I think I might have to give up for now. I think there might be a compiler issue on the machine that I was trying this on.
@sunpho84 On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.
@sbacchio So you tried to reproduce this on a 24c48 lattice with the 3x3x3x3 aggregation? If you can't reproduce it, then the problem is probably on my side. There are some odd things going on on the machine that I've been using. If I get a chance, I'll compile with GCC to see whether that works.
@sunpho84 On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.
Yes, in the old logs; see e.g. /marconi_work/INF17_lqcd123_0/sanfo/hmcnf2p1p1/A40.40/logs/log_mg_1490524967
Then I tried a few variations of the settings (following some of @sbacchio's suggestions) and this warning disappeared; see the subsequent logs in the logs/ folder.
Sorry, yesterday I had to leave early.
So I've now pushed a version which can be compiled without SSE and which contains a possible bug fix. I'm trying to compare the two versions, but I made so many changes that it is hard to find the right place.
@sunpho84 Can you remind me what the differences are between the runs before and after p->H(1,0) appeared?
@kostrzewa I didn't have exactly that configuration, but trying with what I have, I've not been able to reproduce the p->H(1,0) warning.
It looks to me as if it happens on a random basis. Here is a sample:
+----------------------------------------------------------+
| 2-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.937588 |
| csw: +0.000000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 80 40 40 40 |
| local lattice: 4 10 10 10 |
| block lattice: 4 5 5 5 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 20 8 8 8 |
| local lattice: 1 2 2 2 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +nan+0.000000i
[...]
OK, I can confirm that the construction of the coarse operator is broken when an odd size is used in the fastest-running index.
There are two workarounds at the moment:
- either use a block size which is even in X,
- or comment out the following lines in the file vectorization_control.h (see the sketch after this list):
#define INTERPOLATION_OPERATOR_LAYOUT_OPTIMIZED_float
#define INTERPOLATION_SETUP_LAYOUT_OPTIMIZED_float
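After that edit, the relevant part of vectorization_control.h would simply read (a sketch; the rest of the header stays untouched):

```c
/* disable the optimized interpolation layouts; these are what break when
   the fastest-running block dimension is odd */
/* #define INTERPOLATION_OPERATOR_LAYOUT_OPTIMIZED_float */
/* #define INTERPOLATION_SETUP_LAYOUT_OPTIMIZED_float */
```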
I hope to solve it today!
It should be fixed.
@kostrzewa can you check if now it works? :)
@sbacchio I'm checking this now, thanks!