DDalphaAMG_nd branch convergence issues
@sbacchio @Finkenrath
Over the last few days I've had some time to look into an issue which has been bugging me: I would like to run with the TM2p1p1 branch of sbacchio/DDalphaAMG and the corresponding head commit of the DDalphaAMG_nd branch of Finkenrath/tmLQCD, to help with convergence in the heavy sector. However, I'm finding severe convergence problems as well as further issues. First, a comparison to a working setup:
When I set up the head commit of the master branch of sbacchio/DDalphaAMG together with the head commit of the master branch of Finkenrath/tmLQCD, I get great convergence in the light sector and the expected iteration counts for the given aggregation and scale parameters.
Doing the same with the aforementioned 2+1+1 branches results in solves which do not converge, together with output which I have not seen before:
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 12 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 5 |
| test vectors: 20 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 4 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 2 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +0.007827+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008543+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008483+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008835+0.000000i
[...]
depth: 1, iter: 1, p->H(1,0) = +0.009761+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.072938 seconds
depth: 1, time spent for setting up next coarser operator: 0.057122 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.063018 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.057935 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.082971 seconds
performed 5 iterative setup steps
elapsed time: 13.714705 seconds (2.121091 seconds on coarse grid)
DDalphaAMG setup ran, time 15.61 sec (13.59 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
+----------------------------------------------------------+
depth: 1, iter: 1, p->H(1,0) = +0.008605+0.000000i
| approx. rel. res. after 1 iterations: 2.686549e-02 |
| approx. rel. res. after 2 iterations: 9.386865e-03 |
| approx. rel. res. after 3 iterations: 3.141994e-03 |
| approx. rel. res. after 4 iterations: 1.246548e-03 |
| approx. rel. res. after 5 iterations: 4.854671e-04 |
| approx. rel. res. after 6 iterations: 1.898306e-04 |
| approx. rel. res. after 7 iterations: 7.727864e-05 |
| approx. rel. res. after 8 iterations: 3.056149e-05 |
| approx. rel. res. after 9 iterations: 1.221386e-05 |
| approx. rel. res. after 10 iterations: 4.911786e-06 |
| approx. rel. res. after 11 iterations: 1.944398e-06 |
| approx. rel. res. after 12 iterations: 7.717114e-07 |
| approx. rel. res. after 13 iterations: 3.055015e-07 |
| approx. rel. res. after 14 iterations: 1.214677e-07 |
| approx. rel. res. after 15 iterations: 4.836682e-08 |
| approx. rel. res. after 16 iterations: 1.907075e-08 |
| approx. rel. res. after 17 iterations: 7.568452e-09 |
| approx. rel. res. after 18 iterations: 3.016249e-09 |
| approx. rel. res. after 19 iterations: 1.199059e-09 |
| approx. rel. res. after 20 iterations: 4.778359e-10 |
| approx. rel. res. after 21 iterations: 1.885605e-10 |
| approx. rel. res. after 22 iterations: 7.484878e-11 |
| approx. rel. res. after 23 iterations: 2.994289e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 23 coarse average: 3.96 |
| exact relative residual: ||r||/||b|| = 2.994289e-11 |
| elapsed wall clock time: 14.0737 seconds |
| coarse grid time: 6.6641 seconds (47.4%) |
| consumed core minutes*: 6.00e+01 (solve only) |
| max used mem/MPIproc: 1.93e-01 GB |
+----------------------------------------------------------+
To compare, the working setup looks like this:
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 12 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 5 |
| test vectors: 20 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 4 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 2 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 0.554985 seconds
depth: 1, time spent for setting up next coarser operator: 0.043112 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.044045 seconds
depth: 0, bootstrap step number 2...
depth: 0, time spent for setting up next coarser operator: 0.558093 seconds
depth: 1, time spent for setting up next coarser operator: 0.045157 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.031808 seconds
depth: 0, bootstrap step number 3...
depth: 0, time spent for setting up next coarser operator: 0.556642 seconds
[...]
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.029956 seconds
depth: 0, bootstrap step number 5...
depth: 0, time spent for setting up next coarser operator: 0.556980 seconds
depth: 1, time spent for setting up next coarser operator: 0.059933 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.028399 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.028356 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.033057 seconds
performed 5 iterative setup steps
elapsed time: 25.091544 seconds (12.091816 seconds on coarse grid)
DDalphaAMG setup ran, time 27.47 sec (44.02 % on coarse grid)
depth: 0, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 1, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 2, updating mu to 0.000000 on even sites and 0.000000 on odd sites
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 2.979074e-02 |
| approx. rel. res. after 2 iterations: 8.042268e-03 |
| approx. rel. res. after 3 iterations: 1.584980e-03 |
| approx. rel. res. after 4 iterations: 3.340151e-04 |
| approx. rel. res. after 5 iterations: 7.525576e-05 |
| approx. rel. res. after 6 iterations: 1.551435e-05 |
| approx. rel. res. after 7 iterations: 3.158749e-06 |
| approx. rel. res. after 8 iterations: 7.007767e-07 |
| approx. rel. res. after 9 iterations: 1.494747e-07 |
| approx. rel. res. after 10 iterations: 3.354428e-08 |
| approx. rel. res. after 11 iterations: 7.172643e-09 |
| approx. rel. res. after 12 iterations: 1.493532e-09 |
| approx. rel. res. after 13 iterations: 3.296716e-10 |
| approx. rel. res. after 14 iterations: 7.064493e-11 |
| approx. rel. res. after 15 iterations: 1.588326e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 15 coarse average: 15.67 |
| exact relative residual: ||r||/||b|| = 1.588326e-11 |
| elapsed wall clock time: 1.5327 seconds |
| coarse grid time: 0.6300 seconds (41.1%) |
| consumed core minutes*: 6.54e+00 (solve only) |
| max used mem/MPIproc: 1.29e-01 GB |
+----------------------------------------------------------+
and is significantly faster, as you can see.
Have you seen this behaviour?
I think the problem is here:
DDalphaAMG setup ran, time 15.61 sec (13.59 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
There is a big change in mu on the odd sites; somehow a wrong g_mu3 is used in the setup phase. What is your input file? Which executable are you using?
This is in the HMC, so I would actually expect the problematic output to be correct. It comes from the following setup:
BeginDDalphaAMG
MGBlockX = 3
MGBlockY = 3
MGBlockZ = 3
MGBlockT = 3
MGSetupIter = 5
MGCoarseSetupIter = 3
MGNumberOfVectors = 20
MGNumberOfLevels = 3
MGCoarseMuFactor = 3
MGdtauUpdate = 0.0624
MGUpdateSetupIter = 1
MGOMPNumThreads = 1
EndDDalphaAMG
and the following monomial triggers the first solve with ddalphaamg:
BeginMonomial CLOVERDETRATIO
Timescale = 2
kappa = 0.1400645
2KappaMu = 0.001120516
# numerator shift
rho = 0.02016936
# denominator shift, should match CLOVERDET shift
rho2 = 0.10420836
CSW = 1.74
MaxSolverIterations = 60000
AcceptancePrecision = 1.e-21
ForcePrecision = 1.e-18
Name = cloverdetratio1light
solver = ddalphaamg
EndMonomial
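For what it's worth, the updated mu values printed above are exactly what I would expect from this monomial, assuming the interface passes mu = 2KappaMu/(2*kappa), adds the denominator shift rho2 on the odd sites, and multiplies by MGCoarseMuFactor at the coarsest level (that reading of the conventions is my assumption; the numbers are from the input and logs above):

```c
/* hedged sanity check of the mu values printed after the DDalphaAMG setup;
   the even/odd convention assumed here is NOT confirmed from the code */
#include <stdio.h>

int main(void) {
  const double kappa     = 0.1400645;   /* from the monomial above */
  const double twoKmu    = 0.001120516; /* 2KappaMu                */
  const double rho2      = 0.10420836;  /* denominator shift       */
  const double cmufactor = 3.0;         /* MGCoarseMuFactor        */

  printf("mu even (depth 0/1): %f\n", twoKmu / (2.0 * kappa));          /* 0.004000 */
  printf("mu odd  (depth 0/1): %f\n", (twoKmu + rho2) / (2.0 * kappa)); /* 0.376001 */
  printf("mu even (depth 2):   %f\n", cmufactor * twoKmu / (2.0 * kappa));          /* 0.012000 */
  printf("mu odd  (depth 2):   %f\n", cmufactor * (twoKmu + rho2) / (2.0 * kappa)); /* 1.128004 */
  return 0;
}
```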
When I use the problematic version in invert to find optimal parameters, I get the same problems:
DDalphaAMG cnfg set, plaquette 5.432070e-01
DDalphaAMG running setup
initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 0.919600 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.021723 seconds
initial coarse grid correction is defined
elapsed time: 8.644193 seconds
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 12 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 4 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 4 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 2 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.028000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +0.007809+0.000000i
[...]
depth: 1, iter: 1, p->H(1,0) = +0.009952+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009985+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009946+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.105289 seconds
depth: 1, time spent for setting up next coarser operator: 0.918843 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.019513 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.019073 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.021341 seconds
performed 4 iterative setup steps
elapsed time: 21.283853 seconds (2.872039 seconds on coarse grid)
DDalphaAMG setup ran, time 29.94 sec (9.59 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 6.601027e-02 |
| approx. rel. res. after 2 iterations: 3.342283e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009991+0.000000i
| approx. rel. res. after 3 iterations: 2.425828e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009872+0.000000i
| approx. rel. res. after 4 iterations: 1.956784e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009959+0.000000i
| approx. rel. res. after 5 iterations: 1.715145e-02 |
[...] -> no convergence before 600 iterations
while the master branch works rather better:
initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 1.172197 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.116010 seconds
initial coarse grid correction is defined
elapsed time: 4.875110 seconds
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 6 6 12 12 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 4 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 2 2 4 4 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 1 1 2 2 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.028000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 1.151630 seconds
depth: 1, time spent for setting up next coarser operator: 0.109204 seconds
depth: 1, bootstrap step number 1...
[...]
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.104408 seconds
performed 4 iterative setup steps
elapsed time: 62.497544 seconds (38.668907 seconds on coarse grid)
DDalphaAMG setup ran, time 67.38 sec (57.39 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 5.388873e-02 |
| approx. rel. res. after 2 iterations: 1.388262e-02 |
| approx. rel. res. after 3 iterations: 3.364761e-03 |
| approx. rel. res. after 4 iterations: 8.359057e-04 |
| approx. rel. res. after 5 iterations: 1.990664e-04 |
| approx. rel. res. after 6 iterations: 4.952127e-05 |
| approx. rel. res. after 7 iterations: 1.263903e-05 |
| approx. rel. res. after 8 iterations: 3.351799e-06 |
| approx. rel. res. after 9 iterations: 8.567047e-07 |
| approx. rel. res. after 10 iterations: 2.091744e-07 |
| approx. rel. res. after 11 iterations: 5.094827e-08 |
| approx. rel. res. after 12 iterations: 1.216494e-08 |
| approx. rel. res. after 13 iterations: 2.904565e-09 |
| approx. rel. res. after 14 iterations: 6.856662e-10 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 14 coarse average: 292.79 |
| exact relative residual: ||r||/||b|| = 6.856662e-10 |
| elapsed wall clock time: 6.2996 seconds |
| coarse grid time: 4.8121 seconds (76.4%) |
| consumed core minutes*: 1.34e+01 (solve only) |
| max used mem/MPIproc: 2.78e-01 GB |
+----------------------------------------------------------+
Note that the two runs above differ by a factor of two in the number of processes. However, I see the same problems with matching process counts; I just don't have results for this particular, exemplary set of parameters.
Hmm, I don't like this. It's something we didn't notice in the runs for the Nf=2+1+1 ensemble, and we use the same package setup.
Something you could try, but I don't know if it will work, is to link the master branch of tmLQCD against the DDalphaAMG_nd branch of DDalphaAMG. That way we can check whether the problem is in the interface or in the solver.
I will check the changes I made and try to come up with some ideas.
Is it because I haven't specified MGNumberOfShifts = 4?
Something you could try, but I don't know if it will work, is to link the master branch of tmLQCD against the DDalphaAMG_nd branch of DDalphaAMG. That way we can check whether the problem is in the interface or in the solver.
Will test this out.
It seems that the problem is in DDalphaAMG, rather than the interface. Using the master branch of Finkenrath/tmLQCD together with the TM2p1p1 branch of sbacchio/DDalphaAMG has the same problems as described above:
Problematic:
depth: 1, iter: 1, p->H(1,0) = +0.009670+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009650+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009739+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.073741 seconds
depth: 1, time spent for setting up next coarser operator: 0.042813 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.048123 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036359 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.130586 seconds
performed 5 iterative setup steps
elapsed time: 13.709875 seconds (2.341967 seconds on coarse grid)
DDalphaAMG setup ran, time 15.94 sec (14.69 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
+----------------------------------------------------------+
depth: 1, iter: 1, p->H(1,0) = +0.008553+0.000000i
| approx. rel. res. after 1 iterations: 2.693876e-02 |
| approx. rel. res. after 2 iterations: 9.422674e-03 |
| approx. rel. res. after 3 iterations: 3.136621e-03 |
| approx. rel. res. after 4 iterations: 1.244779e-03 |
| approx. rel. res. after 5 iterations: 4.886695e-04 |
| approx. rel. res. after 6 iterations: 1.909823e-04 |
| approx. rel. res. after 7 iterations: 7.708101e-05 |
| approx. rel. res. after 8 iterations: 3.028029e-05 |
| approx. rel. res. after 9 iterations: 1.209484e-05 |
| approx. rel. res. after 10 iterations: 4.876731e-06 |
| approx. rel. res. after 11 iterations: 1.936528e-06 |
| approx. rel. res. after 12 iterations: 7.807262e-07 |
| approx. rel. res. after 13 iterations: 3.124696e-07 |
| approx. rel. res. after 14 iterations: 1.238244e-07 |
| approx. rel. res. after 15 iterations: 4.957753e-08 |
| approx. rel. res. after 16 iterations: 1.986782e-08 |
| approx. rel. res. after 17 iterations: 7.987017e-09 |
| approx. rel. res. after 18 iterations: 3.190570e-09 |
| approx. rel. res. after 19 iterations: 1.264548e-09 |
| approx. rel. res. after 20 iterations: 5.055527e-10 |
| approx. rel. res. after 21 iterations: 2.021383e-10 |
| approx. rel. res. after 22 iterations: 8.120851e-11 |
| approx. rel. res. after 23 iterations: 3.276034e-11 |
| approx. rel. res. after 24 iterations: 1.314241e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 24 coarse average: 3.96 |
| exact relative residual: ||r||/||b|| = 1.314241e-11 |
| elapsed wall clock time: 10.9579 seconds |
| coarse grid time: 6.8740 seconds (62.7%) |
| consumed core minutes*: 4.68e+01 (solve only) |
| max used mem/MPIproc: 1.93e-01 GB |
+----------------------------------------------------------+
Unproblematic (master + master):
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.039350 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036769 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.034938 seconds
performed 5 iterative setup steps
elapsed time: 26.075408 seconds (12.331877 seconds on coarse grid)
DDalphaAMG setup ran, time 28.19 sec (43.75 % on coarse grid)
depth: 0, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 1, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 2, updating mu to 0.000000 on even sites and 0.000000 on odd sites
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 2.981504e-02 |
| approx. rel. res. after 2 iterations: 8.115737e-03 |
| approx. rel. res. after 3 iterations: 1.613163e-03 |
| approx. rel. res. after 4 iterations: 3.403916e-04 |
| approx. rel. res. after 5 iterations: 6.901793e-05 |
| approx. rel. res. after 6 iterations: 1.509629e-05 |
| approx. rel. res. after 7 iterations: 3.174391e-06 |
| approx. rel. res. after 8 iterations: 6.519720e-07 |
| approx. rel. res. after 9 iterations: 1.452323e-07 |
| approx. rel. res. after 10 iterations: 3.097001e-08 |
| approx. rel. res. after 11 iterations: 6.925372e-09 |
| approx. rel. res. after 12 iterations: 1.462020e-09 |
| approx. rel. res. after 13 iterations: 3.030030e-10 |
| approx. rel. res. after 14 iterations: 6.678557e-11 |
| approx. rel. res. after 15 iterations: 1.420444e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 15 coarse average: 16.67 |
| exact relative residual: ||r||/||b|| = 1.420444e-11 |
| elapsed wall clock time: 1.6075 seconds |
| coarse grid time: 0.5843 seconds (36.3%) |
| consumed core minutes*: 6.86e+00 (solve only) |
| max used mem/MPIproc: 1.29e-01 GB |
+----------------------------------------------------------+
@sunpho84 This could be the reason why your test simulation on Marconi A2 was even slower than expected and why inversions were not converging if done outside of the HMC. If I remember correctly, we set up the TM2p1p1 branch of DDalphaAMG as well as the DDalphaAMG_nd branch of tmLQCD, correct?
Yes, I was using your suggestion, that is:
https://github.com/Finkenrath/tmLQCD/tree/DDalphaAMG_nd
linked against
https://github.com/sbacchio/DDalphaAMG/commits/TM2p1p1
OK, I will start working on this today. My guess is that I broke the e/o preconditioning for the smoother when an odd-sized block is used. The point is that everything works fine in our runs and I've never noticed convergence issues, so the problem should be in some "special" case that I didn't check.
@kostrzewa To confirm this, could you please try to run with an even-sized block, like 4 3 3 3?
Thanks!
Would 6x4x4x4 be okay too?
sorry, I meant 6x3x3x3
Yes, that should be fine! And then maybe we should try to turn off the e/o and then the SSE.
You turn off the e/o by changing line 989 of init.c in DDalphaAMG; the SSE you turn off from the Makefile.
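Roughly, the edit amounts to this sketch (the exact code at that line may differ between versions; the flag is the g.odd_even global that shows up in the assertion further below):

```c
/* init.c (DDalphaAMG), around the line mentioned above: hard-disable the
   odd-even preconditioning instead of taking it from the input parameters */
g.odd_even = 0;
```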
So with 6x3x3x3 I get the same p->H(1,0) messages which I had not seen before. Turning off only the e/o preconditioning then fails at startup:
warning: The SSE implementation is based on the odd-even preconditioned code.
Switch on odd-even preconditioning in the input file.
error: assertion "g.odd_even" failed (build/gsrc/init.c:1092)
bad choice of input parameters (please read the user manual in /doc).
So I need to disable both SSE and e/o.
And that fails to compile:
build/gsrc/coarse_operator_float.c(47): error: identifier "SIMD_LENGTH_float" is undefined
int column_offset = 2*SIMD_LENGTH_float*((l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
^
build/gsrc/coarse_operator_float.c(55): error: identifier "SIMD_LENGTH_float" is undefined
int column_offset = SIMD_LENGTH_float*((2*l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
^
Trying a clean build.
Nope.
@sunpho84 if you're still interested in the A40.40 run (or was it A30.40?), you can try the master branch of sbacchio/DDalphaAMG together with the master branch of Finkenrath/tmLQCD. It might be that this works better. (we also had an odd kind of blocking, correct?)
Ah right, of course! I forgot about that: the SSE is based on the e/o. Removing both should work: e/o = 0 and a Makefile without -DSSE in OPT_VERSION_FLAGS. Since you are editing the Makefile anyway, could you please also enable -DDEBUG in OPT_VERSION_FLAGS?
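A sketch of the Makefile edit (the surrounding flags are elided here, since they depend on the build setup):

```diff
 # DDalphaAMG Makefile: build without the SSE kernels, with debug output
-OPT_VERSION_FLAGS = ... -DSSE ...
+OPT_VERSION_FLAGS = ... -DDEBUG ...
```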
I'm really sorry to make you try things, but I've not been able to replicate your problem.
I tried to disable both e/o and SSE; the result is that SIMD_LENGTH_float is undefined...
@sunpho84 if you're still interested in the A40.40 run (or was it A30.40?), you can try the master branch of sbacchio/DDalphaAMG together with the master branch of Finkenrath/tmLQCD. It might be that this works better. (we also had an odd kind of blocking, correct?)
I thought that TM2p1p1 was the correct one for Nf=2+1+1?
Well, yes, but if you don't run with DDalphaAMG in the heavy sector, then you don't need the extra stuff.
@sbacchio Okay, I think I might have to give up for now. I think there might be a compiler issue on the machine that I was trying this on.
@sunpho84 On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.
@sbacchio So you tried to reproduce this on a 24c48 lattice with the 3x3x3x3 aggregation? If you can't reproduce it, then the problem is probably on my side. There are some odd things going on on the machine that I've been using. If I get a chance, I'll compile with GCC to see whether that works.
@sunpho84 On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.
Yes, in the old logs; see e.g. /marconi_work/INF17_lqcd123_0/sanfo/hmcnf2p1p1/A40.40/logs/log_mg_1490524967
Then I tried a few variations of the settings (following some of @sbacchio's suggestions) and this warning disappeared; see the subsequent logs in the logs/ folder.
Sorry, yesterday I had to leave early.
So I've now pushed a version which can be compiled without SSE and which contains a possible bug fix. I'm trying to compare the two versions, but I made so many changes that it is hard to find the right place.
@sunpho84 Can you remind me what the differences are between the runs before and after p->H(1,0) appeared?
@kostrzewa I didn't have exactly that configuration, but trying with what I have, I've not been able to reproduce the p->H(1,0) warning.
It looks to me as if it happens on a random basis. Here is a sample:
+----------------------------------------------------------+
| 2-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.937588 |
| csw: +0.000000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 80 40 40 40 |
| local lattice: 4 10 10 10 |
| block lattice: 4 5 5 5 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 20 8 8 8 |
| local lattice: 1 2 2 2 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +nan+0.000000i
[...]
OK, I can confirm that the construction of the coarse operator is broken when an odd size is used in the fastest-running index.
There are two workarounds at the moment:
- either use a block size which is even in X,
- or comment out the following lines in the file vectorization_control.h (see the sketch after this list):
#define INTERPOLATION_OPERATOR_LAYOUT_OPTIMIZED_float
#define INTERPOLATION_SETUP_LAYOUT_OPTIMIZED_float
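After that edit, the relevant part of vectorization_control.h would simply read (a sketch; the rest of the header stays untouched):

```c
/* disable the optimized interpolation layouts; these are what break when
   the fastest-running block dimension is odd */
/* #define INTERPOLATION_OPERATOR_LAYOUT_OPTIMIZED_float */
/* #define INTERPOLATION_SETUP_LAYOUT_OPTIMIZED_float */
```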
I hope to solve it today!
It should be fixed.
@kostrzewa can you check if now it works? :)
@sbacchio I'm checking this now, thanks!