Question: best way to achieve OpenMP parallelization
I am considering using HYPRE as a replacement for the in-house linear solver we currently have in our commercial CFD code FLACS-CFD. The reason for the change is that we would like to implement AMR in the solver, and SSTRUCT seems perfect for that purpose. With our software we target desktop PCs and single nodes on clusters, so we have never needed to implement MPI; OpenMP parallelization is sufficient at that scale. I noticed that the user manual states that most of the solvers lack OpenMP parallelization, and indeed I haven't found any omp pragmas in the implementation of my solver of interest - BiCGSTAB - or the preconditioners. Therefore I would like to ask you for advice:
- Do you plan, or have you already started, work on OpenMP parallelization on some branch? Is there a project I could join?
- Can you recommend how to quickly integrate the MPI version of HYPRE with the OpenMP/serial implementation of the rest of the CFD code, assuming we currently do not have any domain partitioning? (See the sketch after this list for what we have in mind.)
- Would using an external OpenMP-parallel linear algebra package together with HYPRE be a solution?
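To make the second question concrete, here is a minimal sketch of the kind of integration we have in mind (hypothetical code, not our actual solver interface): the application stays serial/OpenMP, initializes MPI with a single rank, and hands everything to hypre on MPI_COMM_WORLD, relying on hypre's OpenMP threading within that one rank.

/* Hypothetical sketch: use the MPI build of hypre from an otherwise
 * serial/OpenMP application, with all data on a single MPI rank. */
#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_krylov.h"
#include "HYPRE_parcsr_ls.h"

/* Illustrative helper; A, b, x would come from our 7-diagonal assembly. */
void solve_with_hypre(HYPRE_ParCSRMatrix A, HYPRE_ParVector b, HYPRE_ParVector x)
{
   HYPRE_Solver solver;

   /* MPI_COMM_WORLD has size 1, so all rows live on this rank;
    * OpenMP threads (OMP_NUM_THREADS) do the intra-rank work. */
   HYPRE_ParCSRBiCGSTABCreate(MPI_COMM_WORLD, &solver);
   HYPRE_BiCGSTABSetMaxIter(solver, 1000);
   HYPRE_BiCGSTABSetTol(solver, 1e-8);

   HYPRE_ParCSRBiCGSTABSetup(solver, A, b, x);
   HYPRE_ParCSRBiCGSTABSolve(solver, A, b, x);
   HYPRE_ParCSRBiCGSTABDestroy(solver);
}

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);   /* single rank; no domain partitioning on our side */
   HYPRE_Init();             /* recommended in recent hypre versions */

   /* ... build A, b, x through the IJ or SStruct interface and call the solver ... */

   HYPRE_Finalize();
   MPI_Finalize();
   return 0;
}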
Hi @mfolusiak. We do support OpenMP parallelism in hypre, and I think it's available for most of the solvers. Can you point to where we say this in the manual?
https://hypre.readthedocs.io/en/latest/
Can you tell us more about the in-house solver you use? Thanks!
Hi @rfalgout, I found it on the intro page: https://github.com/hypre-space/hypre/blob/master/src/docs/usr-manual/ch-intro.rst#installing-hypre
Configuration of hypre with threads requires an implementation of OpenMP. Currently, only a subset of hypre is threaded.
Compiling with HYPRE_WITH_OPENMP=ON didn't seem to have an effect on performance in my initial tests, so I assumed my solver was not parallelized and asked this question. In the meantime I found today that the parallelization is probably realized through some abstraction layer called Kokkos. Are any additional libraries or configuration needed for it to work? I think this is worth mentioning in the manual. The OpenMP parallelization is a great asset.
The in-house linear solver we are using now in our software is a block-structured BiCGSTAB with an ILU preconditioner. We use a 7-diagonal format to specify the A matrix. Thank you for your help!
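As a quick sanity check on my side (a minimal sketch, unrelated to hypre itself), I am verifying that the application actually sees the requested number of OpenMP threads before timing the solver; hypre itself only threads internally if it was configured with HYPRE_WITH_OPENMP=ON, with the thread count controlled by OMP_NUM_THREADS at run time.

/* Minimal sanity-check sketch: confirm OpenMP is active in the application. */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
   /* Controlled by the OMP_NUM_THREADS environment variable. */
   printf("OpenMP enabled, max threads = %d\n", omp_get_max_threads());
#else
   printf("compiled without OpenMP support\n");
#endif
   return 0;
}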
Hi @mfolusiak. BiCGSTAB should work fine with OpenMP (Kokkos isn't required to use OpenMP in hypre). ILU may be a problem. Maybe @liruipeng or @oseikuffuor1 can comment on that.
Thanks @rfalgout - one follow-up question: is this supported on Windows as well? I ask because the stable manual states that OpenMP standard >= 4.5 is required, whereas we compile our application with MS VS 2019, and I believe Microsoft is not quite there yet; it isn't entirely clear, but it looks like they support OpenMP standard 3.0, or actually even only 2.0.
https://devblogs.microsoft.com/cppblog/improved-openmp-support-for-cpp-in-visual-studio/#:~:text=Microsoft%20Visual%20Studio%20has%20supported,in%20the%20OpenMP%204.0%20standard.
I read elsewhere that some options include building with Clang for C++ (which now ships with MS VS) or building with MinGW. Suppose we do the latter (build/install HYPRE with MinGW): can we still use the OpenMP directives in our application and build it with VS (given that it uses the older standard)? Will that work? I would appreciate any comment on that, and in general on what the recommended option for Windows is, if any.
Hi @sgthomas-github. OpenMP standard 3.0 will work fine for CPU code. The 4.5 standard is one approach for running on GPUs (Struct interface only), but that is not the recommended way to run on them anyway. Hope this helps!
Thanks @rfalgout et al. I tested the latest hypre-2.24.0, configured/compiled with and without OpenMP enabled respectively (our application is C++, built using msvc42 on VS2019). Both installs work OK and match expected results on small controlled models, but they fail on a complex, extremely large (~53 million cells), extremely heterogeneous model (stiffness-matrix diagonal entries contrast as high as 1.0e13): both reach the maximum of 1000 iterations, and the OpenMP install also reports NaNs and extremely large residuals, while the serial install simply maxes out the iterations and never gets below the tolerance of 1.0e-08 (though it at least remains close - slightly higher). However, the older version I had, 2.11.1 serial (without OpenMP), runs OK and converges in about 25 iterations on the complex model.
I am wondering what could cause the change in behavior from the older version. I have been using BoomerAMG as the solver all along, but I noted that some of its defaults have changed since the prior version (e.g. parallel coarsening strategy, interpolation, and relaxation order). I am not sure how critical my choices were (6 for parallel coarsening, relax type set to 3, nothing set for relax order or interpolation - so they should default, I guess; num sweeps set to 1, max levels set to 20) or whether they are contributing to the failure to converge - I am going to test next with the recommended defaults for 3-D. It is also unclear what specific options, if any, I should use when testing with OpenMP, or are they the same as documented? Finally, are there any solver/preconditioner recommendations beyond BoomerAMG for extremely large, extremely heterogeneous 3-D elliptic (diffusion) models? Thank you.
Hi @sgthomas-github. I'm pretty sure the default BoomerAMG parameters have indeed changed since 2.11.1, so that's definitely the first thing you should try to match. The OpenMP behavior will be different, but hopefully there are parameter choices that can also be made to work well for your problems. Once you've done more testing, send us the output generated by setting the print level to 1 or higher (call HYPRE_BoomerAMGSetPrintLevel()). The output will have information like this:
BoomerAMG SETUP PARAMETERS:
Max levels = 25
Num levels = 5
Strength Threshold = 0.250000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 1.000000
Coarsening Type = HMIS
measures are determined locally
No global partition option chosen.
Interpolation = extended+i interpolation
Operator Matrix Information:
nonzero entries/row row sums
lev rows entries sparse min max avg min max
======================================================================
0 1000 6400 0.006 4 7 6.4 0.000e+00 3.000e+00
1 500 7248 0.029 7 17 14.5 0.000e+00 4.000e+00
2 99 3003 0.306 15 43 30.3 1.041e-02 5.319e+00
3 14 188 0.959 11 14 13.4 5.274e+00 1.007e+01
4 4 16 1.000 4 4 4.0 7.597e+00 9.196e+00
Interpolation Matrix Information:
entries/row min max row sums
lev rows x cols min max avgW weight weight min max
================================================================================
0 1000 x 500 1 4 4.0 1.667e-01 2.500e-01 5.000e-01 1.000e+00
1 500 x 99 1 4 4.0 1.301e-02 3.547e-01 2.164e-01 1.000e+00
2 99 x 14 1 4 4.0 1.247e-03 3.928e-01 2.865e-02 1.000e+00
3 14 x 4 1 4 3.6 -6.320e-02 6.629e-02 -6.121e-02 1.000e+00
Complexity: grid = 1.617000
operator = 2.633594
memory = 3.350625
And similar information for the solver parameters. This will help us to figure out how best to help you. Thanks!
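If it helps, here is a rough sketch of where that call goes in the setup (the solver handle and matrix/vector names are just placeholders):

#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"

/* Sketch: enable BoomerAMG diagnostics before Setup/Solve (names are placeholders). */
void setup_amg_with_diagnostics(HYPRE_ParCSRMatrix A, HYPRE_ParVector b, HYPRE_ParVector x)
{
   HYPRE_Solver amg;
   HYPRE_BoomerAMGCreate(&amg);

   HYPRE_BoomerAMGSetPrintLevel(amg, 3);  /* 1 = setup info, 2 = solve info, 3 = both */
   HYPRE_BoomerAMGSetMaxIter(amg, 1000);
   HYPRE_BoomerAMGSetTol(amg, 1e-8);

   HYPRE_BoomerAMGSetup(amg, A, b, x);
   HYPRE_BoomerAMGSolve(amg, A, b, x);
   HYPRE_BoomerAMGDestroy(amg);
}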
Thanks @rfalgout et al, an update FYI:
After I used the recommended defaults for 3-D elliptic (diffusion) problems, the BoomerAMG solver is now converging. Basically I used method 10 (HMIS) for the coarsening type, method 6 (extended+i) for the interpolation type, and a truncation factor of 5 (perhaps it could be reduced to 4?) - I don't know why it reports as 0, though - set via the HYPRE_BoomerAMGSetTruncFactor API. For smoothing I used 13 for the down cycle, 14 for the up cycle, and 9 for the coarsest level via the HYPRE_BoomerAMGSetCycleRelaxType API, plus a strength threshold of 0.5 (I was using 0.25 before). It wasn't clear to me how/whether to set smoothing on the fine grid - I only set it on the down cycle, up cycle, and coarsest level. Here is a copy-paste of the AMG solver parameter output with my settings for three different solves:
I-dir solve:
BoomerAMG SETUP PARAMETERS:
Max levels = 20
Num levels = 14
Strength Threshold = 0.500000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000
Coarsening Type = HMIS
measures are determined locally
No global partition option chosen.
Interpolation = extended+i interpolation
Operator Matrix Information:
nonzero entries/row row sums
lev rows entries sparse min max avg min max
========================================================================
0 52585034 367001350 0.000 3 16 7.0 -3.789e-03 2.366e-12
1 26055591 381547697 0.000 3 48 14.6 -6.679e-03 2.453e-12
2 12620819 205386377 0.000 3 88 16.3 -1.030e-02 4.098e-12
3 5789306 173898730 0.000 3 189 30.0 -2.039e-02 1.219e-05
4 2406096 123201954 0.000 4 313 51.2 -6.031e-02 8.485e-07
5 664071 57692083 0.000 5 407 86.9 -1.572e-01 2.372e-04
6 159206 19741416 0.001 5 463 124.0 -4.319e-01 4.925e-02
7 36098 5190622 0.004 11 518 143.8 -6.580e-01 4.754e-02
8 7671 910105 0.015 17 380 118.6 -1.049e+00 1.970e-01
9 1507 114827 0.051 19 213 76.2 -1.090e+00 1.570e+00
10 331 17379 0.159 13 112 52.5 -1.963e+03 1.188e+03
11 78 2622 0.431 10 50 33.6 -1.388e+00 4.481e-03
12 24 420 0.729 11 23 17.5 -1.573e+00 2.681e-06
13 6 34 0.944 5 6 5.7 -1.757e+00 -7.207e-02
Interpolation Matrix Information:
entries/row min max row sums
lev rows x cols min max avgW weight weight min max
======================================================================================
0 52585034 x 26055591 1 4 2.0 2.763e-02 1.000e+00 8.121e-01 1.000e+00
1 26055591 x 12620819 1 4 2.0 2.129e-02 1.000e+00 4.826e-01 1.000e+00
2 12620819 x 5789306 1 4 3.3 -5.985e-01 1.017e+00 3.801e-01 1.000e+00
3 5789306 x 2406096 1 4 3.1 -5.164e+00 6.708e+00 8.982e-02 1.001e+00
4 2406096 x 664071 1 4 3.7 -9.342e+01 1.018e+02 4.476e-02 1.000e+00
5 664071 x 159206 1 4 3.6 -2.540e+02 2.067e+02 -7.027e-02 1.028e+00
6 159206 x 36098 0 4 3.3 -6.946e+03 3.366e+03 -4.769e-01 1.459e+00
7 36098 x 7671 0 4 3.1 -1.159e+03 1.264e+03 -2.067e+00 2.616e+00
8 7671 x 1507 0 4 2.6 -6.005e+01 1.144e+02 -1.383e-01 1.292e+00
9 1507 x 331 1 4 2.5 -5.542e+02 9.167e+02 7.942e-02 7.183e+00
10 331 x 78 0 4 2.3 -8.998e-01 1.049e+00 -4.381e-01 1.049e+00
11 78 x 24 1 4 2.5 4.275e-02 1.000e+00 2.690e-01 1.003e+00
12 24 x 6 1 3 1.9 1.276e-01 1.000e+00 4.619e-01 1.000e+00
Complexity: grid = 1.907878
operator = 3.636787
memory = 4.098985
BoomerAMG SOLVER PARAMETERS:
Maximum number of cycles: 1000
Stopping Tolerance: 1.000000e-08
Cycle type (1 = V, 2 = W, etc.): 1
Relaxation Parameters:
Visiting Grid: down up coarse
Number of sweeps: 1 1 1
Type 0=Jac, 3=hGS, 6=hSGS, 9=GE: 13 14 9
Point types, partial sweeps (1=C, -1=F):
Pre-CG relaxation (down): 0
Post-CG relaxation (up): 0
Coarsest grid: 0
J-dir solve:
BoomerAMG SETUP PARAMETERS:
Max levels = 20
Num levels = 14
Strength Threshold = 0.500000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000
Coarsening Type = HMIS
measures are determined locally
No global partition option chosen.
Interpolation = extended+i interpolation
Operator Matrix Information:
nonzero entries/row row sums
lev rows entries sparse min max avg min max
========================================================================
0 52585034 367001350 0.000 3 16 7.0 -3.791e-03 2.366e-12
1 26055591 381547697 0.000 3 48 14.6 -6.658e-03 2.453e-12
2 12620743 205369991 0.000 3 88 16.3 -1.085e-02 4.098e-12
3 5788682 173976184 0.000 3 189 30.1 -1.857e-02 2.057e-05
4 2401772 122503076 0.000 4 313 51.0 -1.935e-02 5.924e-09
5 664250 58033856 0.000 5 410 87.4 -3.459e-02 4.735e-04
6 160808 19990486 0.001 5 465 124.3 -1.339e+00 5.667e-01
7 36649 5330397 0.004 12 471 145.4 -3.959e-01 1.706e-01
8 7825 947915 0.015 15 390 121.1 -2.720e-01 9.136e-02
9 1566 121970 0.050 13 203 77.9 -7.766e+00 2.506e+00
10 324 16592 0.158 12 111 51.2 -3.334e+02 2.209e+02
11 77 2475 0.417 6 58 32.1 -3.979e+00 3.089e-01
12 25 407 0.651 5 22 16.3 -9.450e-01 -2.946e-02
13 6 36 1.000 6 6 6.0 -1.776e+00 -5.459e-01
Interpolation Matrix Information:
entries/row min max row sums
lev rows x cols min max avgW weight weight min max
======================================================================================
0 52585034 x 26055591 1 4 2.0 2.763e-02 1.000e+00 8.121e-01 1.000e+00
1 26055591 x 12620743 1 4 2.0 2.129e-02 1.000e+00 4.860e-01 1.000e+00
2 12620743 x 5788682 1 4 3.3 -5.985e-01 1.017e+00 3.800e-01 1.000e+00
3 5788682 x 2401772 1 4 3.1 -1.377e+00 1.890e+00 7.853e-02 1.000e+00
4 2401772 x 664250 1 4 3.7 -9.342e+01 1.018e+02 4.200e-02 1.000e+00
5 664250 x 160808 1 4 3.6 -2.375e+02 1.979e+02 -2.010e-01 1.039e+00
6 160808 x 36649 0 4 3.4 -4.545e+02 2.663e+02 -1.412e+00 1.271e+00
7 36649 x 7825 0 4 3.1 -1.277e+02 1.508e+02 -5.011e-01 2.128e+00
8 7825 x 1566 0 4 2.7 -6.585e+01 5.626e+01 -4.549e+00 2.906e+00
9 1566 x 324 0 4 2.5 -3.428e+01 5.248e+01 0.000e+00 9.914e+00
10 324 x 77 0 4 2.2 2.997e-02 1.500e+00 0.000e+00 1.500e+00
11 77 x 25 1 4 2.5 3.998e-02 9.814e-01 1.537e-01 1.008e+00
12 25 x 6 1 3 1.9 1.000e-01 9.409e-01 3.100e-01 1.000e+00
Complexity: grid = 1.907831
operator = 3.637159
memory = 4.099335
BoomerAMG SOLVER PARAMETERS:
Maximum number of cycles: 1000
Stopping Tolerance: 1.000000e-08
Cycle type (1 = V, 2 = W, etc.): 1
Relaxation Parameters:
Visiting Grid: down up coarse
Number of sweeps: 1 1 1
Type 0=Jac, 3=hGS, 6=hSGS, 9=GE: 13 14 9
Point types, partial sweeps (1=C, -1=F):
Pre-CG relaxation (down): 0
Post-CG relaxation (up): 0
Coarsest grid: 0
K-dir solve:
BoomerAMG SETUP PARAMETERS:
Max levels = 20
Num levels = 14
Strength Threshold = 0.500000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000
Coarsening Type = HMIS
measures are determined locally
No global partition option chosen.
Interpolation = extended+i interpolation
Operator Matrix Information:
nonzero entries/row row sums
lev rows entries sparse min max avg min max
========================================================================
0 52585034 367001350 0.000 3 16 7.0 -2.885e+02 2.366e-12
1 26055591 381547697 0.000 3 48 14.6 -3.388e+02 2.453e-12
2 12619201 203290507 0.000 3 88 16.1 -2.963e+02 4.098e-12
3 5768656 172486924 0.000 3 189 29.9 -3.246e+02 1.946e-05
4 2401754 122255830 0.000 4 313 50.9 -1.630e+02 5.434e-04
5 668852 57440306 0.000 5 422 85.9 -1.201e+02 2.037e-01
6 162105 19646451 0.001 6 489 121.2 -5.266e+01 2.527e+00
7 36493 5102049 0.004 9 526 139.8 -9.439e+01 4.897e+01
8 7616 842690 0.015 11 324 110.6 -2.985e+04 6.710e+03
9 1458 101876 0.048 16 173 69.9 -7.759e+04 5.186e+01
10 309 13349 0.140 8 86 43.2 -4.424e+00 3.433e-01
11 80 2140 0.334 8 51 26.8 -2.236e+00 -7.240e-02
12 22 280 0.579 7 17 12.7 -2.306e+00 -6.305e-01
13 8 48 0.750 4 8 6.0 -2.568e+00 -1.386e+00
Interpolation Matrix Information:
entries/row min max row sums
lev rows x cols min max avgW weight weight min max
======================================================================================
0 52585034 x 26055591 1 4 2.0 2.763e-02 1.000e+00 3.332e-01 1.000e+00
1 26055591 x 12619201 1 4 2.0 2.129e-02 1.000e+00 1.542e-01 1.000e+00
2 12619201 x 5768656 1 4 3.3 -6.033e-01 1.017e+00 7.873e-02 1.000e+00
3 5768656 x 2401754 0 4 3.1 -5.164e+00 6.708e+00 0.000e+00 1.003e+00
4 2401754 x 668852 0 4 3.7 -4.201e+01 3.780e+01 0.000e+00 1.393e+00
5 668852 x 162105 0 4 3.6 -6.293e+02 1.831e+03 -1.491e+00 6.027e+00
6 162105 x 36493 0 4 3.3 -6.248e+02 2.604e+02 -4.586e+01 6.053e+00
7 36493 x 7616 0 4 3.0 -4.818e+02 1.221e+03 -4.547e+00 5.583e+02
8 7616 x 1458 0 4 2.6 -6.254e+01 6.791e+01 -1.454e+00 6.791e+01
9 1458 x 309 0 4 2.3 -2.727e+00 1.390e+00 -1.818e+00 1.446e+00
10 309 x 80 0 4 2.2 -2.398e-01 9.062e-01 -1.310e-01 1.000e+00
11 80 x 22 0 4 1.6 2.241e-02 6.748e-01 0.000e+00 1.000e+00
12 22 x 8 0 3 1.3 4.289e-02 3.874e-01 0.000e+00 1.000e+00
Complexity: grid = 1.907523
operator = 3.623233
memory = 4.085052
BoomerAMG SOLVER PARAMETERS:
Maximum number of cycles: 1000
Stopping Tolerance: 1.000000e-08
Cycle type (1 = V, 2 = W, etc.): 1
Relaxation Parameters:
Visiting Grid: down up coarse
Number of sweeps: 1 1 1
Type 0=Jac, 3=hGS, 6=hSGS, 9=GE: 13 14 9
Point types, partial sweeps (1=C, -1=F):
Pre-CG relaxation (down): 0
Post-CG relaxation (up): 0
Coarsest grid: 0
However, I noted that it is taking 61, 79, and 54 iterations, respectively, for the above solves, whereas 2.11.1 with my earlier settings (as mentioned in my previous post) takes about 25 iterations on average and is slightly shorter in total time. From reviewing the logs above, is there anything that stands out to you that I am still not doing or am missing that could potentially reduce the number of iterations or the solver time further (the inputs to both are exactly identical)?
Next I plan to test with OpenMP using the above settings.
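For completeness, this is roughly how I am setting the parameters described above (just a sketch; my actual handle names differ):

#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"

/* Sketch of my current parameter choices (handle name is illustrative). */
void configure_amg(HYPRE_Solver amg)
{
   HYPRE_BoomerAMGSetCoarsenType(amg, 10);        /* HMIS coarsening */
   HYPRE_BoomerAMGSetInterpType(amg, 6);          /* extended+i interpolation */
   HYPRE_BoomerAMGSetStrongThreshold(amg, 0.5);   /* was 0.25 before */
   HYPRE_BoomerAMGSetTruncFactor(amg, 5.0);       /* reports as 0 in the output above */
   HYPRE_BoomerAMGSetCycleRelaxType(amg, 13, 1);  /* down cycle */
   HYPRE_BoomerAMGSetCycleRelaxType(amg, 14, 2);  /* up cycle */
   HYPRE_BoomerAMGSetCycleRelaxType(amg, 9, 3);   /* coarsest level */
   HYPRE_BoomerAMGSetNumSweeps(amg, 1);
   HYPRE_BoomerAMGSetMaxLevels(amg, 20);
}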
The trunc factor should be a number between 0 and 1, so 5 would be set to 0. I think you actually wanted to set PMaxElmts, which sets the maximum number of nonzeros per row of the interpolation matrix. There, 4 or 5 is a good number, but 4 is the default and was clearly used in your run. A strength threshold of 0.25 should be fine for use with HMIS. The slowdown in iterations is concerning, but I am not sure what you set before or how long it actually took. The previous default settings generally lead to larger complexities with faster convergence but slower iteration times. It looks like you have some nasty interpolation weights. You could try setting InterpType to 17 or 18, which gives you a newer, possibly better formulation of the interpolation operator, and see if that improves convergence.
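In code, the suggested changes would look something like this (a sketch; the handle name is illustrative):

#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"

/* Sketch of the suggested changes (handle name is illustrative). */
void apply_suggested_changes(HYPRE_Solver amg)
{
   HYPRE_BoomerAMGSetTruncFactor(amg, 0.0);  /* truncation factor must lie in [0,1); 5 is reset to 0 */
   HYPRE_BoomerAMGSetPMaxElmts(amg, 4);      /* max interpolation entries per row; 4 is the default, 5 also reasonable */
   HYPRE_BoomerAMGSetInterpType(amg, 17);    /* or 18: newer interpolation formulations worth trying */
}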
Thanks @ulrikeyang et al for that correction. Also, on closer examination, although the 2.11.1 install took fewer iterations for each solve, it actually took longer in solver time for the I- and J-direction solves and only slightly less for the K-direction solve. The solver output for a typical solve (e.g. I-dir) in the 2.11.1 test looks like this:
BoomerAMG SETUP PARAMETERS:
Max levels = 20
Num levels = 20
Strength Threshold = 0.250000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000
Coarsening Type = Falgout-CLJP
measures are determined locally
Interpolation = modified classical interpolation
Operator Matrix Information:
nonzero entries per row row sums
lev rows entries sparse min max avg min max
===================================================================
0 52585034 367001350 0.000 3 16 7.0 -3.789e-03 2.366e-12
1 26319841 380127341 0.000 3 46 14.4 -6.679e-03 2.714e-12
2 13447013 581127639 0.000 3 81 43.2 -1.040e-02 2.665e-12
3 5636584 419268200 0.000 4 212 74.4 -1.725e-02 4.347e-12
4 2562220 375204122 0.000 5 469 146.4 -3.162e-02 4.533e-12
5 1215332 326926178 0.000 6 1050 269.0 -6.512e-02 8.842e-05
6 571708 252852046 0.001 7 1616 442.3 -8.126e-02 1.264e-07
7 256760 165228956 0.003 9 2314 643.5 -1.261e-01 5.165e-12
8 111827 90826635 0.007 16 2744 812.2 -1.682e-01 5.541e-12
9 47253 40087457 0.018 15 2532 848.4 -3.317e-01 6.910e-12
10 19247 14491551 0.039 15 2189 752.9 -5.992e-01 8.990e-12
11 7925 5110295 0.081 30 1959 644.8 -9.279e-01 9.599e-12
12 3587 2258269 0.176 76 1851 629.6 -8.114e-01 1.004e-11
13 1766 1028498 0.330 52 1176 582.4 -7.884e-01 5.714e-12
14 892 386634 0.486 30 681 433.4 -4.166e-01 5.291e-05
15 393 99055 0.641 29 347 252.0 -4.724e-01 0.000e+00
16 169 22801 0.798 27 168 134.9 -9.272e-01 0.000e+00
17 60 3372 0.937 38 60 56.2 -1.028e+00 0.000e+00
18 20 400 1.000 20 20 20.0 -1.104e+00 0.000e+00
19 6 36 1.000 6 6 6.0 -4.855e-01 0.000e+00
Interpolation Matrix Information:
entries/row min max row sums
lev rows cols min max weight weight min max
=================================================================
0 52585034 x 26319841 1 11 4.460e-02 1.000e+00 8.430e-01 1.000e+00
1 26319841 x 13447013 1 12 3.337e-02 1.000e+00 6.404e-01 1.000e+00
2 13447013 x 5636584 1 16 2.127e-02 1.000e+00 4.350e-01 1.000e+00
3 5636584 x 2562220 1 26 7.079e-03 1.000e+00 1.037e-01 1.000e+00
4 2562220 x 1215332 1 35 4.025e-03 1.000e+00 5.754e-02 1.000e+00
5 1215332 x 571708 1 38 3.113e-03 1.000e+00 6.017e-02 1.010e+00
6 571708 x 256760 0 46 4.535e-03 1.000e+00 0.000e+00 1.000e+00
7 256760 x 111827 0 46 3.612e-03 1.000e+00 0.000e+00 1.000e+00
8 111827 x 47253 0 47 4.794e-03 1.000e+00 0.000e+00 1.000e+00
9 47253 x 19247 0 45 4.772e-03 1.000e+00 0.000e+00 1.000e+00
10 19247 x 7925 0 43 5.168e-03 1.000e+00 0.000e+00 1.000e+00
11 7925 x 3587 0 29 4.716e-03 1.000e+00 0.000e+00 1.000e+00
12 3587 x 1766 0 29 1.077e-02 1.000e+00 0.000e+00 1.000e+00
13 1766 x 892 1 22 1.889e-02 1.000e+00 1.080e-01 1.000e+00
14 892 x 393 1 13 3.265e-02 1.000e+00 5.206e-01 1.000e+00
15 393 x 169 1 14 2.649e-02 1.000e+00 1.151e-01 1.000e+00
16 169 x 60 1 7 2.438e-02 1.000e+00 9.336e-02 1.000e+00
17 60 x 20 1 4 7.697e-02 1.000e+00 4.691e-01 1.000e+00
18 20 x 6 1 3 2.029e-01 9.997e-01 2.029e-01 1.000e+00
Complexity: grid = 1.954694
operator = 8.234441
memory = 8.836133
BoomerAMG SOLVER PARAMETERS:
Maximum number of cycles: 1000
Stopping Tolerance: 1.000000e-08
Cycle type (1 = V, 2 = W, etc.): 1
Relaxation Parameters:
Visiting Grid: down up coarse
Number of sweeps: 1 1 1
Type 0=Jac, 3=hGS, 6=hSGS, 9=GE: 3 3 9
Point types, partial sweeps (1=C, -1=F):
Pre-CG relaxation (down): 1 -1
Post-CG relaxation (up): -1 1
Coarsest grid: 0
Anyway, I'll ignore that for now since 2.24.0 is faster 2 out of 3 times. However, I did one more test on 2.24.0, setting the strength threshold back to 0.25 (the default) and removing the incorrect truncation factor, to see if that recovers the older version's trends more closely, and I see a mixed trend (e.g. the I-dir solve took 81 iterations compared to 61 before and was ~500 s slower, but the J-dir solve took fewer iterations, at 40, and was ~500 s faster), so it's a mixed bag. It is hard to tell why the old settings work on 2.11.1 and not on the latest, whereas the new ones work on 2.24.0. I guess I'll keep the threshold at 0.5 since that is recommended for 3-D, before trying your suggestion of interpolation types 17 or 18!
Update: interpolation type 17 gave a slightly improved result over choice 6.
@rfalgout @ulrikeyang et al, another update: with the recommended settings for BoomerAMG and with OpenMP enabled, the solver was able to converge OK for all the solves above. In general I saw a ~2.5x speedup for the I/J-dir solves and a ~5x speedup for the K-dir solves with 16 threads. I am not sure whether that is along expected lines or not. I am planning to try with GPU enabled next, and I noted that some solver options are not supported on GPUs. One question: can we generally expect a greater speedup from GPU multithreading than from the CPU (OpenMP)? Thank you.
GPU acceleration depends on the AMG parameters (one should use GPU-enabled algorithms throughout; see https://github.com/hypre-space/hypre/wiki/GPUs) and on the problem size per GPU (in general, the larger the better, as long as it fits in memory).
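Something along these lines, assuming a CUDA build (just a sketch; the wiki page above has the authoritative list of GPU-enabled options):

#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"

/* Sketch of typical device setup for a CUDA build of hypre. */
void gpu_setup_sketch(void)
{
   HYPRE_Init();                                  /* after MPI_Init, before other hypre calls */
   HYPRE_SetMemoryLocation(HYPRE_MEMORY_DEVICE);  /* keep matrices/vectors in GPU memory */
   HYPRE_SetExecutionPolicy(HYPRE_EXEC_DEVICE);   /* run setup/solve on the device */

   HYPRE_Solver amg;
   HYPRE_BoomerAMGCreate(&amg);
   HYPRE_BoomerAMGSetCoarsenType(amg, 8);         /* PMIS coarsening (GPU-enabled) */
   HYPRE_BoomerAMGSetRelaxType(amg, 18);          /* l1-Jacobi relaxation (GPU-enabled) */
   HYPRE_BoomerAMGSetInterpType(amg, 18);         /* one of the GPU-enabled interpolation variants */
   /* ... assemble A, b, x in device memory, then Setup/Solve as usual ... */
   HYPRE_BoomerAMGDestroy(amg);
}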
Thanks @liruipeng et al. When I try to configure with GPU support, I see the following CMake configuration warning on Windows 10:
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/bin/nvcc.exe
-- Looking for a CUDA host compiler - C:/Program Files (x86)/Microsoft Visual Studio/2019/Professional/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe
CMake Warning at C:/ProgramData/Miniconda3/Lib/site-packages/cmake/data/share/cmake-3.22/Modules/CMakeDetermineCUDACompiler.cmake:15 (message):
Visual Studio does not support specifying CUDAHOSTCXX or
CMAKE_CUDA_HOST_COMPILER. Using the C++ compiler provided by Visual
Studio.
Following that it reports that it detected the CUDA compiler and the CUDA toolkit:
-- The CUDA compiler identification is NVIDIA 11.6.124
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Enabled support for CUDA.
-- Using CUDA architecture: 70
-- Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6 (found version "11.6")
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/include (found version "11.6.124")
-- Configuring done
-- Generating done
-- Build files have been written to: E:/devl/hypre/hypre-2.24.0/src/cuda_build
Can the above warning be ignored? Can I still launch the solution with VS and build/install HYPRE as usual, or do I have to additionally modify some settings of the HYPRE project in the VS solution?
EDIT: I tried building from the VS solution, and it seems to be working as intended - it builds the relevant files of the project with the CUDA compiler. However, I did see errors complaining about some CUDA preprocessor directives in one of the files:
2>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(187): error : "#" not expected here
2>
2>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(187): error : expected an expression
2>
2>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(187): error : too many arguments in function call
2>
2>14 errors detected in the compilation of "E:/devl/hypre/hypre-2.24.0/src/seq_mv/csr_matvec_device.c".
Similarly for the next call:
1>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(183): error : "#" not expected here
1>
1>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(183): error : expected an expression
1>
1>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(183): error : too many arguments in function call
1>
1>Done building project "HYPRE.vcxproj".
1>7 errors detected in the compilation of "E:/devl/hypre/hypre-2.24.0/src/seq_mv/csr_matvec_device.c".
After deleting/commenting out the inactive lines of the directive in question, it seems to compile OK.
Then I tried building the test project ij and ran into this:
1>ij.c
1>E:\devl\hypre\hypre-2.24.0\src\test\ij.c(952): fatal error C1061: compiler limit: blocks nested too deeply
I got past those by commenting out some of the conditional checks, but running gives this error:
Running with these driver parameters:
solver ID = 0
CUDA ERROR (code = 35, CUDA driver version is insufficient for CUDA runtime version) at E:\devl\hypre\hypre-2.24.0\src\utilities\general.c:194