Issue on restart

Open PatriceFRey opened this issue 1 year ago • 3 comments

G'day all,

For quite a while now we have faced an issue with restarting some experiments (in 2D and 3D). Upon restart, the model kickstarts all right and may even progress to the next time step successfully before getting stuck. In the log below, the restart time was 16 Myr, and the model progressed to the next time step (16.1 Myr) with no problem before stalling. There is no error message because the code is still running, but there is no output after ~24 hours.

Until now, we have dealt with this by restarting the experiment on a different number of CPUs. In the example below, running on Gadi, we fixed the problem by increasing the number of CPUs from 48 to 96. This fix is OK in 2D, but in 3D it is incompatible with Badlands, which requires running the restart on the same number of CPUs.

I am curious to hear if other users have experienced the same issue.

Cheers

..... Step: 39 Model Time: 16.1 megayear dt: 3564.9 year (2022-07-29 21:59:27)

In SystemLinearEquations_NonLinearExecute

Non linear solver - iteration 0

Linear solver (DE0GNTZ5__system-execute)

BSSCR -- Block Stokes Schur Compliment Reduction Solver

----- K2_GMG ------ ----- K2_GMG ------

AUGMENTED LAGRANGIAN K2 METHOD - Penalty = 1000000.000000

    * K+p*K2 in time: 0.002640 seconds

Setting schur_pc to "gkgdiag"

SCR Solver Summary:

RHS V Solve: = 0.0172 secs / 1 its

Pressure Solve: = 0.01883 secs / 2 its

Final V Solve: = 0.007508 secs / 1 its

Total BSSCR Linear solve time: 0.878022 seconds

Linear solver (DE0GNTZ5__system-execute), solution time 8.789609e-01 (secs)

Non linear solver - iteration 1

Linear solver (DE0GNTZ5__system-execute)

BSSCR -- Block Stokes Schur Compliment Reduction Solver

----- K2_GMG ------

AUGMENTED LAGRANGIAN K2 METHOD - Penalty = 1000000.000000

    * K+p*K2 in time: 0.002669 seconds

Setting schur_pc to "gkgdiag"

SCR Solver Summary:

RHS V Solve: = 0.01736 secs / 1 its

Pressure Solve: = 0.01915 secs / 2 its

Final V Solve: = 0.007638 secs / 1 its

Total BSSCR Linear solve time: 0.869927 seconds

Linear solver (DE0GNTZ5__system-execute), solution time 8.708969e-01 (secs)

In func SystemLinearEquations_NonLinearExecute: Iteration 1 of 500 - Residual 0.003602 - Tolerance = 0.01

Non linear solver - Residual 3.60200414e-03; Tolerance 1.0000e-02 - Converged - 1.993106e+00 (secs)

Non linear solver - iteration 2

Linear solver (DE0GNTZ5__system-execute)

BSSCR -- Block Stokes Schur Compliment Reduction Solver

----- K2_GMG ------

AUGMENTED LAGRANGIAN K2 METHOD - Penalty = 1000000.000000

    * K+p*K2 in time: 0.002646 seconds

Setting schur_pc to "gkgdiag"

SCR Solver Summary:

RHS V Solve: = 0.01549 secs / 1 its

Pressure Solve: = 0.01932 secs / 2 its

Final V Solve: = 0.007691 secs / 1 its

Total BSSCR Linear solve time: 0.861784 seconds

Linear solver (DE0GNTZ5__system-execute), solution time 8.627349e-01 (secs)

In func SystemLinearEquations_NonLinearExecute: Iteration 2 of 500 - Residual 0.0026953 - Tolerance = 0.01

Non linear solver - Residual 2.69528805e-03; Tolerance 1.0000e-02 - Converged - 2.977151e+00 (secs)

In func SystemLinearEquations_NonLinearExecute: Converged after 2 iterations.

Linear solver (IJ8VUA9C__system-execute)

Linear solver (IJ8VUA9C__system-execute), solution time 4.675663e-02 (secs)

Time Integration 2nd order: 3X134A6I__integrand - 0.0342 [min] / 0.0457 [max] (secs)

Time Integration - 0.0455469 [min] / 0.0457304 [max] (secs)

Time Integration 2nd order: UCJCLDRK__integrand - 0.0000 [min] / 0.0000 [max] (secs)

Time Integration - 0.000176123 [min] / 0.000232391 [max] (secs)

Time Integration 2nd order: FXB31N04__integrand - 0.0000 [min] / 0.0000 [max] (secs)

Time Integration - 9.1683e-05 [min] / 0.000136983 [max] (secs)

Time Integration 2nd order: TVPD1BDF__integrand - 0.0000 [min] / 0.0066 [max] (secs)

Time Integration - 0.00660362 [min] / 0.00664722 [max] (secs)
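Since the job keeps running without raising an error, one way to notice the stall early is to look at the wall-clock stamp on the last `Step:` header in the log. The following is only a minimal sketch in plain Python, not part of Underworld: the function names and the regular expression are hypothetical, and it assumes the header keeps the format shown in the log above.

```python
import re
from datetime import datetime

# Assumed header format (taken from the log in this issue):
#   Step: 39 Model Time: 16.1 megayear dt: 3564.9 year (2022-07-29 21:59:27)
STEP_RE = re.compile(
    r"Step:\s*(\d+)\s+Model Time:\s*([\d.eE+-]+)\s+\w+\s+"
    r"dt:\s*[\d.eE+-]+\s+\w+\s+"
    r"\((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\)"
)


def last_step(log_text):
    """Return (step, model_time, wall_clock) for the last step header, or None."""
    matches = STEP_RE.findall(log_text)
    if not matches:
        return None
    step, model_time, stamp = matches[-1]
    return int(step), float(model_time), datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")


def hours_since_last_step(log_text, now):
    """Hours elapsed between the last step header and `now`, or None if no header."""
    found = last_step(log_text)
    if found is None:
        return None
    return (now - found[2]).total_seconds() / 3600.0


# Hypothetical usage with the header line from this issue.
sample = "..... Step: 39 Model Time: 16.1 megayear dt: 3564.9 year (2022-07-29 21:59:27)"
now = datetime(2022, 7, 30, 21, 59, 27)
print(last_step(sample))
print(hours_since_last_step(sample, now))
```

A scheduler-side script could call `hours_since_last_step` on the log file contents and flag the run once the value exceeds whatever a normal time step takes.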

PatriceFRey • Jul 29 '22 22:07

Hi Patrice!

Does the model include advection/diffusion? If so, are you using SUPG or semi-Lagrangian?
If SUPG, are you reloading the phiDotField at restart?

It might be useful to wrap the viscosity in a fn.view.min_max() to check that it is > 0.
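As a rough illustration of that check, the sketch below is plain Python with a hypothetical helper name. It takes the global minimum and maximum viscosity as ordinary floats and reports anything suspicious. Getting those two numbers from the `fn.view.min_max` wrapper is shown only in comments, and the exact calls there are an assumption to verify against the Underworld2 documentation.

```python
import math


def check_viscosity_bounds(eta_min, eta_max):
    """Return a list of problems found in the global viscosity range (empty if none)."""
    problems = []
    for name, value in (("min", eta_min), ("max", eta_max)):
        if not math.isfinite(value):
            problems.append(f"{name} viscosity is not finite: {value}")
    if math.isfinite(eta_min) and eta_min <= 0.0:
        problems.append(f"min viscosity is not strictly positive: {eta_min}")
    return problems


# In an Underworld2 script the two values would come from the wrapped function,
# roughly as follows (assumed usage, not executed here):
#   visc_mm = fn.view.min_max(viscosityFn)   # use visc_mm wherever viscosityFn was used
#   ... run the solve ...
#   eta_min, eta_max = visc_mm.min_global(), visc_mm.max_global()

# Hypothetical values for illustration only.
print(check_viscosity_bounds(1e19, 1e24))
print(check_viscosity_bounds(-3.0, 1e24))
```

Running the check on the first time step after restart, and comparing it with the same step in the original run, would show whether the reloaded fields produce a different viscosity range.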

jmansour • Jul 30 '22 01:07

Hi John, thanks for the reply (on a Sunday!). Yep, we are using SUPG by default; as for the reloading, we also use the default options. OK, I'll try wrapping the viscosity in view.min_max to check that there are no negative values. Thanks!

PatriceFRey • Jul 31 '22 00:07

Hi Patrice, I remember we talked about this issue a month back, and we turned off the passive tracers as they were causing issues after restart. Is this new case with the tracer particles removed?

@jmansour :+1: about the phiDotField, if you're using SUPG. @patrice-rey, are you able to use SLCN instead for this run? Note that SLCN can't handle non-orthogonal meshes; see https://www.underworldcode.org/articles/underworld-release-2-8/

julesghub • Jul 31 '22 23:07