Seg Fault when trying to use new MueLu preconditioners for Enthalpy problems

Open mperego opened this issue 3 years ago • 16 comments

I get a segmentation fault when using the new MueLu settings provided by Ray Tuminaro for the Humboldt problem. I tried different settings; below is the error for P1semiR1transP2const:

2: ************************************************************************
2: -- Nonlinear Solver Step 0 -- 
2: ||F|| = 7.439e+02  step = 0.000e+00  dx = 0.000e+00
2: ************************************************************************
2: 
2:  Phalanx writing graphviz file for graph of FM0Jacobian (detail = 2)
2:  Process using 'dot -Tpng -O phalanxGraphFM0Jacobian
2:  ************* Phalanx Setup **************
2:  ************ Evaluation Types ************
2:    FM0Jacobian
2:    DFM0Residual
2:    FM0Residual
2:  
2:  ******************************************
2:  Phalanx writing graphviz file for graph of DFM0Jacobian (detail = 2)
2:  Process using 'dot -Tpng -O phalanxGraphDFM0Jacobian
2:  ************* Phalanx Setup **************
2:  ************ Evaluation Types ************
2:    DFM0Jacobian
2:    FM0Jacobian
2:    DFM0Residual
2:    FM0Residual
2:  
2:  ******************************************
2: --------------------------------------------------------------------------
2: Primary job  terminated normally, but 1 process returned
2: a non-zero exit code. Per user-direction, the job has been aborted.
2: --------------------------------------------------------------------------
2: --------------------------------------------------------------------------
2: mpiexec noticed that process rank 3 with PID 0 on node s1026095 exited on signal 11 (Segmentation fault).

I didn't get much info running dbg. To reproduce the error, build branch https://github.com/sandialabs/Albany/tree/enthalpy_muelu and run the Enthalpy tests: ctest -R Enthalpy_Humboldt_MueLu

mperego avatar Nov 15 '22 21:11 mperego

@mperego I have an Albany executable on Perlmutter, but that's probably not the easiest platform to debug on. Is there another machine you'd suggest I build on?

jhux2 avatar Nov 15 '22 22:11 jhux2

Thanks @jhux2. You could use blake. We have scripts for building Trilinos and Albany. I think you can use the gcc modules script blake_gcc_modules_submit.sh and the cmake scripts do-cmake-trilinos-gcc-serial and do-cmake-albany-serial. I got the error with the gcc compiler. @jewatkins do you have better advice?

mperego avatar Nov 15 '22 23:11 mperego

blake is probably the best option right now. The gcc build is a debug build so it will run slow but it might give you more information. You can use the binary directly: /home/projects/albany/nightlyCDashAlbanyBlake/build-gcc/AlbBuildSerialGccNoWarn/src/Albany or use the trilinos install /home/projects/albany/nightlyCDashTrilinosBlake/build-gcc/TrilinosSerialInstallGccNoWarn/

jewatkins avatar Nov 15 '22 23:11 jewatkins

I've run the Humboldt test that's on the main branch, just as a sanity check. This uses the executable that @jewatkins pointed to. Right after the stacked timer output, which I assume comes at the end of the simulation, there are a few errors. Are these to be expected?

|   Albany Fill: State Residual: 0.00712972 - 0.011611% [1]
|   |   Phalanx::SortAndOrderEvaluators: 8.958e-06 - 0.125643% [5]
|   |   Remainder: 0.00712076 - 99.8744%
|   Albany: Output to File: 0.298793 - 0.486596% [1]
|   Remainder: 0.178301 - 0.29037%

***
*** Warning! The following Teuchos::RCPNode objects were created but have
*** not been destroyed yet.  A memory checking tool may complain that these
*** objects are not destroyed correctly.

jhux2 avatar Nov 16 '22 00:11 jhux2

Yes, it looks like it: https://sems-cdash-son.sandia.gov/cdash/test/3060119 We should probably look into why that's happening. The final result looks correct, though.

jewatkins avatar Nov 16 '22 00:11 jewatkins

@jhux2 any updates on this?

mperego avatar Jan 24 '23 16:01 mperego

@mperego Sorry, I've not looked at this in a while. I'll pick this back up.

jhux2 avatar Jan 24 '23 17:01 jhux2

@mperego I updated your branch with master and am seeing the following error. Has parsing of ice_thickness changed somehow?

180: ***************************************************************
180: **  ______   __       ______   ______   __   __   __  __     **
180: ** /\  __ \ /\ \     /\  == \ /\  __ \ /\ "-.\ \ /\ \_\ \    **
180: ** \ \  __ \\ \ \____\ \  __< \ \  __ \\ \ \-.  \\ \____ \   **
180: **  \ \_\ \_\\ \_____\\ \_____\\ \_\ \_\\ \_\\"\_\\/\_____\  **
180: **   \/_/\/_/ \/_____/ \/_____/ \/_/\/_/ \/_/ \/_/ \/_____/  **
180: **                                                           **
180: ***************************************************************
180: ** Trilinos git commit id - 62bb6ac4a8e
180: ** Albany git branch ------ enthalpy_muelu
180: ** Albany git commit id --- 75e0b13ba
180: ** Albany cxx compiler ---- GNU 10.1.0
180: ** Albany FadType --------- DFad
180: ** Albany TanFadType ------ DFad
180: ** Albany HessianVecFad  -- DFad
180: ** Simulation start time -- 2023-02-06 at 14:10:52
180: ***************************************************************
180:
180: p=1: *** Caught standard std::exception of type 'Teuchos::Exceptions::InvalidParameterName' :
180:
180:  Error, the parameter {name="Required Fields",type="Array(string)",value="{ice_thickness}"}
180:  in the parameter (sub)list "Albany Parameters->Problem"
180:  was not found in the list of valid parameters!
180:
180:  The valid parameters and types are:
180:    {
180:      "Name" : string =
180:      "Number of Spatial Processors" : int = -1
180:      "Phalanx Graph Visualization Detail" : int = 0
180:      "Use Physics-Based Preconditioner" : bool = 0
180:      "Physics-Based Preconditioner" : string = None
180:      "Initial Condition" : ParameterList = ...
180:      "Initial Condition Dot" : ParameterList = ...
180:      "Initial Condition DotDot" : ParameterList = ...
180:      "Source Functions" : ParameterList = ...
180:      "Absorption" : ParameterList = ...
180:      "Response Functions" : ParameterList = ...
180:      "Parameters" : ParameterList = ...
180:      "Random Parameters" : ParameterList = ...
180:      "Linear Combination Parameters" : ParameterList = ...
180:      "LogNormal Parameter" : ParameterList = ...
180:      "Teko" : ParameterList = ...
180:      "Hessian" : ParameterList = ...
180:      "XFEM" : ParameterList = ...
180:      "Dirichlet BCs" : ParameterList = ...
180:      "Neumann BCs" : ParameterList = ...
180:      "Adaptation" : ParameterList = ...
180:      "Overwrite Nominal Values With Final Point" : bool = 0
180:      "Number Of Time Derivatives" : int = 1
180:      "Use MDField Memoization" : bool = 0
180:      "Use MDField Memoization For Parameters" : bool = 0
180:      "Ignore Residual In Jacobian" : bool = 0
180:      "Perturb Dirichlet" : double = 0
180:      "Solution Method" : string = Steady
180:      "Homotopy Restart Step" : double = 1
180:      "Second Order" : string = No
180:      "Print Response Expansion" : bool = 1
180:      "Compute Sensitivities" : bool = 1
180:      "Constitutive Model NOX Status Test" : Teuchos::RCP<NOX::StatusTest::Generic> = Teuchos::RCP<NOX::StatusTest::Generic>{ptr=0,node=0,strong_count=0,weak_count=0}
180:      "LandIce Physical Parameters" : ParameterList = ...
180:      "LandIce Enthalpy" : ParameterList = ...
180:      "LandIce Viscosity" : ParameterList = ...
180:      "Stereographic Map" : ParameterList = ...
180:      "Basal Side Name" : string =
180:      "Needs Dissipation" : bool = 1
180:      "Needs Basal Friction" : bool = 1
180:    }
180:
180:
180:  Throw number = 1
180:

jhux2 avatar Feb 06 '23 21:02 jhux2

@jhux2, we cleaned up the code a bit. Please remove these lines:

    Required Fields: [ice_thickness]
    Required Basal Fields: [ice_thickness]
    Element Shape: Wedge
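
If several input files need the same cleanup, it could be scripted. The sketch below is a plain-text filter and not part of Albany: the helper name `strip_keys` and the sample fragment (including the `Name` and `Basal Side Name` values) are made up for illustration, and it only handles single-line `key: value` entries. Pass only the keys that your Albany version actually rejects.

```python
def strip_keys(yaml_text, keys):
    """Drop every line whose key (the text before the first ':') is in `keys`.

    Deliberately line-based: the remaining lines keep their indentation
    and order, but multi-line values are not handled.
    """
    kept = []
    for line in yaml_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key in keys:
            continue  # obsolete parameter, skip it
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical input fragment, for illustration only.
sample = """\
Problem:
  Name: Example
  Required Fields: [ice_thickness]
  Required Basal Fields: [ice_thickness]
  Basal Side Name: basalside
"""

cleaned = strip_keys(sample, {"Required Fields", "Required Basal Fields"})
print(cleaned)
```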

mperego avatar Feb 06 '23 21:02 mperego

Thanks, @mperego. Another error, I guess masked by the first:

    Start 180: landIce_Enthalpy_Humboldt_MueLu_P1semiR1transP2const

180: Test command: /projects/sems/install/rhel7-x86_64/sems/v2/tpl/openmpi/4.0.5/gcc/10.1.0/base/e64jpaw/bin/mpiexec "-np" "4" "/scratch/jhu/fanssie/build-albany-relwithdebinfo/src/Albany" "input_enthalpy_humboldt_muelu_P1semiR1transP2const.yaml"
180: Working Directory: /scratch/jhu/fanssie/build-albany-relwithdebinfo/tests/landIce/Enthalpy
180: Test timeout computed to be: 1500
180: ***************************************************************
180: **  ______   __       ______   ______   __   __   __  __     **
180: ** /\  __ \ /\ \     /\  == \ /\  __ \ /\ "-.\ \ /\ \_\ \    **
180: ** \ \  __ \\ \ \____\ \  __< \ \  __ \\ \ \-.  \\ \____ \   **
180: **  \ \_\ \_\\ \_____\\ \_____\\ \_\ \_\\ \_\\"\_\\/\_____\  **
180: **   \/_/\/_/ \/_____/ \/_____/ \/_/\/_/ \/_/ \/_/ \/_____/  **
180: **                                                           **
180: ***************************************************************
180: ** Trilinos git commit id - 62bb6ac4a8e
180: ** Albany git branch ------ enthalpy_muelu
180: ** Albany git commit id --- 75e0b13ba
180: ** Albany cxx compiler ---- GNU 10.1.0
180: ** Albany FadType --------- DFad
180: ** Albany TanFadType ------ DFad
180: ** Albany HessianVecFad  -- DFad
180: ** Simulation start time -- 2023-02-06 at 14:31:21
180: ***************************************************************
180: Albany_IOSS: Loading STKMesh from Exodus file  ../AsciiMeshes/Humboldt/humboldt_2d.exo
180:
180: IOSS: Using decomposition method 'RIB' for 2,611 elements on 4 mpi ranks.
180:
180: p=3: *** Caught standard std::exception of type 'Teuchos::Exceptions::InvalidParameterValue' :
180:
180:  /ascldap/users/jhu/fanssie/sources/Albany/src/disc/stk/Albany_ExtrudedSTKMeshStruct.cpp:136:
180:
180:  Throw number = 1
180:
180:  Throw test that evaluated to true: basalside_elem_name != elem2d_name
180:
180:
180:  Error in ExtrudedSTKMeshStruct: Expecting topology name of elements of 2d mesh to be Quadrilateral_4 but it is Triangle_3

jhux2 avatar Feb 06 '23 21:02 jhux2

@jhux2 I guess you merged with master before #888 got merged. If so, you need to put back Element Shape: Wedge

Let me know if this is not the issue

mperego avatar Feb 06 '23 21:02 mperego

@mperego That seems to have fixed it. I'm now back to the original error you reported. Thanks.

jhux2 avatar Feb 06 '23 21:02 jhux2

@mperego Here's a quick update. MueLu's setup is recursing until it exhausts stack memory, and one of the processes seg faults. I'm sifting through factory dependency information at the moment to see what's going wrong.
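
For illustration only: one way setup code can recurse without bound is a cycle in the dependency graph, where a factory ends up (indirectly) requesting data from itself. Whether that is what happens here is not established. The sketch below is not MueLu code; the function `find_cycle` and the factory names are hypothetical. It shows a depth-first search that reports the first cycle it finds.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if acyclic.

    `deps` maps a factory name to the names it requests data from.
    Nodes are WHITE (unseen), GREY (on the current DFS path) or
    BLACK (fully explored); reaching a GREY node means a cycle.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            s = state.get(nxt, WHITE)
            if s == GREY:
                # Close the loop: slice the path from the repeated node.
                return path[path.index(nxt):] + [nxt]
            if s == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for start in deps:
        if state.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Hypothetical factory graph, not taken from the failing input file.
deps = {
    "FactoryA": ["FactoryB"],
    "FactoryB": ["FactoryC"],
    "FactoryC": ["FactoryA"],  # closes the loop back to FactoryA
    "FactoryD": [],
}
cycle = find_cycle(deps)
print(cycle)
```

A resolver that follows such a graph without tracking which nodes are already on the current path would keep descending until the stack runs out.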

jhux2 avatar Feb 07 '23 22:02 jhux2

@jhux2 thanks for looking into that! It doesn't sound fun..

mperego avatar Feb 07 '23 22:02 mperego

@jhux2 are there any updates on this issue?

mperego avatar May 18 '23 20:05 mperego

Hi @jhux2, there have been some changes in Albany that need to be merged into this branch. A few additional changes are needed in the input files as well. Let me know when you plan to look into this and I'll do the merge and fix the input files.

mperego avatar Jun 07 '23 17:06 mperego