nalu-wind Memory leak in in SolverAlgorithm

We've had a memory leak showing up in the nightly dashboard for a while. https://my.cdash.org/test/61728724

This is the same leak in the unit tests and I've traced it to SolverAlgorithm.C:L60.

   4: Indirect leak of 51200 byte(s) in 80 object(s) allocated from:                                                                                                                                                                                                                                                                                                      
   4:     #0 0xc5993f in malloc /projects/cde/v3/cee/spack/var/spack/stage/spack-stage-llvm-12.0.1-4ld5bciw4oetcntm2hgn23pklymf55ou/spack-src/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3                                                                                                                                                                            
   4:     #1 0x7f663a2dd482 in Kokkos::HostSpace::impl_allocate(char const*, unsigned long, unsigned long, Kokkos_Profiling_SpaceHandle) const (/home/psakiev/soft/spack-manager/spack/opt/spack/linux-rhel7-x86_64/clang-12.0.1/trilinos-stable-xvbe2e6a3e3tbckfyb2r3ap3cz2twdr2/lib/libkokkoscore.so.13+0x3e482)                                                        
   4:     #2 0x2212a6a in sierra::nalu::NGPApplyCoeff::NGPApplyCoeff(sierra::nalu::EquationSystem*) /scratch/psakiev/clang-debug/nalu-wind/src/SolverAlgorithm.C:60:39                                                                                                                                                                                                    
   4:     #3 0x206055e in sierra::nalu::SolverAlgorithm::coeff_applier() /scratch/psakiev/clang-debug/nalu-wind/include/SolverAlgorithm.h:89:42                                                                                                                                                                                                                           
   4:     #4 0x206055e in sierra::nalu::AssembleFaceElemSolverAlgorithm::execute() /scratch/psakiev/clang-debug/nalu-wind/src/AssembleFaceElemSolverAlgorithm.C:93:23                                                                                                                                                                                                     
   4:     #5 0x2216d8b in sierra::nalu::SolverAlgorithmDriver::execute() /scratch/psakiev/clang-debug/nalu-wind/src/SolverAlgorithmDriver.C:123:18                                                                                                                                                                                                                        
   4:     #6 0x215f8fd in sierra::nalu::EquationSystem::assemble_and_solve(stk::mesh::FieldBase*) /scratch/psakiev/clang-debug/nalu-wind/src/EquationSystem.C:314:21                                                                                                                                                                                                      
   4:     #7 0xd19545 in sierra::nalu::MomentumEquationSystem::assemble_and_solve(stk::mesh::FieldBase*) /scratch/psakiev/clang-debug/nalu-wind/src/LowMachEquationSystem.C:2674:19                                                                                                                                                                                       
   4:     #8 0xcee0f3 in sierra::nalu::LowMachEquationSystem::solve_and_update() /scratch/psakiev/clang-debug/nalu-wind/src/LowMachEquationSystem.C:743:23                                                                                                                                                                                                                
   4:     #9 0xccd7bb in sierra::nalu::EquationSystems::solve_and_update() /scratch/psakiev/clang-debug/nalu-wind/src/EquationSystems.C:794:12                                                                                                                                                                                                                            
   4:     #10 0xeadc2b in sierra::nalu::Realm::nonlinear_iterations(int) /scratch/psakiev/clang-debug/nalu-wind/src/Realm.C:1948:47                                                                                                                                                                                                                                       
   4:     #11 0xea9bd8 in sierra::nalu::Realm::advance_time_step() /scratch/psakiev/clang-debug/nalu-wind/src/Realm.C:1933:3                                                                                                                                                                                                                                              
   4:     #12 0xcb7cf2 in sierra::nalu::TimeIntegrator::integrate_realm() /scratch/psakiev/clang-debug/nalu-wind/src/TimeIntegrator.C:392:16                                                                                                                                                                                                                              
   4:     #13 0xc964ec in main /scratch/psakiev/clang-debug/nalu-wind/nalu.C:193:9

Sep 09 '22 22:09 psakievich

@ldh4 this looks like it could be related to work you did in #990?

Sep 09 '22 22:09 psakievich

I can't seem to access the dashboard right now. I am getting:

Network Error (tcp_error)

A communication error occurred: "" The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.

Basically, I'd like to know when this leak started to show up.

#987 was the fix for the memory leak revolving around NGPApplyCoeff that led to the segfault, fixed by #990. I believe I checked that the leak isn't back when I pushed #990, but I can check again.

Sep 09 '22 22:09 ldh4

@ldh4 it goes back as far as the dashboard stores information. Which admittedly isn't very far. I'm setting up on my mac now to see if I can reproduce there. I was able to reproduce on ascicgpu22 with the spec nalu-wind@master+asan+openfast+hypre build_type=RelWithDebInfo.

Sep 09 '22 22:09 psakievich

I do not see this memory leak on my mac with asan turned on. Interesting.

Sep 10 '22 04:09 psakievich

@ldh4 I've been reviewing this further with @PaulMullowney and I think there are some design changes from #937 that need to be refactored to unify TeptraLinearSystem and HypreLinearSystem's CoeffApplier implementations. We ran the hypre only test for this case and it does not show memory leaks. I suspect the root cause of these leaks are a missing free for the CoeffApplier, but it would make things less confusing if we unified the design a little more.

In particular we think the device_pointer function(s) in TpetraLinearSystem should be restored

from https://github.com/Exawind/nalu-wind/blob/f8508d410957cd9a0c704579101077abcd6fcc2a/include/TpetraLinearSystem.h#L256-L258
to https://github.com/Exawind/nalu-wind/blob/d24dd3be634f51baa3016d0557ba1a2d531039b0/src/TpetraLinearSystem.C#L1375-L1397

and the asymmetry in the NGPApplyCoeff's method for removing memory

https://github.com/Exawind/nalu-wind/blob/f8508d410957cd9a0c704579101077abcd6fcc2a/src/SolverAlgorithm.C#L81 Looking at #990 it seems like owns_coeff_applier would be better named is_hypre_instance. It would be nice to get these two using the same strategy for memory management if possible and not have to bifurcate.

Pinging @alanw0 so we can get this on the STK team workload.

Sep 12 '22 22:09 psakievich

I agree that TpetraLinearSystem should be refactored further, specifically regarding CoeffApplier. The main difficulty I noticed while modifying TpetraLinearSystem for #937 was from Tpetra's restriction for variable reference holding, which led us to make this workaround for how CoeffAppliers are created and destroyed. It's something we should take another look at.

Sep 12 '22 22:09 ldh4

@ldh4 yes that sounds good to me. It might be worth reviewing the variable reference issues with @PaulMullowney in a mob programming session so we can get his perspective from the HypreLinearSystem work he's done.

Sep 12 '22 22:09 psakievich

ok we'll get a story on the backlog to take care of this.

Sep 13 '22 02:09 alanw0

@alanw0 -- is there any news on this issue?

Jul 08 '24 20:07 michaelasprague

No. This may have been lost in the rush to get things working on crusher/frontier... @ldh4 What was the resolution of this issue?

Jul 08 '24 20:07 alanw0

@alanw0 I don't know we made a strong action to further resolve this. Based on my memory, we had concluded that the differences in implementation designs between HypreLinearSystem and TpetraLinearSystem are partially contributing to this recurring memory leak/segfault issue.

I recall there being further discussions that HypreLinearSystem is the favored one at the time and that TpetraLinearSystem should be refactored to have similar code flow as HypreLInearSystem, which would have dissolved this recurring issue. But I do not know if that refactoring took a place eventually.

Jul 08 '24 20:07 ldh4

@michaelasprague Dong Hun's comment confirms my suspicion that this memory-leak fix got lower priority when we removed Trilinos solvers from the default build as we were prioritizing the AMD/ROCm work for crusher/frontier. Shall we close this issue as "won't fix"?

Jul 08 '24 20:07 alanw0

I defer to @psakievich on if it belongs in won't fix

Jul 08 '24 21:07 michaelasprague

I think backlogged would be a better choice over won't fix. Currently the vast majority of our tests are configured with tpetra solvers and our linear solvers bench is shallower than it used to be.

Unless we plan to totally abandon tpetra solvers I would say a plan should be made to fix this and it would be good to assign to the next person who steps into the role of linear solvers lead.

Jul 09 '24 01:07 psakievich

nalu-wind nalu-wind copied to clipboard

Memory leak in in SolverAlgorithm

Network Error (tcp_error)

nalu-wind
nalu-wind copied to clipboard