nalu-wind
nalu-wind copied to clipboard
Memory leak in in SolverAlgorithm
We've had a memory leak showing up in the nightly dashboard for a while. https://my.cdash.org/test/61728724
This is the same leak in the unit tests and I've traced it to SolverAlgorithm.C:L60.
4: Indirect leak of 51200 byte(s) in 80 object(s) allocated from:
4: #0 0xc5993f in malloc /projects/cde/v3/cee/spack/var/spack/stage/spack-stage-llvm-12.0.1-4ld5bciw4oetcntm2hgn23pklymf55ou/spack-src/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3
4: #1 0x7f663a2dd482 in Kokkos::HostSpace::impl_allocate(char const*, unsigned long, unsigned long, Kokkos_Profiling_SpaceHandle) const (/home/psakiev/soft/spack-manager/spack/opt/spack/linux-rhel7-x86_64/clang-12.0.1/trilinos-stable-xvbe2e6a3e3tbckfyb2r3ap3cz2twdr2/lib/libkokkoscore.so.13+0x3e482)
4: #2 0x2212a6a in sierra::nalu::NGPApplyCoeff::NGPApplyCoeff(sierra::nalu::EquationSystem*) /scratch/psakiev/clang-debug/nalu-wind/src/SolverAlgorithm.C:60:39
4: #3 0x206055e in sierra::nalu::SolverAlgorithm::coeff_applier() /scratch/psakiev/clang-debug/nalu-wind/include/SolverAlgorithm.h:89:42
4: #4 0x206055e in sierra::nalu::AssembleFaceElemSolverAlgorithm::execute() /scratch/psakiev/clang-debug/nalu-wind/src/AssembleFaceElemSolverAlgorithm.C:93:23
4: #5 0x2216d8b in sierra::nalu::SolverAlgorithmDriver::execute() /scratch/psakiev/clang-debug/nalu-wind/src/SolverAlgorithmDriver.C:123:18
4: #6 0x215f8fd in sierra::nalu::EquationSystem::assemble_and_solve(stk::mesh::FieldBase*) /scratch/psakiev/clang-debug/nalu-wind/src/EquationSystem.C:314:21
4: #7 0xd19545 in sierra::nalu::MomentumEquationSystem::assemble_and_solve(stk::mesh::FieldBase*) /scratch/psakiev/clang-debug/nalu-wind/src/LowMachEquationSystem.C:2674:19
4: #8 0xcee0f3 in sierra::nalu::LowMachEquationSystem::solve_and_update() /scratch/psakiev/clang-debug/nalu-wind/src/LowMachEquationSystem.C:743:23
4: #9 0xccd7bb in sierra::nalu::EquationSystems::solve_and_update() /scratch/psakiev/clang-debug/nalu-wind/src/EquationSystems.C:794:12
4: #10 0xeadc2b in sierra::nalu::Realm::nonlinear_iterations(int) /scratch/psakiev/clang-debug/nalu-wind/src/Realm.C:1948:47
4: #11 0xea9bd8 in sierra::nalu::Realm::advance_time_step() /scratch/psakiev/clang-debug/nalu-wind/src/Realm.C:1933:3
4: #12 0xcb7cf2 in sierra::nalu::TimeIntegrator::integrate_realm() /scratch/psakiev/clang-debug/nalu-wind/src/TimeIntegrator.C:392:16
4: #13 0xc964ec in main /scratch/psakiev/clang-debug/nalu-wind/nalu.C:193:9
@ldh4 this looks like it could be related to work you did in #990?
I can't seem to access the dashboard right now. I am getting:
Network Error (tcp_error)
A communication error occurred: "" The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.
Basically, I'd like to know when this leak started to show up.
#987 was the fix for the memory leak revolving around NGPApplyCoeff
that led to the segfault, fixed by #990.
I believe I checked that the leak isn't back when I pushed #990, but I can check again.
@ldh4 it goes back as far as the dashboard stores information. Which admittedly isn't very far. I'm setting up on my mac now to see if I can reproduce there. I was able to reproduce on ascicgpu22
with the spec nalu-wind@master+asan+openfast+hypre build_type=RelWithDebInfo
.

I do not see this memory leak on my mac with asan
turned on. Interesting.
@ldh4 I've been reviewing this further with @PaulMullowney and I think there are some design changes from #937 that need to be refactored to unify TeptraLinearSystem
and HypreLinearSystem
's CoeffApplier
implementations. We ran the hypre only test for this case and it does not show memory leaks. I suspect the root cause of these leaks are a missing free for the CoeffApplier
, but it would make things less confusing if we unified the design a little more.
In particular we think the device_pointer
function(s) in TpetraLinearSystem
should be restored
- from https://github.com/Exawind/nalu-wind/blob/f8508d410957cd9a0c704579101077abcd6fcc2a/include/TpetraLinearSystem.h#L256-L258
- to https://github.com/Exawind/nalu-wind/blob/d24dd3be634f51baa3016d0557ba1a2d531039b0/src/TpetraLinearSystem.C#L1375-L1397
and the asymmetry in the NGPApplyCoeff
's method for removing memory
- https://github.com/Exawind/nalu-wind/blob/f8508d410957cd9a0c704579101077abcd6fcc2a/src/SolverAlgorithm.C#L81
Looking at #990 it seems like
owns_coeff_applier
would be better namedis_hypre_instance
. It would be nice to get these two using the same strategy for memory management if possible and not have to bifurcate.
Pinging @alanw0 so we can get this on the STK team workload.
I agree that TpetraLinearSystem
should be refactored further, specifically regarding CoeffApplier
. The main difficulty I noticed while modifying TpetraLinearSystem
for #937 was from Tpetra's restriction for variable reference holding, which led us to make this workaround for how CoeffAppliers
are created and destroyed. It's something we should take another look at.
@ldh4 yes that sounds good to me. It might be worth reviewing the variable reference issues with @PaulMullowney in a mob programming session so we can get his perspective from the HypreLinearSystem work he's done.
ok we'll get a story on the backlog to take care of this.
@alanw0 -- is there any news on this issue?
No. This may have been lost in the rush to get things working on crusher/frontier... @ldh4 What was the resolution of this issue?
@alanw0 I don't know we made a strong action to further resolve this.
Based on my memory, we had concluded that the differences in implementation designs between HypreLinearSystem
and TpetraLinearSystem
are partially contributing to this recurring memory leak/segfault issue.
I recall there being further discussions that HypreLinearSystem
is the favored one at the time and that TpetraLinearSystem
should be refactored to have similar code flow as HypreLInearSystem
, which would have dissolved this recurring issue. But I do not know if that refactoring took a place eventually.
@michaelasprague Dong Hun's comment confirms my suspicion that this memory-leak fix got lower priority when we removed Trilinos solvers from the default build as we were prioritizing the AMD/ROCm work for crusher/frontier. Shall we close this issue as "won't fix"?
I defer to @psakievich on if it belongs in won't fix
I think backlogged would be a better choice over won't fix. Currently the vast majority of our tests are configured with tpetra solvers and our linear solvers bench is shallower than it used to be.
Unless we plan to totally abandon tpetra solvers I would say a plan should be made to fix this and it would be good to assign to the next person who steps into the role of linear solvers lead.