Out-of-memory error during program teardown deallocations

Open nselliott opened this issue 1 year ago • 4 comments

Describe the bug

We have a SAMRAI test problem running on CPUs only that allocates most of its arrays for numerical data using QuickPool host allocators. When deallocating those arrays during program teardown, we hit an out-of-memory error when QuickPool goes into do_coalesce() and tries to malloc a large chunk of memory.

To Reproduce

I have provided a reproducer and build/run instructions to @mcfadden8.

Expected behavior

We did not expect a call to umpire::Allocator::deallocate() to cause an allocation call that hits an OOM error.

Compilers & Libraries

  • Umpire version: 2023.06.0
  • Compiler & version: The reproducer was provided using gcc 10.3.1 on TOSS4. I don't believe this is unique to a particular compiler/platform.

Additional context

We have a workaround that makes CPU-only runs use a default host allocator instead of a QuickPool-based allocator. This works, but we would like our CPU-only tests to use QuickPool, as we use QuickPool on GPUs and want to keep the code base for CPU and GPU unified wherever possible. We don't know whether this bug could also occur on allocation/deallocation of GPU data, though we have not seen this kind of error on a GPU run.

nselliott avatar Aug 19 '23 20:08 nselliott

Thank you for writing this up @nselliott, we are tracking this issue here: https://rzlc.llnl.gov/gitlab/umpire/umpire/-/issues/12

I've been able to reproduce the issue and am investigating the cause. It is normal behavior for Umpire to coalesce blocks of pool memory as they become available at deallocation time. The size of the allocation that triggers the OOM appears to be bogus (extremely large). I'm instrumenting the library to determine where the internal accounting is going wrong.
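
To illustrate why a deallocation can end up calling malloc, here is a minimal toy sketch of a coalescing pool. It is not Umpire's QuickPool code: `ToyPool`, its members, and the "coalesce once every chunk is free" rule are all assumptions made for illustration. It only shows that coalescing replaces several free chunks with one larger allocation, so a wrong size in the pool's bookkeeping would turn into an oversized request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Toy model only: this is NOT Umpire's QuickPool implementation.
class ToyPool {
 public:
  ToyPool() = default;
  ToyPool(const ToyPool&) = delete;
  ToyPool& operator=(const ToyPool&) = delete;

  ~ToyPool() {
    for (auto& c : chunks_) std::free(c.ptr);
  }

  // Hand out a block, reusing a free chunk when one is large enough.
  void* allocate(std::size_t bytes) {
    assert(bytes > 0);
    for (auto& c : chunks_) {
      if (c.free && c.size >= bytes) {
        c.free = false;
        return c.ptr;
      }
    }
    void* p = backing_alloc(bytes);
    chunks_.push_back({p, bytes, false});
    return p;
  }

  // Return a block to the pool. Once every chunk is free, coalesce,
  // which itself requests new memory from the backing allocator.
  void deallocate(void* p) {
    for (auto& c : chunks_) {
      if (c.ptr == p) {
        c.free = true;
        break;
      }
    }
    if (chunks_.size() > 1 && all_free()) coalesce();
  }

  std::size_t malloc_calls() const { return malloc_calls_; }
  std::size_t chunk_count() const { return chunks_.size(); }
  std::size_t pooled_bytes() const {
    std::size_t total = 0;
    for (const auto& c : chunks_) total += c.size;
    return total;
  }

 private:
  struct Chunk {
    void* ptr;
    std::size_t size;
    bool free;
  };

  void* backing_alloc(std::size_t bytes) {
    ++malloc_calls_;
    void* p = std::malloc(bytes);
    if (p == nullptr) throw std::bad_alloc();  // the OOM path
    return p;
  }

  bool all_free() const {
    for (const auto& c : chunks_) {
      if (!c.free) return false;
    }
    return true;
  }

  // Release all free chunks and replace them with one chunk of the
  // combined size. If `total` were wrong, this request could be huge.
  void coalesce() {
    std::size_t total = 0;
    for (auto& c : chunks_) {
      total += c.size;
      std::free(c.ptr);
    }
    chunks_.clear();
    void* p = backing_alloc(total);
    chunks_.push_back({p, total, true});
  }

  std::vector<Chunk> chunks_;
  std::size_t malloc_calls_ = 0;
};
```

In this toy, allocating 64 and 128 bytes and then deallocating both produces a third malloc call, made from inside `deallocate()`, for the combined 192 bytes. The size of that request comes entirely from the pool's own bookkeeping.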

I am glad to hear that you are able to temporarily work around this issue while we work on a fix.

mcfadden8 avatar Aug 20 '23 13:08 mcfadden8

https://github.com/LLNL/Umpire/pull/845

mcfadden8 avatar Aug 21 '23 17:08 mcfadden8

@mcfadden8 Did that pull request sufficiently fix this?

nselliott avatar Dec 12 '23 21:12 nselliott

@nselliott - Yes. There is more information provided here: https://rzlc.llnl.gov/gitlab/umpire/umpire/-/issues/12

mcfadden8 avatar Dec 14 '23 21:12 mcfadden8