pumi-pic
CSR Having Unexpectedly Large Memory Usage
I've been working on the `cabmBuild` branch and noticed some unexpected behavior while testing `CSR`. To test larger amounts of data per particle, a new version of the testing code `ps_combo.cpp` was made, `ps_combo32.cpp`, which uses a size-32 array of doubles for each particle instead of the original size-3 array. This is linked here.
During comparative testing for `CabM` on AiMOS, we found that `CSR` aborts with an out-of-memory error at 50,000 elements and 50,000,000 particles. The test command and error message are included below:
Test Command:
./ps_combo32 50000 50000000 1 -p 50 -n 1
Generating particle distribution with strategy: Uniform
Building CSR
Performing 100 iterations of rebuild on each structure
Beginning push on structure CSR
Beginning rebuild on structure CSR
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /gpfs/u/barn/MPFS/MPFSmttw/pumipic_CabM/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
Traceback functionality not available
[dcs044:159743] *** Process received signal ***
[dcs044:159743] Signal: Aborted (6)
[dcs044:159743] Signal code: (-6)
[dcs044:159743] [ 0] [0x7fff8ad704d8]
[dcs044:159743] [ 1] /usr/lib64/libc.so.6(abort+0x2b4)[0x7fff89412094]
[dcs044:159743] [ 2] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x7fff897a0644]
[dcs044:159743] [ 3] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(+0xab364)[0x7fff8979b364]
[dcs044:159743] [ 4] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x7fff8979b420]
[dcs044:159743] [ 5] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x7fff8979b8e0]
[dcs044:159743] [ 6] ./ps_combo32(_ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc4)[0x101aedc0]
[dcs044:159743] [ 7] ./ps_combo32(_ZN6Kokkos4Impl25cuda_internal_error_throwE9cudaErrorPKcS3_i+0x170)[0x101b0f40]
[dcs044:159743] [ 8] ./ps_combo32(_ZN6Kokkos4Impl23cuda_internal_safe_callE9cudaErrorPKcS3_i+0x60)[0x101b4128]
[dcs044:159743] [ 9] ./ps_combo32(_ZNK6Kokkos9CudaSpace8allocateEm+0x60)[0x101b6478]
[dcs044:159743] [10] ./ps_combo32(_ZN6Kokkos4Impl22SharedAllocationRecordINS_9CudaSpaceEvEC2ERKS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmPFvPNS1_IvvEEE+0x4c)[0x101b78a8]
[dcs044:159743] [11] ./ps_combo32(_ZN6Kokkos4ViewIPA32_dJNS_10LayoutLeftENS_6DeviceINS_4CudaENS_9CudaSpaceEEEEEC2IJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEERKNS_4Impl12ViewCtorPropIJDpT_EEERKNSt9enable_ifIXntsrSK_11has_pointerES3_E4typeE+0x10c)[0x1016632c]
[dcs044:159743] [12] ./ps_combo32(_ZN7pumipic3CSRINS_11MemberTypesIJiA32_ddEEEN6Kokkos9CudaSpaceEE7rebuildENS4_4ViewIPiJNS4_6DeviceINS4_4CudaES5_EEEEESC_PPv+0x308)[0x1017f628]
[dcs044:159743] [13] ./ps_combo32(main+0x1800)[0x100a8e60]
[dcs044:159743] [14] /usr/lib64/libc.so.6(+0x25200)[0x7fff893f5200]
[dcs044:159743] [15] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x7fff893f53f4]
[dcs044:159743] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node dcs044 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
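For a sense of scale, frames [11] and [12] of the traceback appear to demangle to the allocation of a Kokkos view over `double[32]` elements inside `pumipic::CSR<pumipic::MemberTypes<int, double[32], double>, Kokkos::CudaSpace>::rebuild`. The sketch below estimates the size of that single allocation, assuming it is sized for exactly 50,000,000 particles with no padding; the helper name `member_bytes` is made up for this illustration and is not part of pumi-pic.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed to hold one array-of-doubles member
// for every particle, ignoring any padding or allocator overhead.
constexpr std::uint64_t member_bytes(std::uint64_t num_particles,
                                     std::uint64_t doubles_per_particle) {
  return num_particles * doubles_per_particle * sizeof(double);
}

// 50,000,000 particles x 32 doubles x 8 bytes per double.
static_assert(sizeof(double) == 8, "estimate assumes 8-byte doubles");
static_assert(member_bytes(50000000ULL, 32) == 12800000000ULL,
              "one copy of the double[32] member is 12.8e9 bytes");
```

Under these assumptions, one copy of the array member alone is 12.8 GB, before counting any temporary copy made during the rebuild.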
However, both the `SCS` and `CabM` particle structures do not fail until our next set of tests at 75,000 elements and 75,000,000 particles. We investigated by running `ps_combo32` again with the number of iterations at line 89 (originally 100) reduced to 1. ~~In this case, all three particle structures failed due to an out-of-memory error at 75,000 elements and 75,000,000 particles. This leads me to suspect that there is some sort of large-scale memory error in `CSR` or possibly the testing code.~~ (See the edit below.)
For reference, the set of tests we were running is in the file `test_largeE_largeP.sh`, located here (using the second commented-out call to `ps_combo`, for use on AiMOS).
EDIT: Upon further inspection, this does not seem to be a memory leak. However, `CSR` is using much more memory than expected. I've checked, and `particles_on_process` seems to be calculated correctly, here. I ran some performance tests on `CSR` using the Kokkos memory-usage tools, here, with the test `mpirun -np 1 ./ps_combo160 1000 1000000 1 -n 1` on a 6-GPU node on AiMOS. At their maximums, `CabM` uses 331.2 MB and `CSR` uses 470.8 MB. This is unexpected because `CabM` should be allocating more memory due to its padding. I think I've tracked it down to the `particle_info` temporary `MTVs` in `CSR::rebuild`, here, but I'm not sure how it could be allocating this much extra space.
UPDATE: The issue was found. Because CSR uses `MTVs` to store its particle data and continually creates and destroys them, the `get` calls were leaving a few smart pointers to the original set of data alive. Thus, when rebuilding, CSR was using 3x the memory of `ptcl_data` instead of just 2x. For now, this has been fixed by enclosing these `get` calls in a for loop, so that the smart pointers go out of scope before the call to `migrate`/`rebuild`.
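The lifetime effect described in this update can be reproduced without Kokkos. The following is a minimal sketch, assuming `std::shared_ptr` as a stand-in for the reference-counted views; `Buffer`, `Structure`, and both `peak_*` functions are hypothetical names, and the sequence shown is only one plausible way a lingering handle leads to three live copies. It uses a plain brace scope where the actual fix reportedly used a for loop.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical buffer that counts how many instances are alive at once.
struct Buffer {
  static int live;
  static int peak;
  std::vector<double> data;
  explicit Buffer(std::size_t n) : data(n) {
    ++live;
    if (live > peak) peak = live;
  }
  ~Buffer() { --live; }
  Buffer(const Buffer&) = delete;
  Buffer& operator=(const Buffer&) = delete;
};
int Buffer::live = 0;
int Buffer::peak = 0;

// Hypothetical structure whose rebuild allocates a new buffer, then
// drops its own reference to the old one.
struct Structure {
  std::shared_ptr<Buffer> ptcl_data;
  explicit Structure(std::size_t n) : ptcl_data(std::make_shared<Buffer>(n)) {}
  std::shared_ptr<Buffer> get() const { return ptcl_data; }
  void rebuild() {
    auto fresh = std::make_shared<Buffer>(ptcl_data->data.size());
    ptcl_data = fresh;  // old buffer is freed only if nobody else holds it
  }
};

// A handle obtained before the rebuild loop stays alive throughout it.
int peak_with_lingering_handle(std::size_t n, int iterations) {
  Buffer::peak = 0;
  Structure s(n);
  auto handle = s.get();
  for (int i = 0; i < iterations; ++i) s.rebuild();
  return Buffer::peak;
}

// The handle is confined to an inner scope and released before rebuilding.
int peak_with_scoped_handle(std::size_t n, int iterations) {
  Buffer::peak = 0;
  Structure s(n);
  {
    auto handle = s.get();
    (void)handle;
  }
  for (int i = 0; i < iterations; ++i) s.rebuild();
  return Buffer::peak;
}
```

In this sketch, calling each function with two or more iterations gives a peak of three live buffers when the handle lingers and two when it is scoped.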
A general fix has been proposed and is underway: a second copy of `ptcl_data` would be stored at all times for swapping purposes (like SCS) for both CSR and CabanaM.
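As a rough illustration of the swapping idea, here is a minimal host-side sketch that keeps a spare buffer next to the particle data and swaps the two on rebuild. `SwappingStore` and its members are hypothetical names, `std::vector` stands in for device views, and none of this is taken from the SCS, CSR, or CabanaM code.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical store that always holds two equally sized buffers.
struct SwappingStore {
  std::size_t capacity;   // slots allocated in each buffer
  std::size_t num_ptcls;  // slots currently in use
  std::vector<double> ptcl_data;
  std::vector<double> scratch;

  explicit SwappingStore(std::size_t n)
      : capacity(n), num_ptcls(n), ptcl_data(n), scratch(n) {}

  // Rebuilds into the spare buffer and swaps it in. Returns true when the
  // existing spare was large enough, false when a full reallocation was needed.
  bool rebuild(std::size_t new_num_ptcls) {
    const bool fits = new_num_ptcls <= capacity;
    if (!fits) {
      capacity = new_num_ptcls;
      scratch.assign(capacity, 0.0);  // grow the spare before writing into it
    }
    const std::size_t keep = std::min(num_ptcls, new_num_ptcls);
    std::copy(ptcl_data.begin(),
              ptcl_data.begin() + static_cast<std::ptrdiff_t>(keep),
              scratch.begin());
    std::swap(ptcl_data, scratch);
    if (!fits) scratch.assign(capacity, 0.0);  // bring the new spare up to size
    num_ptcls = new_num_ptcls;
    return fits;
  }
};
```

With this layout, memory stays at two buffers as long as the particle count fits the existing capacity; only growth past that capacity forces new allocations.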
Once CSR has its swapping implementation done, we could probably close this issue, although the issue is still technically there for cases in which CSR increases in size so that it triggers a full rebuild.