Crash while allocating memory
Hi
We are using the latest version of Ray on nodes with 2 TB of RAM to assemble a snake genome. Ray was compiled with GCC 5.1 using the following make command::
make PREFIX=/afs/<your_preferred_install_directory> MAXKMERLENGTH=128 MPICXX=mpic++ HAVE_LIBZ=y MPI_IO=y
Everything worked fine until we ran Ray on this large dataset, where we get the following error::
Critical exception: The system is out of memory, returned NULL. Requested -2147483648 bytes of type RAY_MALLOC_TYPE_GRID_TABLE
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[19389,1],0] Exit code: 42
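As a sanity check on the number in the error: -2147483648 is exactly -2^31, which is what a byte count of 2^31 looks like once it has been stored in a signed 32-bit int. The sketch below reproduces that value under stated assumptions. The group size and group count are hypothetical numbers picked so that the product is 2^31; they are not taken from Ray's source.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical numbers for illustration only; not taken from Ray.
static const uint64_t kGroupBytes = 32;            // assumed size of one hash table group
static const uint64_t kNumberOfGroups = 67108864;  // assumed group count (2^26)

// Byte count computed in 64 bits: 32 * 2^26 = 2^31 = 2147483648.
inline uint64_t requiredBytes64() {
    return kGroupBytes * kNumberOfGroups;
}

// The same count narrowed to a signed 32-bit int, as happens when the value
// is passed through an int variable or an int parameter.
// Note: this out-of-range conversion is implementation-defined before C++20;
// in practice compilers on x86-64 wrap modulo 2^32.
inline int32_t requiredBytes32() {
    return static_cast<int32_t>(requiredBytes64());
}
```

With a wrapping conversion, `requiredBytes32()` yields -2147483648 while `requiredBytes64()` still holds 2147483648, which matches the size reported in the exception.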
This looks like a memory issue, but we could see that not all of the 2 TB of RAM was in use. We made the following changes to Ray:
. Compiled with the Intel compiler rather than the GNU compiler (i-compilers 15.0.2 and intelmpi 5.0.3)
. Compiled the code with the flag -mcmodel=medium. The full command was::
make PREFIX=/afs/<your_preferred_install_directory> MAXKMERLENGTH=128 MPICXX=mpiicpc HAVE_LIBZ=y MPI_IO=y CXXFLAGS='-O3 -std=c++98 -Wall -g -mcmodel=medium'
. Changed line 571 in RayPlatform/RayPlatform/structures/MyHashTable.h
size_t requiredBytes=sizeof(MyHashTableGroup<KEY,VALUE>)*(size_t)m_numberOfGroups;
. In RayPlatform/RayPlatform/memory/allocator.h
Added #include <stddef.h>
. In RayPlatform/RayPlatform/memory/allocator.h at line 28
void*__Malloc(size_t c,const char*description,bool show);
. In RayPlatform/RayPlatform/memory/allocator.cpp at line 36
void*__Malloc(size_t c,const char*description,bool show){
. In RayPlatform/RayPlatform/memory/allocator.cpp at line 56
printf("%s %i\t%s\t%zu bytes, ret\t%p\t%s\n",__FILE__,__LINE__,__func__,c,a,description);
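Taken together, the source edits above keep the byte count in a 64-bit type all the way from the multiplication to malloc and the log line. Here is a minimal sketch of that idea, assuming a 64-bit platform. The names `Group`, `bytesForGroups` and `allocateBytes` are hypothetical stand-ins, not Ray's actual classes or functions.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical stand-in for MyHashTableGroup<KEY,VALUE>; the layout is assumed.
struct Group {
    void* slots;
    uint64_t bitmap;
};

// Cast the count to size_t before multiplying, so the product is computed
// and stored in size_t instead of being truncated to a 32-bit int.
inline size_t bytesForGroups(uint64_t numberOfGroups) {
    return sizeof(Group) * static_cast<size_t>(numberOfGroups);
}

// Allocation wrapper that takes size_t, mirroring the changed __Malloc signature.
// %zu is the printf specifier for size_t, %p prints the returned pointer.
inline void* allocateBytes(size_t bytes, const char* description, bool show) {
    void* address = malloc(bytes);
    if (show) {
        printf("%zu bytes, ret\t%p\t%s\n", bytes, address, description);
    }
    return address;
}
```

For example, `bytesForGroups(200000000)` comes out above 2^31 - 1 on a 64-bit build, a value that an int parameter could not carry to malloc.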
For consistency, perhaps we should use uint64_t rather than size_t, since I see that other parts of the source code use it.
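One practical difference if uint64_t is used instead: on a 64-bit Linux build both types are 64 bits wide, but the printf specifier changes from %zu to the PRIu64 macro from <inttypes.h>. A small sketch of that, not taken from Ray; `formatBytes` is a hypothetical helper.

```cpp
// Older C++ toolchains only expose PRIu64 when this macro is defined
// before <inttypes.h> is included; it is harmless on newer ones.
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <string>

// Format a byte count held in a uint64_t. PRIu64 expands to the correct
// length modifier for uint64_t on the current platform.
inline std::string formatBytes(uint64_t bytes) {
    char buffer[32];  // large enough for 20 digits plus " bytes"
    snprintf(buffer, sizeof(buffer), "%" PRIu64 " bytes", bytes);
    return std::string(buffer);
}

// True when size_t and uint64_t have the same width (expected on 64-bit builds).
inline bool sizeTypesMatch() {
    return sizeof(size_t) == sizeof(uint64_t);
}
```

Either type holds the byte count on such a platform, so the choice is mostly about matching the rest of the code base and keeping the format strings consistent.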
The assembly has now been running for 18 days and has not produced any errors so far. Do you have any thoughts on this?
With kind regards, Henric Zazzi