Jellyfish
Memory allocation error due to large hash leaves behind orphaned generator processes
- Jellyfish 2.3.0
- Clean install of Ubuntu 18.04 LTS
- gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Situation
When counting k-mers with generators and requesting a hash size larger than the available memory, Jellyfish crashes. For example, on a 4GB VM the following command will SIGABRT after attempting to allocate 8GB for the hash:
jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf
Outcome
Jellyfish faults with a SIGABRT; however, the child process tree belonging to the generators becomes orphaned and is subsequently adopted by systemd.
Console error:
terminate called after throwing an instance of 'jellyfish::large_hash::array_base<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, jellyfish::large_hash::unbounded_array<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, allocators::mmap> >::ErrorAllocation'
what(): Failed to allocate 8000000000 bytes of memory
For a single-threaded generator, this leaves behind the following processes, where PID 1775 is systemd:
UID PID PPID C STIME TTY TIME CMD
ubuntu 20020 20010 0 01:14 pts/0 00:00:05 zsh
ubuntu 52084 1775 0 12:32 pts/0 00:00:00 jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf
ubuntu 52085 52084 0 12:32 pts/0 00:00:00 jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf
ubuntu 52153 20020 0 12:34 pts/0 00:00:00 ps -f
Solution
Looking at the code, count_main.cc sets up a sigaction for SIGTERM. Adding an identical sigaction for SIGABRT results in one of these orphans being cleaned up, but not both. Someone more conversant with the codebase could likely extend this to properly shut down the generator_manager.
E.g., with the SIGABRT sigaction added:
UID PID PPID C STIME TTY TIME CMD
ubuntu 20020 20010 0 01:14 pts/0 00:00:05 zsh
ubuntu 52893 1775 0 12:58 pts/0 00:00:00 .local/bin/jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf
ubuntu 52899 20020 0 12:58 pts/0 00:00:00 ps -f
I found that prctl could be used on Linux to clean up children when their parent process dies, but I later learned that this system call is not available on OSX.
Out of curiosity, I tried to reproduce the problem on OSX. Testing on a non-virtual Mac with 32GB of physical memory and a default swap of 11G, I could not trigger the allocation error even with a hash size of 2000G.
So I suppose this error mode is unlikely in practice and more a product of a tiny 4GB virtual machine.
I have made a fork of the codebase if you wish to see the small change using prctl. Perhaps there is an equivalent mechanism on OSX, but in my search I could not find any discussion of one.