LJA icon indicating copy to clipboard operation
LJA copied to clipboard

child process crashed

Open tcb72 opened this issue 2 years ago • 8 comments

Using latest version of LJA (git cloned, cd into LJA, compiled with cmake version 3.18.2, gcc 8.2.0) on a Linux-based cluster, it looks like it runs all the way until it attempts to uncompress homopolymers. The last few lines of the job log file:

00:06:31 161Mb  INFO: Exporting remaining active transitions
00:06:31 161Mb  INFO: Export to Dot
00:06:31 161Mb  INFO: Export to GFA and compressed contigs
00:06:33 210Mb  INFO: Finished repeat resolution
00:06:33 2Mb  INFO: Performing polishing and homopolymer uncompression
00:06:33 7Mb  INFO: Aligning reads back to assembly
00:06:38 1.1Gb  INFO: Finished alignment.
00:06:38 1.1Gb  INFO: Printing alignments to "lja_assembly/uncompressing/alignments.txt"
00:06:39 1.5Gb  INFO: Reading and processing initial reads from ["hifi_data/hifi_reads.fasta"]
00:07:43 2.8Gb  INFO: Uncompressing homopolymers in contigs
Child process crashed

And the last few lines of dbg.log

00:07:31 2.8Gb  TRACE: Processed 230000 original reads
00:07:31 2.8Gb  TRACE: Processed 231000 original reads
00:07:31 2.8Gb  TRACE: Processed 232000 original reads
00:07:31 2.8Gb  TRACE: Processed 233000 original reads
00:07:31 2.8Gb  TRACE: Processed 234000 original reads
00:07:31 2.8Gb  TRACE: Processed 235000 original reads
00:07:31 2.8Gb  TRACE: Processed 236000 original reads
00:07:43 2.8Gb  TRACE: Processed final batch of 22540 compressed reads
00:07:43 2.8Gb  INFO: Uncompressing homopolymers in contigs
00:07:43 2.8Gb  TRACE: Generating consensus for contig 104

core dump

Core was generated by `LJA/bin/lja -t 28 -o lja_assembly --reads hifi_data/hifi_reads_aligned_to_refer'.
Program terminated with signal 4, Illegal instruction.
#0  0x000000000057e4a6 in ?? ()
(gdb) bt
#0  0x000000000057e4a6 in ?? ()
#1  0x0000000000e90e10 in ?? ()
#2  0x00000000000000c9 in ?? ()
#3  0x000000000000001c in ?? ()
#4  0x0000000000555f79 in ?? ()
#5  0x00000000000000bf in ?? ()
#6  0x000000000002ef71 in ?? ()
#7  0x000000000255e930 in ?? ()
#8  0x00002aab18175390 in ?? ()
#9  0x00007fffffffa590 in ?? ()
#10 0x000000000255e880 in ?? ()
#11 0x00007fffffff8a80 in ?? ()
#12 0x00000000000000c9 in ?? ()
#13 0x00000000000869a5 in ?? ()
#14 0x00002aab1807d550 in ?? ()
#15 0x00002aab1814c9b0 in ?? ()
#16 0x00002aab18157aa0 in ?? ()
#17 0x00002aab18157aa0 in ?? ()
#18 0x0000000000000014 in ?? ()
#19 0x0000000000000000 in ?? ()

I'm using the following script (the genome is haploid:)

LJA/bin/lja -t 28 -o lja_assembly --reads hifi_data/hifi_reads.fasta

This doesn't seem to be a reads-specific error, as the same error in the same place occurred on two separate genomes. I'm not sure what other information I can give to help -- the log files aren't too helpful for troubleshooting.

I don't think it's a memory issue either because when I look at the job information after the fact, it says exit status 1. Usually OOM errors are exit status 137 (and I set memory to 300 Gb, so it shouldn't be an issue.)

tcb72 avatar Feb 07 '22 22:02 tcb72

Hi! Thank you for your interest in LJA and sorry for the crash. Unfortunately this is the first time I see such problem so I can not provide a quick fix. Would it be possible for you to share your data with me?

AntonBankevich avatar Feb 07 '22 23:02 AntonBankevich

Unfortunately I cannot, and I don't think it would be any help because this is happening with HiFi reads of different genomes. I'm 100% confident the FASTA files that contain the CCS reads are not corrupted or anything -- it works running through hifiasm, hicanu, Flye, etc.

However, using gdb I was able to parse the core dump file a bit more and will help troubleshoot:

Program terminated with signal 4, Illegal instruction.
#0  0x000000000057e4a6 in _M_create_storage (__n=<optimized out>, this=<optimized out>)
    at /LJA/src/lib/spoa/src/graph.cpp:65
65	Graph::Graph()

And the backtrace:

#0  0x000000000057e4a6 in _M_create_storage (__n=<optimized out>, this=<optimized out>)
    at /LJA/src/lib/spoa/src/graph.cpp:65
#1  _Vector_base (__a=..., __n=<optimized out>, this=<optimized out>) at /cm/local/apps/gcc/8.2.0/include/c++/8.2.0/bits/stl_vector.h:260
#2  vector (__n=256, __a=..., __value=<optimized out>, this=0x2aaaae7afd78) at /cm/local/apps/gcc/8.2.0/include/c++/8.2.0/bits/stl_vector.h:429
#3  spoa::Graph::Graph() () at/LJA/src/lib/spoa/src/graph.cpp:73
#4  0x0000000000555f79 in MSAConsensus (this=0x255e880, logger=..., s=...)
    at /LJA/src/projects/polishing/homopolish.cpp:173
#5  ContigInfo::GenerateConsensus[abi:cxx11](logging::Logger&) (this=<optimized out>, logger=...)
    at /LJA/src/projects/polishing/homopolish.cpp:257
#6  0x00002aaaab5812de in ?? () from /cm/local/apps/gcc/8.2.0/lib64/libgomp.so.1
#7  0x00002aaaab9b8e25 in start_thread () from /lib64/libpthread.so.0
#8  0x00002aaaabccbbad in __opensock () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

And the detailed backtrace:

#0  0x000000000057e4a6 in _M_create_storage (__n=<optimized out>, this=<optimized out>)
    at /LJA/src/lib/spoa/src/graph.cpp:65
No locals.
#1  _Vector_base (__a=..., __n=<optimized out>, this=<optimized out>) at /cm/local/apps/gcc/8.2.0/include/c++/8.2.0/bits/stl_vector.h:260
No locals.
#2  vector (__n=256, __a=..., __value=<optimized out>, this=0x2aaaae7afd78) at /cm/local/apps/gcc/8.2.0/include/c++/8.2.0/bits/stl_vector.h:429
No locals.
#3  spoa::Graph::Graph() () at /LJA/src/lib/spoa/src/graph.cpp:73
No locals.
#4  0x0000000000555f79 in MSAConsensus (this=0x255e880, logger=..., s=...)
    at /LJA/src/projects/polishing/homopolish.cpp:173
        alignment_engine = {_M_t = {
            _M_t = {<std::_Tuple_impl<0, spoa::AlignmentEngine*, std::default_delete<spoa::AlignmentEngine> >> = {<std::_Tuple_impl<1, std::default_delete<spoa::AlignmentEngine> >> = {<std::_Head_base<1, std::default_delete<spoa::AlignmentEngine>, true>> = {<std::default_delete<spoa::AlignmentEngine>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, spoa::AlignmentEngine*, false>> = {
                  _M_head_impl = 0x2aab1807d550}, <No data fields>}, <No data fields>}}}
        only_good = <optimized out>
        pref_remove = <optimized out>
        suf_remove = <optimized out>
        med_len = <optimized out>
        graph = {num_codes_ = 0, coder_ = {<std::_Vector_base<int, std::allocator<int> >> = {
              _M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>},
                _M_start = 0x2aab18159460, _M_finish = 0x0, _M_end_of_storage = 0x2aab18159860}}, <No data fields>},
          decoder_ = {<std::_Vector_base<int, std::allocator<int> >> = {
              _M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>},
                _M_start = 0x2aab1815cd60, _M_finish = 0x1615, _M_end_of_storage = 0x1615}}, <No data fields>},
          sequences_ = {<std::_Vector_base<spoa::Graph::Node*, std::allocator<spoa::Graph::Node*> >> = {
              _M_impl = {<std::allocator<spoa::Graph::Node*>> = {<__gnu_cxx::new_allocator<spoa::Graph::Node*>> = {<No data fields>}, <No data fields>}, _M_start = 0x100000, _M_finish = 0x2aab18162290, _M_end_of_storage = 0x1eca}}, <No data fields>},
          nodes_ = {<std::_Vector_base<std::unique_ptr<spoa::Graph::Node, std::default_delete<spoa::Graph::Node> >, std::allocator<std::unique_ptr<spoa::Graph::Node, std::default_delete<spoa::Graph::Node> > > >> = {
              _M_impl = {<std::allocator<std::unique_ptr<spoa::Graph::Node, std::default_delete<spoa::Graph::Node> > >> = {<__gnu_cxx::new_allocator<std::unique_ptr<spoa::Graph::Node, std::default_delete<spoa::Graph::Node> > >> = {<No data fields>}, <No data fields>}, _M_start = 0x1eca,
                _M_finish = 0x54414741434154, _M_end_of_storage = 0x2aaaae7afde0}}, <No data fields>},
          edges_ = {<std::_Vector_base<std::unique_ptr<spoa::Graph::Edge, std::default_delete<spoa::Graph::Edge> >, std::allocator<std::unique_ptr<spoa::Graph::Edge, std::default_delete<spoa::Graph::Edge> > > >> = {
              _M_impl = {<std::allocator<std::unique_ptr<spoa::Graph::Edge, std::default_delete<spoa::Graph::Edge> > >> = {<__gnu_cxx::new_allocator<std::unique_ptr<spoa::Graph::Edge, std::default_delete<spoa::Graph::Edge> > >> = {<No data fields>}, <No data fields>}, _M_start = 0x0,
                _M_finish = 0x0, _M_end_of_storage = 0x54415441435441}}, <No data fields>},
          rank_to_node_ = {<std::_Vector_base<spoa::Graph::Node*, std::allocator<spoa::Graph::Node*> >> = {
              _M_impl = {<std::allocator<spoa::Graph::Node*>> = {<__gnu_cxx::new_allocator<spoa::Graph::Node*>> = {<No data fields>}, <No data fields>}, _M_start = 0x7fffffff8f80, _M_finish = 0x3ee4, _M_end_of_storage = 0x7fffffff90c0}}, <No data fields>},
          consensus_ = {<std::_Vector_base<spoa::Graph::Node*, std::allocator<spoa::Graph::Node*> >> = {
              _M_impl = {<std::allocator<spoa::Graph::Node*>> = {<__gnu_cxx::new_allocator<spoa::Graph::Node*>> = {<No data fields>}, <No data fields>}, _M_start = 0x3ee4, _M_finish = 0x7fffffff9720, _M_end_of_storage = 0x198ca0}}, <No data fields>}}
        cov = <optimized out>
        all_len = <optimized out>
        good_cons = <optimized out>
        coverages = {<std::_Vector_base<unsigned int, std::allocator<unsigned int> >> = {
            _M_impl = {<std::allocator<unsigned int>> = {<__gnu_cxx::new_allocator<unsigned int>> = {<No data fields>}, <No data fields>},
---Type <return> to continue, or q <return> to quit---
              _M_start = 0x2aab1814c9b0, _M_finish = 0x2aab18157aa0, _M_end_of_storage = 0x2aab18157aa0}}, <No data fields>}
        consensus = {static npos = 18446744073709551615,
          _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
            _M_p = 0x2aab1814aba0 "\350\a"}, _M_string_length = 5662, {
            _M_local_buf = "\000\036\000\000\000\000\000\000\200\217\377\377\377\177\000", _M_allocated_capacity = 7680}}
#5  ContigInfo::GenerateConsensus[abi:cxx11](logging::Logger&) (this=<optimized out>, logger=...)
    at /LJA/src/projects/polishing/homopolish.cpp:257
        start_pos = 551333
        consensus = {static npos = 18446744073709551615,
          _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x0},
          _M_string_length = 7882, {_M_local_buf = "\001\000\000\000 \000\000\000\240\260\001\030\253*\000",
            _M_allocated_capacity = 137438953473}}
        i = 191
        ss = <incomplete type>
        debug = <incomplete type>
        total_count = <optimized out>
        cur_complex_ind = <optimized out>
        consensus = {static npos = 18446744073709551615,
          _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
            _M_p = 0x2aaaae7afbe0 "\017\374z\256\252*"}, _M_string_length = 38, {_M_local_buf = "`\315\025\030\253*\000\000L(ū\252*\000",
            _M_allocated_capacity = 46914331856224}}
#6  0x00002aaaab5812de in ?? () from /cm/local/apps/gcc/8.2.0/lib64/libgomp.so.1
No symbol table info available.
#7  0x00002aaaab9b8e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#8  0x00002aaaabccbbad in __opensock () from /lib64/libc.so.6
No symbol table info available.
#9  0x0000000000000000 in ?? ()
No symbol table info available.

tcb72 avatar Feb 07 '22 23:02 tcb72

Hi,

I have the same error or kind of error, after 4 hours ... 03:06:01 5Gb INFO: Export to GFA and compressed contigs 03:06:58 5.7Gb INFO: Finished repeat resolution 03:07:07 2Mb INFO: Performing polishing and homopolymer uncompression 03:07:42 0.5Gb INFO: Aligning reads back to assembly 03:17:28 20.7Gb INFO: Finished alignment. 03:17:28 20.7Gb INFO: Printing alignments to "lja_ec/uncompressing/alignments.txt" 03:18:04 35.3Gb INFO: Reading and processing initial reads from ["reads.fa"] 04:06:51 36.9Gb INFO: Uncompressing homopolymers in contigs Child process crashed

chklopp avatar Mar 02 '22 13:03 chklopp

Hi chklopp! Sorry to hear that. Unfortunately we still can not reproduce this problem on our data. Maybe you could share your data with me? it would be really helpful.

AntonBankevich avatar Mar 02 '22 16:03 AntonBankevich

No sorry, it is not my data and it is was just to test your code. We already have a good assembly for it with hifiasm.

chklopp avatar Mar 02 '22 16:03 chklopp

I also have seen this error a few times, but it seems pretty nondeterministic. For example this run crashed

04:32:08 73.9Gb  INFO: Applying changes to the graph
05:15:48 105Gb  INFO: Collecting and storing read suffixes
05:36:48 108.3Gb  INFO: Correcting dinucleotide errors in reads
Child process crashed

while this one made it past (currently still running)

02:54:37 70.8Gb  INFO: Applying changes to the graph
03:34:20 102.6Gb  INFO: Collecting and storing read suffixes
03:51:56 108.8Gb  INFO: Correcting dinucleotide errors in reads
05:50:43 108.8Gb  INFO: Applying corrections to reads
06:00:06 109.2Gb  INFO: Applied correction to 302606 reads
06:00:07 109.2Gb  INFO: Corrected 302606 dinucleotide sequences
06:00:07 109.2Gb  INFO: Marking reliable edges
06:00:29 109.2Gb  INFO: Marked 1017912 edges in 248165 paths as reliable
06:00:30 109.2Gb  INFO: Correcting low covered regions in reads with K = 800
08:31:43 110.4Gb  INFO: Applying corrections to reads
08:59:50 111.5Gb  INFO: Applied correction to 982484 reads
08:59:50 111.5Gb  INFO: Corrected low covered regions in 982484 reads with K = 800
08:59:50 111.5Gb  INFO: Applying changes to the graph
09:43:54 137.9Gb  INFO: Marking reliable edges
09:44:02 137.9Gb  INFO: Marked 116356 edges in 37311 paths as reliable
09:44:02 137.9Gb  INFO: Correcting low covered regions in reads with K = 2000
11:43:52 137.9Gb  INFO: Applying corrections to reads
11:57:48 137.9Gb  INFO: Applied correction to 101111 reads
11:57:49 137.9Gb  INFO: Corrected low covered regions in 101111 reads with K = 2000
11:57:49 137.9Gb  INFO: Applying changes to the graph
12:53:23 156.8Gb  INFO: Correcting dinucleotide errors in reads

I've seen this a few times where literally re-running the same command will sometimes crash with the Child process crashed error and at different times. This is on a HPC, so could be different nodes with different CPUs etc. I haven't ever made it past the first error correction (either due to crash or a 24 hour wall limit), so am hoping the current 120h job will make it further. It is trio binned data for this sample, so I can share the fastq if wanted.

ASLeonard avatar Mar 04 '22 08:03 ASLeonard

Hi @ASLeonard . Your issue is different from what others have in this thread so I created a new issue and I will work on it separately.

AntonBankevich avatar Mar 21 '22 19:03 AntonBankevich

Hi @AntonBankevich. I also have the "child process crashed " error when I run LJA with 100X HiFi reads of human. I report as following:

My command is:

export PATH={dir_lja}:$PATH
lja -t 48 --diploid -o 02_LJA/child.HiFi.LJA_default-2.0 --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210327_091530.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210328_153009.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210403_042108.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64236_210323_112137.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64236_210330_051441.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210318_093818.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210322_170722.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210323_233537.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210325_060339.ccs.fq.gz --reads /data/DATA/RAWDATA/HiFi/fastqs/m64154_210401_215258.ccs.fq.gz      

part of log:

46:09:26 56.3Gb  INFO: Collected 863 overlaps. Looking for unique overlaps
46:09:27 56.3Gb  INFO: Collected 44 unique overlaps.
46:09:27 56.3Gb  INFO: Adding new connections to the graph
46:15:01 56.3Gb  INFO: Correcting tips using reliable edge marks
46:16:47 56.3Gb  INFO: Applying corrections to reads
46:16:48 56.3Gb  INFO: Applied correction to 114731 reads
46:16:57 56.3Gb  INFO: Applying changes to the graph
46:25:31 56.3Gb  INFO: Could not correct 181425 reads. They will be removed.
46:25:32 56.3Gb  INFO: Uncorrected reads were removed.
46:25:32 56.3Gb  INFO: Applying changes to the graph
Child process crashed

Please see the details from following two log files. std.log dbg.log.zip

Best, Peng

PengJia6 avatar Sep 02 '22 01:09 PengJia6