Crash on dinucleotide error correction

Open AntonBankevich opened this issue 2 years ago • 1 comments

I have also seen this error a few times, but it seems nondeterministic. For example, this run crashed:

04:32:08 73.9Gb  INFO: Applying changes to the graph
05:15:48 105Gb  INFO: Collecting and storing read suffixes
05:36:48 108.3Gb  INFO: Correcting dinucleotide errors in reads
Child process crashed

while this one made it past that step (it is still running):

02:54:37 70.8Gb  INFO: Applying changes to the graph
03:34:20 102.6Gb  INFO: Collecting and storing read suffixes
03:51:56 108.8Gb  INFO: Correcting dinucleotide errors in reads
05:50:43 108.8Gb  INFO: Applying corrections to reads
06:00:06 109.2Gb  INFO: Applied correction to 302606 reads
06:00:07 109.2Gb  INFO: Corrected 302606 dinucleotide sequences
06:00:07 109.2Gb  INFO: Marking reliable edges
06:00:29 109.2Gb  INFO: Marked 1017912 edges in 248165 paths as reliable
06:00:30 109.2Gb  INFO: Correcting low covered regions in reads with K = 800
08:31:43 110.4Gb  INFO: Applying corrections to reads
08:59:50 111.5Gb  INFO: Applied correction to 982484 reads
08:59:50 111.5Gb  INFO: Corrected low covered regions in 982484 reads with K = 800
08:59:50 111.5Gb  INFO: Applying changes to the graph
09:43:54 137.9Gb  INFO: Marking reliable edges
09:44:02 137.9Gb  INFO: Marked 116356 edges in 37311 paths as reliable
09:44:02 137.9Gb  INFO: Correcting low covered regions in reads with K = 2000
11:43:52 137.9Gb  INFO: Applying corrections to reads
11:57:48 137.9Gb  INFO: Applied correction to 101111 reads
11:57:49 137.9Gb  INFO: Corrected low covered regions in 101111 reads with K = 2000
11:57:49 137.9Gb  INFO: Applying changes to the graph
12:53:23 156.8Gb  INFO: Correcting dinucleotide errors in reads

I've seen this a few times: re-running exactly the same command sometimes crashes with the "Child process crashed" error, and at different points in the run. This is on an HPC cluster, so the runs may land on different nodes with different CPUs. I haven't yet made it past the first error correction (either because of a crash or a 24-hour wall-time limit), so I'm hoping the current 120-hour job will get further. The sample is trio-binned data, so I can share the FASTQ files if wanted.

Originally posted by @ASLeonard in https://github.com/AntonBankevich/LJA/issues/14#issuecomment-1058939752

AntonBankevich avatar Mar 21 '22 19:03 AntonBankevich

As an update, the longer job did finish and reached the end of LJA. I checked the two logs, and they were identical up to the point of the crash, so there is no obvious reason why it would crash sometimes and finish fine at other times. The jobs ran on different nodes, but both nodes should support the same CPU instructions, so that is unlikely to be the cause.

ASLeonard avatar Mar 23 '22 10:03 ASLeonard
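
For anyone wanting to repeat the log comparison described in the last comment, here is a minimal Python sketch. It is not part of LJA. It assumes log lines look like the excerpts in this thread (`HH:MM:SS <memory>Gb  INFO: <message>`), and the names `messages` and `first_divergence` are made up for illustration. Timestamps and memory figures differ between runs, so they are dropped, and only the level and message of each line are compared.

```python
import re
from itertools import zip_longest

# Assumed line shape, inferred from the excerpts in this thread:
#   "HH:MM:SS <memory>Gb  LEVEL: <message>"
LINE_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}\s+[\d.]+Gb\s+(?P<level>\w+):\s+(?P<msg>.*)$"
)


def messages(lines):
    """Reduce log lines to (level, message) pairs, ignoring time and memory."""
    out = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        m = LINE_RE.match(text)
        if m:
            out.append((m.group("level"), m.group("msg")))
        else:
            # Lines without the usual prefix, e.g. "Child process crashed"
            out.append(("RAW", text))
    return out


def first_divergence(log_a, log_b):
    """Return (index, entry_a, entry_b) at the first difference, or None."""
    a, b = messages(log_a), messages(log_b)
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, x, y
    return None
```

Typical use would be `first_divergence(open("crashed.log"), open("finished.log"))`, where both file names are placeholders. A result pointing at the "Child process crashed" line confirms that the two runs reported the same steps up to the crash.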