hifiasm crashes with segfault on a toy Hi-C dataset

Open sidorov-si opened this issue 2 years ago • 8 comments

Dear hifiasm team,

I'm developing an nf-core module for hifiasm, and I've tested haplotype phasing with hifiasm using a toy set of PacBio HiFi reads and a small set of Hi-C reads (please find attached). The HiFi reads come from a child genome (GIAB's HG002, SRR10382244), and the Hi-C reads come from normal human lung tissue (SRR13061060) and were selected so that they map to the HiFi reads.

When I run the following command with hifiasm v0.15.4-r343

time hifiasm \
    -o test.asm \
    --h1 SRR13061060_10000reads_mapped_1.fastq \
    --h2 SRR13061060_10000reads_mapped_2.fastq \
    SRR10382244_subset.fastq

in a conda env on my Mac, it fails in a short time with Segmentation fault: 11 (please see the full run log attached).

However, it still produces some output:

-rw-r--r--  1 sidoros  1934034978   4.4M  7 Jul 16:51 test.asm.hic.tlb.bin
-rw-r--r--  1 sidoros  1934034978   4.2K  7 Jul 16:51 test.asm.hic.p_ctg.lowQ.bed
-rw-r--r--  1 sidoros  1934034978   4.9K  7 Jul 16:51 test.asm.hic.p_ctg.noseq.gfa
-rw-r--r--  1 sidoros  1934034978   579K  7 Jul 16:51 test.asm.hic.p_ctg.gfa
-rw-r--r--  1 sidoros  1934034978   5.3K  7 Jul 16:51 test.asm.hic.p_utg.lowQ.bed
-rw-r--r--  1 sidoros  1934034978   5.8K  7 Jul 16:51 test.asm.hic.p_utg.noseq.gfa
-rw-r--r--  1 sidoros  1934034978   686K  7 Jul 16:51 test.asm.hic.p_utg.gfa
-rw-r--r--  1 sidoros  1934034978   5.3K  7 Jul 16:51 test.asm.hic.r_utg.lowQ.bed
-rw-r--r--  1 sidoros  1934034978   5.8K  7 Jul 16:51 test.asm.hic.r_utg.noseq.gfa
-rw-r--r--  1 sidoros  1934034978   686K  7 Jul 16:51 test.asm.hic.r_utg.gfa
-rw-r--r--  1 sidoros  1934034978    11K  7 Jul 16:51 test.asm.ovlp.reverse.bin
-rw-r--r--  1 sidoros  1934034978    44K  7 Jul 16:51 test.asm.ovlp.source.bin
-rw-r--r--  1 sidoros  1934034978   934K  7 Jul 16:51 test.asm.ec.bin

What could be the reason for the segfault?

Thank you, Slava

hifiasm_output.tar.gz SRR10382244_subset.fastq.gz SRR13061060_10000reads_mapped_1.fastq.gz SRR13061060_10000reads_mapped_2.fastq.gz hifiasm_run_log.txt

sidorov-si avatar Jul 07 '21 14:07 sidorov-si

Let me have a look at it. But for this example, the coverage is too low for assembly.

chhylp123 avatar Jul 07 '21 15:07 chhylp123

Thank you @chhylp123! In terms of coverage, do you mean the HiFi reads or the Hi-C reads? I selected the HiFi reads so that they all map to the same contig, so I hoped that they could be assembled.

sidorov-si avatar Jul 07 '21 16:07 sidorov-si

The k-mer plot looks weird. Normal HiFi data should have a k-mer plot like this one: https://github.com/chhylp123/hifiasm/issues/49#issue-729106823

chhylp123 avatar Jul 08 '21 12:07 chhylp123

What do they look like? Maybe it's just because these are only 204 HiFi reads mapping to a particular HG002 contig, and on the whole HG002 PacBio HiFi run the k-mer profile would be different?

sidorov-si avatar Jul 08 '21 13:07 sidorov-si

Yes, I think so. So the coverage does not look sufficient. But hifiasm shouldn't crash even in this rare case; I will have a look at it. I just recommend that you test with enough coverage.
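
As a rough illustration of what "enough coverage" means, average depth can be approximated as the total number of read bases divided by the length of the region the reads come from. The sketch below is a back-of-the-envelope check only, not how hifiasm itself judges coverage (this thread refers to the k-mer plot for that). The function names and the `target_length` parameter are hypothetical, and the code assumes plain 4-line FASTQ records.

```python
import gzip


def total_bases(fastq_path):
    """Sum sequence lengths in a FASTQ file (plain or .gz), assuming 4-line records."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    total = 0
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each record is the sequence
                total += len(line.strip())
    return total


def estimate_coverage(fastq_path, target_length):
    """Approximate average depth: total read bases / length of the target region.

    target_length is an assumption the caller must supply, for example the
    length in bases of the contig the reads were selected from.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return total_bases(fastq_path) / target_length
```

For the toy data in this issue, one would call `estimate_coverage("SRR10382244_subset.fastq", target_length)` with the length of the contig the 204 reads map to.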

chhylp123 avatar Jul 08 '21 14:07 chhylp123

@sidorov-si You can assemble the whole HG002 HiFi read set, or just one chromosome, and then grep the HiFi reads listed in the *.p_utg.noseq.gfa for pipeline testing. I have tried assembling a small set of HiFi reads (~200 reads), and it can produce an assembly.
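
The grep step could be sketched roughly as follows. This assumes that the noseq GFA lists each read on a tab-separated "A" line with the read name in the fifth column, as described in the hifiasm output format documentation, and that the FASTQ uses plain 4-line records. The function names `reads_in_gfa` and `filter_fastq` are hypothetical helpers, not part of hifiasm.

```python
def reads_in_gfa(gfa_path):
    """Collect read names from 'A' lines of a hifiasm GFA file.

    Assumed layout (tab-separated):
    A <unitig> <start> <strand> <readName> <readStart> <readEnd> ...
    """
    names = set()
    with open(gfa_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "A" and len(fields) >= 5:
                names.add(fields[4])
    return names


def filter_fastq(fastq_in, fastq_out, keep):
    """Copy only the records whose read name is in `keep`; return how many were kept."""
    kept = 0
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            # The read name is the first word after '@' on the header line.
            parts = record[0][1:].split()
            if parts and parts[0] in keep:
                dst.writelines(record)
                kept += 1
    return kept
```

The set returned by `reads_in_gfa` for one unitig graph can then be passed to `filter_fastq` to write a small FASTQ for pipeline testing.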

baozg avatar Jul 08 '21 14:07 baozg

Thank you @chhylp123! So, I'm using 204 reads that map to one contig from the whole HG002 assembly produced in your paper. How do you estimate the coverage?

sidorov-si avatar Jul 08 '21 15:07 sidorov-si

Sorry for the delay. This bug has been fixed in v0.15.5 (see: https://github.com/chhylp123/hifiasm/releases/tag/0.15.5).

chhylp123 avatar Jul 26 '21 04:07 chhylp123