
Chromap BED output is not compatible with the SALSA2 input BED format?

Open bbalog87 opened this issue 2 years ago • 3 comments

Hi,

I've tried Chromap for mapping Hi-C reads for scaffolding with SALSA2.

I generated a bed output with: chromap --preset hic -x index -r cfish.ref.fa -1 SL01_R1.fq -2 SL01_R2.fq --BED -o ctfish.aln.bed

However, the output bed file is not compatible with the expected bed format in SALSA2.

A workaround is to generate a SAM output with chromap --SAM first, then convert the sam/bam file to bed format as expected by SALSA2.
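
For illustration, here is a minimal Python sketch of that SAM-to-BED conversion. It is not part of Chromap or SALSA2; the function names `sam_line_to_bed` and `sam_to_sorted_bed` are hypothetical, and it assumes uncompressed SAM text as input. In practice a tool such as bedtools `bamtobed` followed by a sort on the fourth column does the same job. The sketch only shows what that conversion involves: shifting the 1-based SAM position to a 0-based BED start, deriving the end from the CIGAR string, and appending `/1` or `/2` to the read name.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_line_to_bed(line):
    """Convert one SAM alignment line to a BED6 tuple, or None if unmapped."""
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 0x4 or cigar == "*":
        return None  # unmapped read: nothing to report
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar)
                  if op in REF_CONSUMING)
    start = pos - 1  # SAM is 1-based; BED is 0-based, half-open
    # 0x40 marks the first read of a pair, 0x80 the second.
    mate = "/1" if flag & 0x40 else "/2" if flag & 0x80 else ""
    strand = "-" if flag & 0x10 else "+"
    return (chrom, start, start + ref_len, name + mate, mapq, strand)


def sam_to_sorted_bed(sam_lines):
    """Return tab-separated BED6 lines sorted by read name (column 4)."""
    records = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        rec = sam_line_to_bed(line)
        if rec is not None:
            records.append(rec)
    records.sort(key=lambda r: r[3])
    return ["\t".join(str(x) for x in r) for r in records]
```

Usage would look like `with open("aln.sam") as fh: bed_lines = sam_to_sorted_bed(fh)`, where `aln.sam` is a placeholder file name. Sorting happens in memory, so for large Hi-C datasets an external sort is likely the better choice.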

Best, Julien

bbalog87 avatar Nov 26 '21 14:11 bbalog87

What does SALSA2 require? What does its input BED look like?

lh3 avatar Nov 26 '21 15:11 lh3

SALSA2 expects a BED file sorted by read name, like this one:

ContigID         Start     End     ReadName/ID                                 XXX   Strand
ONT_Shasta_78    818203  818283  A00126:181:HMKLJDSX2:1:1101:10945:1000/1        60      +
ONT_Shasta_78    817912  817984  A00126:181:HMKLJDSX2:1:1101:10945:1000/1        60      -
ONT_Shasta_78    817835  817985  A00126:181:HMKLJDSX2:1:1101:10945:1000/2        60      +
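
To make the expected layout concrete, here is a small Python sketch that checks whether lines follow the six-column shape shown above. The function name `check_salsa_bed` is hypothetical, and the rules it enforces (six fields, start < end, strand `+` or `-`, names in non-decreasing string order) are inferred from the example rather than taken from SALSA2's own parser, so treat it as a rough sanity check only.

```python
def check_salsa_bed(lines):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    prev_name = None
    for i, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {i}: expected 6 fields, got {len(fields)}")
            continue
        chrom, start, end, name, score, strand = fields
        if not (start.isdigit() and end.isdigit() and int(start) < int(end)):
            problems.append(f"line {i}: bad interval {start}-{end}")
        if strand not in ("+", "-"):
            problems.append(f"line {i}: bad strand {strand!r}")
        # Assumption: plain string order on the name column; a locale-aware
        # sort may order some names differently.
        if prev_name is not None and name < prev_name:
            problems.append(f"line {i}: not sorted by read name")
        prev_name = name
    return problems
```

Run on the three data rows above, it reports no problems. A row with a different number of columns is flagged by the field-count check.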

The Chromap BED file looks something like this:

readID                               chrom1     pos1    chrom2 pos2 strand1 strand2 pair_type
A00126:181:HMKLJDSX2:1:1645:16215:17644 0       43      0       528     +       -       UU
A00126:181:HMKLJDSX2:1:1525:4634:24893  0       68      0       40309   +       -       UU
A00126:181:HMKLJDSX2:1:2521:32723:14512 0       81      0       1316    +       +       UU
A00126:181:HMKLJDSX2:1:2373:10746:28463 0       130     0       537     +       -       UU
A00126:181:HMKLJDSX2:1:1319:10393:15890 0       150     0       7313    +       -       UU

bbalog87 avatar Nov 26 '21 15:11 bbalog87

Because the two Hi-C reads in a pair usually align to positions far apart on the reference genome, you have to use "--TagAlign", which outputs the mapping for each read in a pair separately in BED format, instead of "--BED", which outputs one mapping for the two reads in the pair. We will document this more carefully.

I just tried generating Hi-C mapping output with "--TagAlign" and found that it worked on my test dataset. The only problem is that the mapping lengths and strands do not seem to fully match those in the pairs or SAM output. Hopefully, we will have a fix for this soon. Moreover, to stay fast and memory efficient, Chromap currently does not output read names in BED/TagAlign format, and we are not sure whether other tools ever use the read names. We can put pseudo-names there for now if necessary, and we plan to support read names in BED output later.

haowenz avatar Nov 27 '21 15:11 haowenz