MMseqs2 icon indicating copy to clipboard operation
MMseqs2 copied to clipboard

SAM output format @SQ headers

Open schorlton-bugseq opened this issue 11 months ago • 2 comments

Thanks for your hard work on this tool!

I noticed when using easy-search with v17-b804f and setting --format-mode 1 (SAM output), the @SQ header lines of the SAM file are only populated for reference sequences which had alignments. While this is perfectly valid, by dropping the reference sequences without alignments, it precludes and complicates some downstream analyses. For example, samtools depth -aa doesn't know about the reference sequences without alignments, and samtools cat messes up the reference sequence names for two SAMs with different headers, even though the reference file may have been the same to generate them.

Would it be possible to output the @SQ lines for all lines in the input sequence when outputting in SAM format? Thanks for your consideration!

schorlton-bugseq avatar Jan 31 '25 01:01 schorlton-bugseq

Could you give me an example where this happens? I am not sure I understand whats wrong

milot-mirdita avatar Feb 06 '25 10:02 milot-mirdita

Sure! This is a mock example to show, I understand it is somewhat ridiculous.

>query
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>target
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
mmseqs easy-search --format-mode 1 --search-type 2 query.fna target.fna result.sam tmp
cat result.sam

@HD	VN:1.4	SO:queryname
samtools sort -o sorted.bam result.sam
samtools index sorted.bam
samtools depth -aa sorted.bam # yields empty result

If result.sam looked like:

@HD	VN:1.4	SO:queryname
@SQ	SN:target	LN:77

(and possibly even had the unmapped read!), the samtools depth from above would have produced output.

schorlton-bugseq avatar Feb 06 '25 16:02 schorlton-bugseq