SAM output format @SQ headers
Thanks for your hard work on this tool!
I noticed when using easy-search with v17-b804f and setting --format-mode 1 (SAM output), the @SQ header lines of the SAM file are only populated for reference sequences which had alignments. While this is perfectly valid, by dropping the reference sequences without alignments, it precludes and complicates some downstream analyses. For example, samtools depth -aa doesn't know about the reference sequences without alignments, and samtools cat messes up the reference sequence names for two SAMs with different headers, even though the reference file may have been the same to generate them.
Would it be possible to output the @SQ lines for all lines in the input sequence when outputting in SAM format? Thanks for your consideration!
Could you give me an example where this happens? I am not sure I understand whats wrong
Sure! This is a mock example to show, I understand it is somewhat ridiculous.
>query
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>target
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
mmseqs easy-search --format-mode 1 --search-type 2 query.fna target.fna result.sam tmp
cat result.sam
@HD VN:1.4 SO:queryname
samtools sort -o sorted.bam result.sam
samtools index sorted.bam
samtools depth -aa sorted.bam # yields empty result
If result.sam looked like:
@HD VN:1.4 SO:queryname
@SQ SN:target LN:77
(and possibly even had the unmapped read!), the samtools depth from above would have produced output.