Samtools merge on .sam files? Inconsistent input-order-dependent behaviour
samtools 1.13 Using htslib 1.13+ds running on: ubuntu:22.04 container
I have two .sam files which I want to merge. They are name-sorted.
Q1: There is no documented behaviour for samtools merge on .sam files. Documentation only mentions .bam files. Is samtools merge supposed to be used for .sam files too?
Q2: Assuming samtools merge can be used on .sam files, I noticed that the following four commands all yield different files of different sizes:
- `samtools merge -n -o merged.sam file1.sam file2.sam`
- `samtools merge -n -o rev_merged.sam file2.sam file1.sam`
- `samtools merge -n -o wildcard_merged.sam file*.sam`
- `samtools view -Sh --no-PG file1.sam > viewmerged.sam; samtools view -S file2.sam >> viewmerged.sam`
Surprisingly, input order seems to play a role! Using wildcards gives yet another result. For all of these options, the number of sequence lines in the merged file equals the sum of the sequence lines in the two parent files. The header lines, however, get reduced quite a bit. Can you explain this behaviour and advise me on what to use? Thanks!
Most (but not all) of the samtools tools will read and write SAM, BAM or CRAM files. samtools merge will handle any of these file formats.
I'm not sure how the order of input files affects the header size. Do any of the files look incorrect?
The files look fine as far as I can tell, and samtools can read them. Still, the differences are not minor: for a ~10GB alignment file, the differences amount to several hundred MB of header lines.
Can you count the different header tags and see which tags have been added?
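One way to do that count is sketched below. It assumes plain-text SAM input, and the function name `count_header_tags` is only illustrative. It tallies header lines by record type (`@HD`, `@SQ`, `@RG`, `@PG`, `@CO`) and stops at the first alignment line, so it does not scan the whole file.

```shell
# Count SAM header lines per record type.
# Hypothetical helper: takes one plain-text SAM file, or reads stdin if none is given.
# For BAM/CRAM input, pipe the output of `samtools view -H` into it instead.
count_header_tags() {
    LC_ALL=C awk -F'\t' '
        !/^@/ { exit }                          # header lines come first; stop at the first alignment
        { n[$1]++ }                             # tally by record type in the first column
        END { for (t in n) print t "\t" n[t] }
    ' "${1:--}" | LC_ALL=C sort
}
```

Running it on each merged file and comparing the outputs (for example `count_header_tags merged.sam > a.txt; count_header_tags viewmerged.sam > b.txt; diff a.txt b.txt`) should show which record types differ.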
Sorry for the wait. I think I must have gotten a bit confused when writing this issue. The line counts are all identical. Only the file built by concatenating `samtools view` output loses some @RG and @PG lines.
For merged.sam, rev_merged.sam and wildcard_merged.sam, the counts of lines starting with @, lines starting with <sample_id> and lines starting with read are all the same. I must apologize for my previous oversight.
When merging two .sam files of 6.8G and 1.1G respectively, the merged files are between 7.3G and 7.6G in size. Is this kind of difference reproducible for you with any two alignment files?
I merged 74G and 81G sam files in both orders. The resulting files had only 6 bytes size difference out of 155G. I am not sure why you are seeing such a big difference in size.
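To narrow down where a size difference like this comes from, it may help to split each file's size into header bytes and alignment bytes. The sketch below assumes plain-text SAM with newline-terminated lines. The helper name `sam_size_breakdown` is made up for illustration, and it reads the entire file, so it will be slow on multi-GB inputs.

```shell
# Report line and byte counts for the header and alignment sections of a SAM file.
# Hypothetical helper: takes one plain-text SAM file, or reads stdin if none is given.
sam_size_breakdown() {
    LC_ALL=C awk '
        /^@/ { hl++; hb += length($0) + 1; next }   # header line; +1 for the newline
             { al++; ab += length($0) + 1 }         # alignment line
        END {
            printf "header_lines=%.0f header_bytes=%.0f alignment_lines=%.0f alignment_bytes=%.0f\n",
                   hl, hb, al, ab
        }
    ' "${1:--}"
}
```

If two merged files have the same line counts but different alignment byte totals, the per-read content differs (for example in optional tags), not the header.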
Right, thanks for checking! I'll have to do more digging later. I'll close the issue for now. Again, I appreciate your responses.