Samtools merge on .sam files? Inconsistent input-order-dependent behaviour

Open RayHackett opened this issue 1 year ago • 3 comments

samtools 1.13 Using htslib 1.13+ds running on: ubuntu:22.04 container

I have two .sam files which I want to merge. They are name-sorted. Q1: There is no documented behaviour for samtools merge on .sam files. Documentation only mentions .bam files. Is samtools merge supposed to be used for .sam files too?

Q2: Assuming samtools merge can be used on .sam files, I noticed that the following four commands all yield files of different sizes:

  • `samtools merge -n -o merged.sam file1.sam file2.sam`
  • `samtools merge -n -o rev_merged.sam file2.sam file1.sam`
  • `samtools merge -n -o wildcard_merged.sam file*.sam`
  • `samtools view -Sh --no-PG file1.sam > viewmerged.sam;  samtools view -S file2.sam >> viewmerged.sam`

Surprisingly, input order seems to play a role! Using a wildcard gives yet another result. For all of these options, the number of sequence lines in the merged file equals the sum of the sequence lines in the parent files. The number of header lines, however, is reduced quite a bit. Can you explain this behaviour and advise me on which command to use? Thanks!

RayHackett avatar May 16 '24 16:05 RayHackett

Most (but not all) of the samtools tools will read and write SAM, BAM or CRAM files. samtools merge will handle any of these file formats.

I'm not sure how the order of input files affects the header size. Do any of the files look incorrect?

whitwham avatar May 16 '24 17:05 whitwham

The files look fine as far as I can tell, and samtools can read them. Still, the differences are not minor: for a ~10GB alignment file they amount to several hundred MB of header lines.

RayHackett avatar May 16 '24 20:05 RayHackett

Can you count the different header tags and see which tags have been added?
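
One way to get those counts is sketched below. It assumes plain-text SAM input like the files in this issue, and the function name `count_header_tags` is only illustrative. For BAM or CRAM input, the header would first need to be extracted with `samtools view -H`.

```shell
# Count header lines per record type (@HD, @SQ, @RG, @PG, @CO) in a SAM text file.
# Header lines start with '@' and come before all alignment records,
# so scanning stops at the first line that does not start with '@'.
count_header_tags() {
    awk -F'\t' '/^@/ { n[$1]++; next } { exit } END { for (t in n) print t, n[t] }' "$1" \
        | LC_ALL=C sort
}

# Usage with the file names from this issue (uncomment to run):
# for f in merged.sam rev_merged.sam wildcard_merged.sam viewmerged.sam; do
#     echo "== $f"
#     count_header_tags "$f"
# done
```

Comparing the per-tag counts across the four outputs should show which header record types account for any difference.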

whitwham avatar May 23 '24 12:05 whitwham

Sorry for the wait. I think I must have gotten a bit confused when writing this issue. The line counts are all identical for the three samtools merge outputs. Only viewmerged.sam, built by concatenating the output of samtools view, is missing some @RG and @PG lines.

For merged.sam, rev_merged.sam and wildcard_merged.sam the line counts for lines starting with @, starting with <sample_id> and lines starting with read are all the same. I must apologize for my previous oversight.

When merging two .sam files of 6.8G and 1.1G respectively, the merged files are between 7.3G and 7.6G in size. Can you reproduce a difference of this kind with any two alignment files?

RayHackett avatar May 27 '24 08:05 RayHackett

I merged 74G and 81G sam files in both orders. The resulting files differed in size by only 6 bytes out of 155G. I am not sure why you are seeing such a big difference in size.
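
To check whether two merged files hold the same alignment records regardless of record order or header content, a rough sketch like the following could be used. It assumes plain-text SAM files; `sam_body_checksum` is a hypothetical helper name, and sorting files of this size needs enough temporary disk space.

```shell
# Print a checksum and byte count for the alignment records of a SAM text file,
# ignoring header lines and the order in which records appear.
sam_body_checksum() {
    # Drop header lines, sort the records bytewise, then checksum the result.
    grep -v '^@' "$1" | LC_ALL=C sort | cksum
}

# Usage with the file names from this issue (uncomment to run):
# sam_body_checksum merged.sam
# sam_body_checksum rev_merged.sam
```

If the two outputs match, the record content is the same and any remaining size difference lies in the headers. If they differ, the byte count in the second column shows how far apart the record sections are.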

whitwham avatar May 30 '24 15:05 whitwham

Right, thanks for checking! I'll have to do more digging later. I'll close the issue for now. Again, I appreciate your responses.

RayHackett avatar May 31 '24 07:05 RayHackett