samtools
samtools copied to clipboard
Add "samtools checksum" command. Needs htslib#1850
This serves two purposes.
bamseqchksum
It's a replacement for biobambam's bamseqchksum tool (but about twice as fast), which is heavily used at Sanger. It should give compatible results with the -qv option.
This checksums sequence, name, quality and barcode tags in an order and orientation agnostic manner, to facilitate validation of no data loss between raw fastq (or unmapped crams) through alignment, markdup, sorting, and other processing operations to get to the final aligned bam/cram.
$ bamseqchksum inputformat=bam < ~/scratch/data/NA20800.chrom20.ILLUMINA.bwa.TSI.low_coverage.20101123.bam
### set count b_seq name_b_seq b_seq_qual b_seq_tags(BC,FI,QT,RT,TC)
all all 5074609 37d3bb27 7cb87cff 31b22362 37d3bb27
all pass 5074609 37d3bb27 7cb87cff 31b22362 37d3bb27
all 0 1 1 1 1
pass 0 1 1 1 1
ERR005381 all 333473 764d99c2 61cb8e25 2983721e 764d99c2
ERR005381 pass 333473 764d99c2 61cb8e25 2983721e 764d99c2
ERR005382 all 326713 51f5df68 49ac2949 8f70ed9 51f5df68
ERR005382 pass 326713 51f5df68 49ac2949 8f70ed9 51f5df68
ERR005383 all 356619 10d093ad 2d2878a 4b0b5812 10d093ad
ERR005383 pass 356619 10d093ad 2d2878a 4b0b5812 10d093ad
ERR005384 all 346758 3bd87b98 2d099332 533eb72d 3bd87b98
ERR005384 pass 346758 3bd87b98 2d099332 533eb72d 3bd87b98
ERR005385 all 351289 6928f20c 4efcdf7f 10e0ab02 6928f20c
ERR005385 pass 351289 6928f20c 4efcdf7f 10e0ab02 6928f20c
ERR005386 all 349519 4786000b 2b31a30f 7ed25ff5 4786000b
ERR005386 pass 349519 4786000b 2b31a30f 7ed25ff5 4786000b
ERR005387 all 331932 77fe880c 7609b384 19dd48a6 77fe880c
ERR005387 pass 331932 77fe880c 7609b384 19dd48a6 77fe880c
ERR005729 all 638777 705e0fe3 545e7116 767ea15f 705e0fe3
ERR005729 pass 638777 705e0fe3 545e7116 767ea15f 705e0fe3
ERR005730 all 668672 b820a01 791d771d 53cb5615 b820a01
ERR005730 pass 668672 b820a01 791d771d 53cb5615 b820a01
ERR005731 all 654943 17137ecb 64674261 41b826a9 17137ecb
ERR005731 pass 654943 17137ecb 64674261 41b826a9 17137ecb
ERR013025 all 715914 284721c5 513cad00 60c8ef16 284721c5
ERR013025 pass 715914 284721c5 513cad00 60c8ef16 284721c5
vs
samtools checksum ~/scratch/data/NA20800.chrom20.ILLUMINA.bwa.TSI.low_coverage.20101123.bam
# Checksum for file: /nfs/users/nfs_j/jkb/scratch/data/NA20800.chrom20.ILLUMINA.bwa.TSI.low_coverage.20101123.bam
# Aux tags: BC,FI,QT,RT,TC
# BAM flags: PAIRED,READ1,READ2
# Group QC count flag+seq +name +qual +aux combined
all all 5074609 37d3bb27 7cb87cff 31b22362 37d3bb27 37fece8e
ERR005730 all 668672 0b820a01 791d771d 53cb5615 0b820a01 05acae7f
ERR005729 all 638777 705e0fe3 545e7116 767ea15f 705e0fe3 1d95d784
ERR005386 all 349519 4786000b 2b31a30f 7ed25ff5 4786000b 26dcaaf2
ERR005381 all 333473 764d99c2 61cb8e25 2983721e 764d99c2 2e286b89
ERR005387 all 331932 77fe880c 7609b384 19dd48a6 77fe880c 43b22e93
ERR005382 all 326713 51f5df68 49ac2949 08f70ed9 51f5df68 469ad363
ERR005383 all 356619 10d093ad 02d2878a 4b0b5812 10d093ad 02a82ac4
ERR005731 all 654943 17137ecb 64674261 41b826a9 17137ecb 509aaa8a
ERR005384 all 346758 3bd87b98 2d099332 533eb72d 3bd87b98 0dd51d05
ERR013025 all 715914 284721c5 513cad00 60c8ef16 284721c5 3ffda5ab
ERR005385 all 351289 6928f20c 4efcdf7f 10e0ab02 6928f20c 2f0ec1ec
If we want the QC pass/fail, we can use -q for that:
samtools checksum /tmp/_.bam -q
# Checksum for file: /tmp/_.bam
# Aux tags: BC,FI,QT,RT,TC
# BAM flags: PAIRED,READ1,READ2
# Group QC count flag+seq +name +qual +aux combined
all all 1000000 095f615d 657c16e5 3aff0ac9 0aa9a779 7bca26fb
all pass 999793 25141c11 09589cc0 0d0d8bf5 45dc936c 7f98ccb2
all fail 207 6241cda7 0aad2788 7ff9617d 49005afd 15eb4515
1#9 all 1000000 095f615d 657c16e5 3aff0ac9 0aa9a779 7bca26fb
1#9 pass 999793 25141c11 09589cc0 0d0d8bf5 45dc936c 7f98ccb2
1#9 fail 207 6241cda7 0aad2788 7ff9617d 49005afd 15eb4515
Unlike bamseqchksum we report the fail as well as pass, but only if non-zero. (Unless -v is enabled where zero rows are reported, for easier parsing)
File format round trips
We also have a need to validate that conversion between various file types such as SAM to BAM to CRAM doesn't change anything, or at least nothing that we don't already know about (such as CIGAR =/X vs M).
This is the -a option, which is a catch-all for all the other necessary options to force secondary/supplementary, more columns, full BAM flags, all tags, file order and orientation validation, etc.
samtools checksum /tmp/_.bam -aq
# Checksum for file: /tmp/_.bam
# Aux tags: *,cF
# BAM flags: PAIRED,PROPER_PAIR,UNMAP,MUNMAP,REVERSE,MREVERSE,READ1,READ2,SECONDARY,QCFAIL,DUP,SUPPLEMENTARY
# Group QC count flag+seq +name +qual +aux +chr/pos +cigar +mate combined
all all 1000000 27864879 5e8f3da7 6590d6a9 01127390 04b70599 554465c2 682d5104 63313cc6
all pass 999793 6d7b3f0e 2c2e05f4 2b03a598 6c12d12f 19dc7110 784ee553 6f950f4c 19ea0ca5
all fail 207 3ffdc02d 2664fc17 46925c74 37e85373 416248db 2b359631 23ff8de9 4093751a
1#9 all 1000000 27864879 5e8f3da7 6590d6a9 01127390 04b70599 554465c2 682d5104 63313cc6
1#9 pass 999793 6d7b3f0e 2c2e05f4 2b03a598 6c12d12f 19dc7110 784ee553 6f950f4c 19ea0ca5
1#9 fail 207 3ffdc02d 2664fc17 46925c74 37e85373 416248db 2b359631 23ff8de9 4093751a
TODO
I have a few more things to work on, so labelling it as draft for now, but this is to get feedback on more ideas.
-
~~The order of read-groups should probably be alphanumeric sorted rather than "as it comes" out of the hash table.~~
-
~~We could consider ways to combine checksum output files. So file1.chk and file2.chk could be combined together to give the same output that we would expect if we were to
samtools mergeorsamtools catthe two BAMs together and checksum the combined data. This is a simple case of calling thesums_updatefunction to merge CRCs again.~~ -
~~Test data~~
Hi James, thank you for this PR and functionality! Can you elaborate more on the logic implemented here? From our testing, there are quite a few cases to consider when comparing matched BAM and CRAM:
- Ignore tags: "NM", "MD", and "RG"
- Compare absolutely value of TLEN
- For unmapped reads with non-zero mapping quality, CRAM discards the MAPQ value and CIGAR string- how do you handle this case?
- There are also sometimes insert size differences (out by 1, e.g. 162 vs 163) between CRAM and BAM?
- Flag values occasionally differ too, e.g. 67 vs 99 or 176 vs 177?
Good questions, and I'm up for looking at test data that exhibits some remaining problems.
However generally when doing CRAM checking with samtools checksum -a (the default is order agnostic and only the primary data consistent from fastq onwards), it's also doing data sanitization prior to checksumming. Ie like samtools view -z all.
This corrects many of the erroneous BAM fields put there by buggy aligners, including MAPQ on unmapped data and CIGAR strings on unmapped data and some BAM flags. It's not likely to change TLEN or other less obvious BAM flags though. It depends on your pipeline I guess, but I think locally we use may use fixmate to correct a lot of the aligner bugs while the data is still name collated, which obscures a lot of the TLEN out by one errors or incorrectly assigned BAM flags.
Anyway, if you have cases the round tripping isn't working on them I'm all ears and I'm happy to help add extra rules for known round trip problems. For example there's a new "cigarx" sanitizer option that replaces = and X CIGAR ops with M, and another "cigar" sanitizer which collapsed neighbouring identical operations together (eg 2M3M is 5M) and removes 0 length operations. Bizarrely both of those turn up in the wild - again likely due to buggy upstream processing! It's been quite an eye opener realising just how many oddly encoded files there are. :)
I thought I dealt with NM and MD already, but apparently not. You can do -t '*,NM,MD' to compute checksum of all tags bar NM and MD. Annoyingly this (the lack of knowledge of whether the input file had NM and/or MD) is something I've already fixed in the CRAM 4 prototype, but I can't see it becoming official for a long while sadly.
I can change the -a defaults to do this.
RG however shouldn't need to be removed. If RG is present, CRAM codes it as a number to reference the Nth RG record (just as BAM encodes RNAME as the Nth SQ record). If it finds RG >= 0 then on decode it converts back. This means it should always round trip provided the RG headers exist. I guess technically we could have a BAM file with an RG tag and no RG header, but I'd argue it's broken data again. Maybe this is something the sanitizer should fix. I'll explore more, but have you seen such files in the wild?
Part of me also thinks if there is known broken inputs in a non-common way (unlike MAPQ/CIGAR on unaligned data which is sadly very common due to bwa doing this), then it's not that unfair to require users to explicitly add extra options to filter out those mistakes. It at least brings to their attention a problem with their input data, which could be seen as a benefit.
I think I'm ready for review on this new command. There may be a need to add more sanitizer options or recommend fixmate etc first for CRAM round trips, but if there are differences highlighted then perhaps that's valid and correct.