samtools icon indicating copy to clipboard operation
samtools copied to clipboard

Add "samtools checksum" command. Needs htslib#1850

Open jkbonfield opened this issue 1 year ago • 4 comments

This serves two purposes.

bamseqchksum

It's a replacement for biobambam's bamseqchksum tool (but about twice as fast), which is heavily used at Sanger. It should give compatible results with the -qv option.

This checksums sequence, name, quality and barcode tags in an order and orientation agnostic manner, to facilitate validation of no data loss between raw fastq (or unmapped crams) through alignment, markdup, sorting, and other processing operations to get to the final aligned bam/cram.

$ bamseqchksum inputformat=bam < ~/scratch/data/NA20800.chrom20.ILLUMINA.bwa.TSI.low_coverage.20101123.bam
###	set	count		b_seq	name_b_seq	b_seq_qual	b_seq_tags(BC,FI,QT,RT,TC)
all	all	5074609		37d3bb27	7cb87cff	31b22362	37d3bb27
all	pass	5074609		37d3bb27	7cb87cff	31b22362	37d3bb27
	all	0		1	1	1	1
	pass	0		1	1	1	1
ERR005381	all	333473		764d99c2	61cb8e25	2983721e	764d99c2
ERR005381	pass	333473		764d99c2	61cb8e25	2983721e	764d99c2
ERR005382	all	326713		51f5df68	49ac2949	8f70ed9	51f5df68
ERR005382	pass	326713		51f5df68	49ac2949	8f70ed9	51f5df68
ERR005383	all	356619		10d093ad	2d2878a	4b0b5812	10d093ad
ERR005383	pass	356619		10d093ad	2d2878a	4b0b5812	10d093ad
ERR005384	all	346758		3bd87b98	2d099332	533eb72d	3bd87b98
ERR005384	pass	346758		3bd87b98	2d099332	533eb72d	3bd87b98
ERR005385	all	351289		6928f20c	4efcdf7f	10e0ab02	6928f20c
ERR005385	pass	351289		6928f20c	4efcdf7f	10e0ab02	6928f20c
ERR005386	all	349519		4786000b	2b31a30f	7ed25ff5	4786000b
ERR005386	pass	349519		4786000b	2b31a30f	7ed25ff5	4786000b
ERR005387	all	331932		77fe880c	7609b384	19dd48a6	77fe880c
ERR005387	pass	331932		77fe880c	7609b384	19dd48a6	77fe880c
ERR005729	all	638777		705e0fe3	545e7116	767ea15f	705e0fe3
ERR005729	pass	638777		705e0fe3	545e7116	767ea15f	705e0fe3
ERR005730	all	668672		b820a01	791d771d	53cb5615	b820a01
ERR005730	pass	668672		b820a01	791d771d	53cb5615	b820a01
ERR005731	all	654943		17137ecb	64674261	41b826a9	17137ecb
ERR005731	pass	654943		17137ecb	64674261	41b826a9	17137ecb
ERR013025	all	715914		284721c5	513cad00	60c8ef16	284721c5
ERR013025	pass	715914		284721c5	513cad00	60c8ef16	284721c5

vs

samtools checksum ~/scratch/data/NA20800.chrom20.ILLUMINA.bwa.TSI.low_coverage.20101123.bam
# Checksum for file: /nfs/users/nfs_j/jkb/scratch/data/NA20800.chrom20.ILLUMINA.bwa.TSI.low_coverage.20101123.bam
# Aux tags:          BC,FI,QT,RT,TC
# BAM flags:         PAIRED,READ1,READ2

# Group    QC        count  flag+seq  +name     +qual     +aux      combined
all        all     5074609  37d3bb27  7cb87cff  31b22362  37d3bb27  37fece8e
ERR005730  all      668672  0b820a01  791d771d  53cb5615  0b820a01  05acae7f
ERR005729  all      638777  705e0fe3  545e7116  767ea15f  705e0fe3  1d95d784
ERR005386  all      349519  4786000b  2b31a30f  7ed25ff5  4786000b  26dcaaf2
ERR005381  all      333473  764d99c2  61cb8e25  2983721e  764d99c2  2e286b89
ERR005387  all      331932  77fe880c  7609b384  19dd48a6  77fe880c  43b22e93
ERR005382  all      326713  51f5df68  49ac2949  08f70ed9  51f5df68  469ad363
ERR005383  all      356619  10d093ad  02d2878a  4b0b5812  10d093ad  02a82ac4
ERR005731  all      654943  17137ecb  64674261  41b826a9  17137ecb  509aaa8a
ERR005384  all      346758  3bd87b98  2d099332  533eb72d  3bd87b98  0dd51d05
ERR013025  all      715914  284721c5  513cad00  60c8ef16  284721c5  3ffda5ab
ERR005385  all      351289  6928f20c  4efcdf7f  10e0ab02  6928f20c  2f0ec1ec

If we want the QC pass/fail, we can use -q for that:

samtools checksum /tmp/_.bam -q
# Checksum for file: /tmp/_.bam
# Aux tags:          BC,FI,QT,RT,TC
# BAM flags:         PAIRED,READ1,READ2

# Group    QC        count  flag+seq  +name     +qual     +aux      combined
all        all     1000000  095f615d  657c16e5  3aff0ac9  0aa9a779  7bca26fb
all        pass     999793  25141c11  09589cc0  0d0d8bf5  45dc936c  7f98ccb2
all        fail        207  6241cda7  0aad2788  7ff9617d  49005afd  15eb4515
1#9        all     1000000  095f615d  657c16e5  3aff0ac9  0aa9a779  7bca26fb
1#9        pass     999793  25141c11  09589cc0  0d0d8bf5  45dc936c  7f98ccb2
1#9        fail        207  6241cda7  0aad2788  7ff9617d  49005afd  15eb4515

Unlike bamseqchksum we report the fail as well as pass, but only if non-zero. (Unless -v is enabled where zero rows are reported, for easier parsing)

File format round trips

We also have a need to validate that conversion between various file types such as SAM to BAM to CRAM doesn't change anything, or at least nothing that we don't already know about (such as CIGAR =/X vs M).

This is the -a option, which is a catch-all for all the other necessary options to force secondary/supplementary, more columns, full BAM flags, all tags, file order and orientation validation, etc.

samtools checksum /tmp/_.bam -aq
# Checksum for file: /tmp/_.bam
# Aux tags:          *,cF
# BAM flags:         PAIRED,PROPER_PAIR,UNMAP,MUNMAP,REVERSE,MREVERSE,READ1,READ2,SECONDARY,QCFAIL,DUP,SUPPLEMENTARY

# Group    QC        count  flag+seq  +name     +qual     +aux      +chr/pos  +cigar    +mate     combined
all        all     1000000  27864879  5e8f3da7  6590d6a9  01127390  04b70599  554465c2  682d5104  63313cc6
all        pass     999793  6d7b3f0e  2c2e05f4  2b03a598  6c12d12f  19dc7110  784ee553  6f950f4c  19ea0ca5
all        fail        207  3ffdc02d  2664fc17  46925c74  37e85373  416248db  2b359631  23ff8de9  4093751a
1#9        all     1000000  27864879  5e8f3da7  6590d6a9  01127390  04b70599  554465c2  682d5104  63313cc6
1#9        pass     999793  6d7b3f0e  2c2e05f4  2b03a598  6c12d12f  19dc7110  784ee553  6f950f4c  19ea0ca5
1#9        fail        207  3ffdc02d  2664fc17  46925c74  37e85373  416248db  2b359631  23ff8de9  4093751a

TODO

I have a few more things to work on, so labelling it as draft for now, but this is to get feedback on more ideas.

  • ~~The order of read-groups should probably be alphanumeric sorted rather than "as it comes" out of the hash table.~~

  • ~~We could consider ways to combine checksum output files. So file1.chk and file2.chk could be combined together to give the same output that we would expect if we were to samtools merge or samtools cat the two BAMs together and checksum the combined data. This is a simple case of calling the sums_update function to merge CRCs again.~~

  • ~~Test data~~

jkbonfield avatar Oct 08 '24 11:10 jkbonfield

Hi James, thank you for this PR and functionality! Can you elaborate more on the logic implemented here? From our testing, there are quite a few cases to consider when comparing matched BAM and CRAM:

  • Ignore tags: "NM", "MD", and "RG"
  • Compare absolutely value of TLEN
  • For unmapped reads with non-zero mapping quality, CRAM discards the MAPQ value and CIGAR string- how do you handle this case?
  • There are also sometimes insert size differences (out by 1, e.g. 162 vs 163) between CRAM and BAM?
  • Flag values occasionally differ too, e.g. 67 vs 99 or 176 vs 177?

chris-rands avatar Oct 15 '24 17:10 chris-rands

Good questions, and I'm up for looking at test data that exhibits some remaining problems.

However generally when doing CRAM checking with samtools checksum -a (the default is order agnostic and only the primary data consistent from fastq onwards), it's also doing data sanitization prior to checksumming. Ie like samtools view -z all.

This corrects many of the erroneous BAM fields put there by buggy aligners, including MAPQ on unmapped data and CIGAR strings on unmapped data and some BAM flags. It's not likely to change TLEN or other less obvious BAM flags though. It depends on your pipeline I guess, but I think locally we use may use fixmate to correct a lot of the aligner bugs while the data is still name collated, which obscures a lot of the TLEN out by one errors or incorrectly assigned BAM flags.

Anyway, if you have cases the round tripping isn't working on them I'm all ears and I'm happy to help add extra rules for known round trip problems. For example there's a new "cigarx" sanitizer option that replaces = and X CIGAR ops with M, and another "cigar" sanitizer which collapsed neighbouring identical operations together (eg 2M3M is 5M) and removes 0 length operations. Bizarrely both of those turn up in the wild - again likely due to buggy upstream processing! It's been quite an eye opener realising just how many oddly encoded files there are. :)

jkbonfield avatar Oct 16 '24 08:10 jkbonfield

I thought I dealt with NM and MD already, but apparently not. You can do -t '*,NM,MD' to compute checksum of all tags bar NM and MD. Annoyingly this (the lack of knowledge of whether the input file had NM and/or MD) is something I've already fixed in the CRAM 4 prototype, but I can't see it becoming official for a long while sadly.

I can change the -a defaults to do this.

RG however shouldn't need to be removed. If RG is present, CRAM codes it as a number to reference the Nth RG record (just as BAM encodes RNAME as the Nth SQ record). If it finds RG >= 0 then on decode it converts back. This means it should always round trip provided the RG headers exist. I guess technically we could have a BAM file with an RG tag and no RG header, but I'd argue it's broken data again. Maybe this is something the sanitizer should fix. I'll explore more, but have you seen such files in the wild?

Part of me also thinks if there is known broken inputs in a non-common way (unlike MAPQ/CIGAR on unaligned data which is sadly very common due to bwa doing this), then it's not that unfair to require users to explicitly add extra options to filter out those mistakes. It at least brings to their attention a problem with their input data, which could be seen as a benefit.

jkbonfield avatar Oct 16 '24 08:10 jkbonfield

I think I'm ready for review on this new command. There may be a need to add more sanitizer options or recommend fixmate etc first for CRAM round trips, but if there are differences highlighted then perhaps that's valid and correct.

jkbonfield avatar Oct 17 '24 10:10 jkbonfield