BamHash
BamHash copied to clipboard
FASTQ read name/description when computing hash
Hi,
I have a question regarding how you use the FASTQ description field to calculate the hash. The question arises because I've been doing some tests and I only manage to get the same md5 using the -R
option in both the BAM file and the original FASTQ files.
More details following:
I've trimmed a FASTQ sample for testing purposes, the reads look like this:
~> zcat ../data/NA12878_trimmed_1.fastq.gz | head -n 4
@ERR194147.1 HSQ1004:134:C0D8DACXX:1:1104:3874:86238/1
GGTTCCTACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACG
+
CC@FFFFFHHHHHJJJFHIIJJJJJJIHJIIJJJJJJJJIIGIJJIJJJIJJJIJIJJJJJJJJJJIJHHHHFFFDEEEEEEEEDDDCDDEEDDDDDDDDD
When, after the analysis, I run bamhash_checksum_fastq
in the FASTQ files and bamhash_checksum_bam
in the resulting BAM file I get different md5's:
~> bamhash_checksum_fastq ../data/NA12878_trimmed_*
a05de49644a0fb5d 10000
~> bamhash_checksum_bam final/NA12878_trimmed/NA12878_trimmed-ready.bam
d4d5ece0f619d83d 20000
If I convert the BAM file back to FASTQ I realised that the FASTQ read description disappears, i.e:
~> samtools fastq final/NA12878_trimmed/NA12878_trimmed-ready.bam | head -n 4
@ERR194147.6389/2
CATCGGATTTTTGTTTTTTTTGTTTTGGGTGGGGGGGGTTGGTGGGGTTGTGTGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGTGGTTGG
+
11++40+2)2,+)2:3AA8)))00)))((('''''&&&))&&(((&&&&&&(((++(&05;BB7@>B@BDBBDDB>BDD@@3>9<&5-&&&&&&)&&)+(&
Is that description after the readname used to calculate the hash? I'm pretty confident that this is the problem, since if I run BamHash with -R
it does return the expected result:
~> bamhash_checksum_fastq -R ../data/NA12878_trimmed_1.fastq.gz ../data/NA12878_trimmed_2.fastq.gz
f4524c00c70e9b83 10000
~> bamhash_checksum_bam -R final/NA12878_trimmed/NA12878_trimmed-ready.bam
f4524c00c70e9b83 20000
Thanks for your help!
@guillermo-carrasco could you look whether you get any inconsistencies between the first in pair/second in pair BAM flag and the endings of "/0" or "/1" in any of the FASTQ files? This is what gave us some headaches.