d4-format
d4-format copied to clipboard
Bug: off-by-one bug on bam reads
I'll make a genome of 30 bases, and my BAM has one read, 21 bases in length, that spans from bases 5-25 (1-based start) or 4-25 (0-based start):
██████FAKE_READ██████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1 2 3
123456789012345678901234567890
It has this as a BAM:
$ samtools view -h test.bam
@HD VN:1.5 SO:coordinate
@SQ SN:chr1 LN:30
@PG ID:samtools PN:samtools VN:1.19.2 CL:samtools view -h test.bam
FAKE_READ 0 chr1 5 1 21M * 0 0 * * AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:21 YT:Z:UU
It should be 4 bases depth 0, followed by 21 bases depth 1, followed by 5 bases at depth 0. See the samtools output below:
$ samtools depth -a test.bam
chr1 1 0
chr1 2 0
chr1 3 0
chr1 4 0
chr1 5 1
chr1 6 1
chr1 7 1
chr1 8 1
chr1 9 1
chr1 10 1
chr1 11 1
chr1 12 1
chr1 13 1
chr1 14 1
chr1 15 1
chr1 16 1
chr1 17 1
chr1 18 1
chr1 19 1
chr1 20 1
chr1 21 1
chr1 22 1
chr1 23 1
chr1 24 1
chr1 25 1
chr1 26 0
chr1 27 0
chr1 28 0
chr1 29 0
chr1 30 0
And just to provide a little more info, here are a couple bedtools operations:
$ bedtools bamtobed -i test.bam
chr1 4 25 FAKE_READ 1 +
$ printf "chr1\t0\t30\n" > chr1.bed
$ bedtools coverage -a test.bam -b chr1.bed
chr1 4 25 FAKE_READ 1 + 4 25 0,0,0 1 21, 0, 1 21 21 1.0000000
Everything looks as expected to me. Now I'll make the d4 file:
$ d4tools create -q=1 test.bam test.d4
$ d4tools view test.d4
chr1 0 4 0
chr1 4 26 1
chr1 26 30 0
Here it appears that the d4tools believes my read is 22 bases long, so the end coordinate is off by one, more specifically it's +1 more than it should be.