Bug: Incorrect output v 0.3.10 in coverage of d4 vs summary.txt
Hi,
Tested with the singularity containers of mosdepth version 0.3.3 and 0.3.10, same result for both.
I run mosdepth on a cram file, with e.g. the options --d4 -c MT -f genome.fai input.cram, then for the MT chromosome the output shows a mean coverage of 127 in the d4 output file, when doing d4tools stat on the mosdepth.d4 file.
However, in the summary.txt file of mosdepth a mean coverage of 4201 is reported.
Generating the .per-base.bed.gz file in v0.3.3 and taking the average over the MT coverage using awk on column 4 of the mosdepth bed file also gives around 4200.
In the mosdepth d4 file:
MT 0 16569 127
In the mosdepth.summary.txt
chrom length bases mean min max
MT 16569 69615018 4201.52 168 4692
total 16569 69615018 4201.52 168 4692
For comparison, when running d4tools on the cram input file directly without using mosdepth:
MT 0 16569 4834.862755748687
Running it on the full genome also shows minor differences for each chromosome, e.g. coverage of 29-30 in the summary.txt, but around 26 in the mosdepth d4 output.
Is the --d4 option giving an incorrect output? It is a big difference between the summary file and the d4 file from mosdepth. I would be very grateful for any input on this.
I am having a look. Thanks for reporting
if it's simple and you could test on mosdepth version 0.3.11, that would be helpful
I haven't been able to reproduce this. If you can share a bam with just the chrM data, that would be a great help. The differences in the rest of the genome are likely due to how each tool handles overlapping paired-end reads.
if it's simple and you could test on mosdepth version 0.3.11, that would be helpful
Thank you for checking the issue.
I will not be able to test it there unfortunately. I can run using singularity or conda and the --d4 flag does not seem to be available for 0.3.11 at the moment.
It seems to happen for all of our short-read samples and part of our pacbio samples. I will generate an mt bam file and get back to you.
Looking into the code a bit, it seems that d4tools uses a primary table and a secondary table to store its data. For my data it chose 7 bits for the primary table (which makes the max possible ascii value to be 127), which will represent 99% of the coverage, and then other data will be put in the secondary table, which allows up to 80 bits worth of size.
This code in the d4tools nim file:
if v > max_value:
fd4.write_secondary(chrom, i, v)
else:
fd4.write_primary(chrom, i, v)
Can it be that when mosdepth makes a d4 file it does not use this logic? I can see a call to write, which seems to write in bulk to one table, which is where I assume this comes from? I am not so familair with nim, does it expect all values in arr to fit within the primary table bit-width? This would explain why the mt region, caps out at 127. In mosdepth.nim:
when defined(d4):
if use_d4:
fd4.write(target.name, 0, arr)
Hej again, just to comment on my promise to deliver a short-read mt bam file. The problem only exists on the whole bam file. When I only have the mt in there, d4tools picks a larger bitwidth when sampling the data, and then correct values are shown, so I will not be able to provide a subsampled file. For long-read data it looks like the data is overall more even, leading to 'sometimes' the correct coverage being shown.
My summary after digging into this:
- The 127 cap is a D4 format design decision, not a necessarily a mosdepth bug.
- mosdepth uses d4tools in a mode that writes bulk per-base depth arrays.
- If values exceed the chosen primary bit-width (e.g. 7 bits = 127 max), they should be written to the secondary table, but this fallback might not be triggered in bulk write mode.
- The cap happens because values over 127 don’t fit into the primary table, and if write() doesn't route them to the secondary table, they're clamped to 127.
- d4tools selects bit-width based on a sampling step, assuming typical genome coverage (~100×), which may not account for high-depth regions like mitochondrial DNA.
Suggested solutions:
- allow users to pick the bitwidth, e.g. by giving the max expected coverage
- implement secondary table writing as in the d4tools nim file displayed above.
- Use d4tools independently on the bed output of mosdepth.
Hi @peterpru , thanks for following up with a very thorough analysis. Can you clarify if this fix which you indicate:
if v > max_value:
fd4.write_secondary(chrom, i, v)
else:
fd4.write_primary(chrom, i, v)
would resolve the issue? And it's also not clear to me from your description if this issue is ever present in only d4tools. Is it? thanks again. Once I understand more, I'll certainly try to get the fix in.
Hi @peterpru , thanks for following up with a very thorough analysis. Can you clarify if this fix which you indicate:
if v > max_value: fd4.write_secondary(chrom, i, v) else: fd4.write_primary(chrom, i, v) would resolve the issue? And it's also not clear to me from your description if this issue is ever present in only d4tools. Is it? thanks again. Once I understand more, I'll certainly try to get the fix in.
Hi, apologies, this is a very late response. The 127 problem is not present in d4tools itself.
Regarding the fix, I am not 100% sure it will fix it, as I have not delved too deep into the code of d4tools or mosdepth, but as I understand it, d4tools stores all the values under e.g. 127 in table 1 and everything over in table 2. Mosdepth uses only the primary table, so all values higher than 127 get 127.