
Bug: incorrect coverage in the v0.3.10 d4 output vs summary.txt

Open peterpru opened this issue 4 months ago • 8 comments

Hi,

Tested with the singularity containers of mosdepth version 0.3.3 and 0.3.10, same result for both.

I ran mosdepth on a CRAM file with, e.g., the options --d4 -c MT -f genome.fai input.cram. For the MT chromosome, running d4tools stat on the resulting mosdepth .d4 file reports a mean coverage of 127.

However, in the summary.txt file of mosdepth a mean coverage of 4201 is reported.

Generating the .per-base.bed.gz file with v0.3.3 and taking the average of the MT coverage using awk on column 4 of the mosdepth bed file also gives around 4200.
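For reference, a minimal nim sketch of that averaging, length-weighted over the bed intervals (assuming the .per-base.bed.gz has been decompressed to a plain file named input.per-base.bed; the file name is just an example):

import std/strutils

# Length-weighted mean depth for MT from a mosdepth per-base BED
# (columns: chrom, start, end, depth).
var totalBases = 0'i64
var summedDepth = 0'i64
for line in lines("input.per-base.bed"):
  let fields = line.split('\t')
  if fields[0] != "MT":
    continue
  let span = parseInt(fields[2]) - parseInt(fields[1])
  totalBases += span.int64
  summedDepth += span.int64 * parseInt(fields[3]).int64
if totalBases > 0:
  echo "MT mean depth: ", summedDepth.float / totalBases.float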

In the mosdepth d4 file: MT 0 16569 127

In the mosdepth.summary.txt

chrom   length  bases   mean    min     max
MT      16569   69615018        4201.52 168     4692
total   16569   69615018        4201.52 168     4692

For comparison, when running d4tools on the cram input file directly without using mosdepth: MT 0 16569 4834.862755748687

Running it on the full genome also shows minor differences for each chromosome, e.g. coverage of 29-30 in the summary.txt, but around 26 in the mosdepth d4 output.

Is the --d4 option giving incorrect output? There is a big difference between the summary file and the d4 file from mosdepth. I would be very grateful for any input on this.

peterpru avatar Sep 12 '25 11:09 peterpru

I am having a look. Thanks for reporting

brentp avatar Sep 16 '25 15:09 brentp

If it's simple for you to test on mosdepth version 0.3.11, that would be helpful.

brentp avatar Sep 16 '25 16:09 brentp

I haven't been able to reproduce this. If you can share a bam with just the chrM data, that would be a great help. The differences in the rest of the genome are likely due to how each tool handles overlapping paired-end reads.

brentp avatar Sep 16 '25 17:09 brentp

If it's simple for you to test on mosdepth version 0.3.11, that would be helpful.

Thank you for checking the issue.

Unfortunately, I will not be able to test that. I can run mosdepth using singularity or conda, and the --d4 flag does not seem to be available for 0.3.11 there at the moment.

It seems to happen for all of our short-read samples and for some of our PacBio samples. I will generate an MT BAM file and get back to you.

peterpru avatar Sep 17 '25 10:09 peterpru

Looking into the code a bit, it seems that the D4 format uses a primary table and a secondary table to store its data. For my data it chose 7 bits for the primary table (which makes the maximum representable value 127), which covers about 99% of the coverage values; anything larger is then put into the secondary table, which allows values up to 80 bits wide.
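As a quick illustration of what that 7-bit cap means in practice (plain nim, no d4 involved; the example depths are just illustrative):

# With a 7-bit primary table the largest storable value is 2^7 - 1 = 127,
# so any depth above that must go to the secondary table or it gets clamped.
let primaryBits = 7
let maxValue = (1 shl primaryBits) - 1   # 127
for depth in [50, 127, 4201]:
  echo depth, " -> ", min(depth, maxValue)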

This is the code in the d4tools nim file:

if v > max_value:
    fd4.write_secondary(chrom, i, v)
else:
    fd4.write_primary(chrom, i, v)

Can it be that when mosdepth makes a d4 file it does not use this logic? I can see a call to write, which seems to write in bulk to one table, and I assume that is where this comes from. I am not so familiar with nim; does it expect all values in arr to fit within the primary table bit-width? This would explain why the MT region caps out at 127. In mosdepth.nim:

when defined(d4):
    if use_d4:
        fd4.write(target.name, 0, arr)
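
For what it's worth, a rough, untested sketch of how the per-value routing from the snippet above could replace the bulk write (this assumes write_primary/write_secondary are available on fd4 as in the d4tools nim code quoted earlier, and that max_value is the largest value the primary table can hold):

when defined(d4):
    if use_d4:
        # Sketch only: route each position individually instead of one bulk write.
        for i, v in arr:
            if v > max_value:
                fd4.write_secondary(target.name, i, v)
            else:
                fd4.write_primary(target.name, i, v)

Per-position writes would likely be slow; batching runs of values above/below the threshold would presumably be the real fix, but this shows the idea.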

peterpru avatar Sep 17 '25 13:09 peterpru

Hi again, just to follow up on my promise to deliver a short-read MT BAM file: the problem only exists on the whole BAM file. When only the MT is in there, d4 picks a larger bit-width when sampling the data and then the correct values are shown, so I will not be able to provide a subsampled file. For long-read data the coverage looks overall more even, which leads to the correct coverage 'sometimes' being shown.

My summary after digging into this:

  • The 127 cap is a D4 format design decision, not necessarily a mosdepth bug.
  • mosdepth uses d4tools in a mode that writes bulk per-base depth arrays.
  • If values exceed the chosen primary bit-width (e.g. 7 bits = 127 max), they should be written to the secondary table, but this fallback might not be triggered in bulk write mode.
  • The cap happens because values over 127 don’t fit into the primary table, and if write() doesn't route them to the secondary table, they're clamped to 127.
  • d4tools selects bit-width based on a sampling step, assuming typical genome coverage (~100×), which may not account for high-depth regions like mitochondrial DNA.

Suggested solutions:

  1. Allow users to pick the bit-width, e.g. by giving the maximum expected coverage (see the sketch after this list).
  2. Implement secondary-table writing as in the d4tools nim code shown above.
  3. Use d4tools independently on the bed output of mosdepth.
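
To illustrate option 1 (the sketch referenced in that item), deriving the primary bit-width from a user-supplied maximum expected coverage is straightforward; the proc name below is made up and nothing here touches the real d4 API:

# Hypothetical helper: smallest bit-width whose maximum representable value
# (2^bits - 1) is at least the user-supplied maximum expected depth.
proc bitsForMaxDepth(maxDepth: int): int =
  result = 1
  while (1 shl result) - 1 < maxDepth:
    inc result

echo bitsForMaxDepth(127)    # 7
echo bitsForMaxDepth(5000)   # 13, enough for an MT depth of ~4200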

peterpru avatar Sep 30 '25 11:09 peterpru

Hi @peterpru, thanks for following up with a very thorough analysis. Can you clarify whether this fix which you indicate:

if v > max_value:
    fd4.write_secondary(chrom, i, v)
else:
    fd4.write_primary(chrom, i, v)

would resolve the issue? It's also not clear to me from your description whether this issue is ever present in d4tools alone. Is it? Thanks again. Once I understand more, I'll certainly try to get the fix in.

brentp avatar Sep 30 '25 16:09 brentp

Hi @peterpru, thanks for following up with a very thorough analysis. Can you clarify whether this fix which you indicate:

if v > max_value:
    fd4.write_secondary(chrom, i, v)
else:
    fd4.write_primary(chrom, i, v)

would resolve the issue? It's also not clear to me from your description whether this issue is ever present in d4tools alone. Is it? Thanks again. Once I understand more, I'll certainly try to get the fix in.

Hi, apologies for the very late response. The 127 problem is not present in d4tools itself.

Regarding the fix, I am not 100% sure it will resolve the issue, as I have not delved too deeply into the code of d4tools or mosdepth. But as I understand it, d4 stores all values up to, e.g., 127 in the primary table and everything above that in the secondary table, while mosdepth writes only to the primary table, so all values higher than 127 become 127.

peterpru avatar Oct 14 '25 09:10 peterpru