d4-format icon indicating copy to clipboard operation
d4-format copied to clipboard

d4tools create can create wrong data for the last entry of a chromosome when writing non-compressed data.

Open ghuls opened this issue 11 months ago • 0 comments

d4tools create can create wrong data for the last entry of a chromosome when writing non-compressed data:

# Run d4tools create:
#   -q 10: Filter reads with mapping quality smaller than 10.
#   -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
#   -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry of a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compresssed.d4
 

# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.d4 \
  > test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph


# Run d4tools create:
#   -q 10: Filter reads with mapping quality smaller than 10.
#   -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
#   without compression: sometimes writes a wrong depth value as last entry for a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4

# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4 \
  > test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
# With compressed output samtools depth and d4tools give the same.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph

# With uncompressed output samtools depth and d4tools give the same.
# Only for 3 chromosomes out of 1870 chromosomes this problem appears.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
--- test.samtools_depth_MQ10_count_deletions.bedGraph   2025-01-20 11:37:33.000000000 +0100
+++ test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph      2025-01-20 17:47:31.000000000 +0100
@@ -15016909,7 +15016909,8 @@
 211000022280772        13654   13701   1
 211000022280772        13701   13713   0
 211000022280772        13713   13762   1
-211000022280772        13762   15417   0
+211000022280772        13762   13763   0
+211000022280772        13763   15417   2
 211000022280659        0       594     0
 211000022280659        594     642     1
 211000022280659        642     703     0
@@ -15017355,7 +15017356,8 @@
 211000022279098        12261   12270   3
 211000022279098        12270   12289   2
 211000022279098        12289   12310   1
-211000022279098        12310   12368   0
+211000022279098        12310   12311   0
+211000022279098        12311   12368   2
 211000022280761        0       75      0
 211000022280761        75      123     1
 211000022280761        123     1433    0
@@ -15030169,7 +15030171,8 @@
 211000022280548        1426    1434    1
 211000022280548        1434    1474    2
 211000022280548        1474    1483    1
-211000022280548        1483    2618    0
+211000022280548        1483    1484    0
+211000022280548        1484    2618    2
 211000022280553        0       2561    0
 211000022280554        0       228     0
 211000022280554        228     277     1
@@ -15043475,7 +15043478,8 @@
 211000022279504        1843    1850    15
 211000022279504        1850    1854    14
 211000022279504        1854    1855    12
-211000022279504        1855    2317    0
+211000022279504        1855    1856    0
+211000022279504        1856    2317    2
 211000022279505        0       44      0
 211000022279505        44      92      1
 211000022279505        92      114     0

When running d4tools create only for those chromosomes, it does not seem to happen.

When adding a bunch of other chromosomes, sometimes it gives the correct output and sometimes the wrong output, so I assume it has something to do with a reuse of a buffer that does not get cleared properly (and due to threading, it does not always contain the same value):

uncompressed_output_instability () {
    /staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools create -q 10 -F '~1796' test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4  -f '^2R|^211000022[0-9]{6}' > /dev/null

    /staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 \
        | rg -A 5 211000022280548 \
        | tail
}

$ uncompressed_output_instability
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    1484    0
211000022280548 1484    2618    2  ==> wrong
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1265    1426    0
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    2618    0 ==> correct
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    1484    0
211000022280548 1484    2618    2 ==> wrong
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    1484    0
211000022280548 1484    2618    2 ==> wrong
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1265    1426    0
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    2618    0 ==> correct
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

Originally posted by @ghuls in https://github.com/38/d4-format/issues/97#issuecomment-2602908588

ghuls avatar Jan 22 '25 12:01 ghuls