d4-format
d4-format copied to clipboard
d4tools create can create wrong data for the last entry of a chromosome when writing non-compressed data.
d4tools create can create wrong data for the last entry of a chromosome when writing non-compressed data:
# Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
# -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry of a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compresssed.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph
# Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
# without compression: sometimes writes a wrong depth value as last entry for a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
# With compressed output samtools depth and d4tools give the same.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph
# With uncompressed output samtools depth and d4tools give the same.
# Only for 3 chromosomes out of 1870 chromosomes this problem appears.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
--- test.samtools_depth_MQ10_count_deletions.bedGraph 2025-01-20 11:37:33.000000000 +0100
+++ test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph 2025-01-20 17:47:31.000000000 +0100
@@ -15016909,7 +15016909,8 @@
211000022280772 13654 13701 1
211000022280772 13701 13713 0
211000022280772 13713 13762 1
-211000022280772 13762 15417 0
+211000022280772 13762 13763 0
+211000022280772 13763 15417 2
211000022280659 0 594 0
211000022280659 594 642 1
211000022280659 642 703 0
@@ -15017355,7 +15017356,8 @@
211000022279098 12261 12270 3
211000022279098 12270 12289 2
211000022279098 12289 12310 1
-211000022279098 12310 12368 0
+211000022279098 12310 12311 0
+211000022279098 12311 12368 2
211000022280761 0 75 0
211000022280761 75 123 1
211000022280761 123 1433 0
@@ -15030169,7 +15030171,8 @@
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
-211000022280548 1483 2618 0
+211000022280548 1483 1484 0
+211000022280548 1484 2618 2
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
@@ -15043475,7 +15043478,8 @@
211000022279504 1843 1850 15
211000022279504 1850 1854 14
211000022279504 1854 1855 12
-211000022279504 1855 2317 0
+211000022279504 1855 1856 0
+211000022279504 1856 2317 2
211000022279505 0 44 0
211000022279505 44 92 1
211000022279505 92 114 0
When running d4tools create only for those chromosomes, it does not seem to happen.
When adding a bunch of other chromosomes, sometimes it gives the correct output and sometimes the wrong output, so I assume it has something to do with a reuse of a buffer that does not get cleared properly (and due to threading, it does not always contain the same value):
uncompressed_output_instability () {
/staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools create -q 10 -F '~1796' test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 -f '^2R|^211000022[0-9]{6}' > /dev/null
/staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 \
| rg -A 5 211000022280548 \
| tail
}
$ uncompressed_output_instability
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
211000022280548 1483 1484 0
211000022280548 1484 2618 2 ==> wrong
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
211000022280554 277 330 0
211000022280554 330 377 1
$ uncompressed_output_instability
211000022280548 1265 1426 0
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
211000022280548 1483 2618 0 ==> correct
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
211000022280554 277 330 0
211000022280554 330 377 1
$ uncompressed_output_instability
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
211000022280548 1483 1484 0
211000022280548 1484 2618 2 ==> wrong
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
211000022280554 277 330 0
211000022280554 330 377 1
$ uncompressed_output_instability
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
211000022280548 1483 1484 0
211000022280548 1484 2618 2 ==> wrong
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
211000022280554 277 330 0
211000022280554 330 377 1
$ uncompressed_output_instability
211000022280548 1265 1426 0
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
211000022280548 1483 2618 0 ==> correct
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
211000022280554 277 330 0
211000022280554 330 377 1
Originally posted by @ghuls in https://github.com/38/d4-format/issues/97#issuecomment-2602908588