MACS
Better control of cutoff levels for bdgpeakcall --cutoff-analysis
I'm running bdgpeakcall on a bedGraph file obtained from an input-normalised bigWig file (the operation used was subtraction). The minimum value in the file is approximately -5000 and the maximum approximately 1600, so when I run bdgpeakcall with --cutoff-analysis to see how many peaks different thresholds yield, I get this output:
```
score npeaks lpeaks avelpeak
1628.50 2 400 200.00
1537.51 3 600 200.00
1446.52 3 600 200.00
1355.53 6 1200 200.00
1264.54 7 1400 200.00
1173.55 8 1600 200.00
1082.56 8 1600 200.00
991.57 8 1600 200.00
900.58 8 1650 206.25
809.59 8 1800 225.00
718.60 8 1850 231.25
627.62 8 1950 243.75
536.63 12 2950 245.83
445.64 21 4800 228.57
354.65 64 13800 215.62
263.66 239 50350 210.67
172.67 1264 267400 211.55
81.68 8335 1857700 222.88
-9.31 8536 2980185963 349131.44
-100.30 497 1609818765 3239071.96
-191.29 322 1493852546 4639293.62
-282.28 250 1375339296 5501357.18
-373.27 241 1276799008 5297921.20
-464.26 215 1091708608 5077714.46
-555.24 203 962565827 4741703.58
-646.23 173 834599219 4824272.94
-737.22 157 833794369 5310792.16
-828.21 126 799328119 6343873.96
-919.20 103 799320255 7760390.83
-1010.19 89 799323455 8981162.42
-1101.18 84 787521955 9375261.37
-1192.17 72 787524255 10937836.88
-1283.16 55 742078594 13492338.07
-1374.15 51 742079994 14550588.12
-1465.14 42 606310444 14435962.95
-1556.13 33 465246139 14098367.85
-1647.12 32 465246489 14538952.78
-1738.11 32 465247089 14538971.53
-1829.09 31 465247139 15007972.23
-1920.08 26 465243889 17893995.73
-2011.07 23 465244139 20228006.04
-2102.06 24 465244389 19385182.88
-2193.05 23 465244639 20228027.78
-2284.04 22 465244839 21147492.68
-2375.03 22 465245589 21147526.77
-2466.02 22 465245989 21147544.95
-2557.01 19 465245439 24486602.05
-2648.00 16 465245639 29077852.44
-2738.99 15 465245689 31016379.27
-2829.98 13 333369089 25643776.08
-2920.97 13 333369089 25643776.08
-3011.96 13 333369089 25643776.08
-3102.95 12 333369339 27780778.25
-3193.94 12 333369439 27780786.58
-3284.92 11 333369489 30306317.18
-3375.91 10 333310289 33331028.90
-3466.90 10 333310289 33331028.90
-3557.89 8 124043389 15505423.62
-3648.88 8 124043389 15505423.62
-3739.87 6 123766200 20627700.00
-3830.86 6 123766250 20627708.33
-3921.85 6 123766250 20627708.33
-4012.84 6 123766300 20627716.67
-4103.83 6 123766300 20627716.67
-4194.82 6 123766350 20627725.00
-4285.81 5 123765950 24753190.00
-4376.80 5 123765950 24753190.00
-4467.79 5 123766000 24753200.00
-4558.77 4 123766050 30941512.50
-4649.76 4 123766050 30941512.50
-4740.75 4 123766050 30941512.50
-4831.74 3 123766100 41255366.67
-4922.73 2 123766200 61883100.00
-5013.72 1 123766250 123766250.00
-5104.71 1 123766250 123766250.00
```
Many of these results are useless: they explore negative cutoffs that end up lumping large stretches of the genome together into "peaks". More control over the cutoffs tested by --cutoff-analysis would give me better insight into the distribution of my data.
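As a stopgap, the negative-cutoff rows can be filtered out of the report after the fact. A minimal sketch (the column layout matches the output above; `filter_nonnegative` is just an illustrative helper, not part of MACS):

```python
def filter_nonnegative(lines):
    """Keep the header plus only the rows whose score (first column)
    is non-negative, given a cutoff-analysis report as a list of lines."""
    header, *rows = lines
    kept = [r for r in rows if float(r.split()[0]) >= 0]
    return [header] + kept

# Example with a few rows from the report above:
report = [
    "score npeaks lpeaks avelpeak",
    "1628.50 2 400 200.00",
    "81.68 8335 1857700 222.88",
    "-9.31 8536 2980185963 349131.44",
]
filtered = filter_nonnegative(report)
```

This only trims the report, of course; the analysis itself still wastes most of its steps on the negative range.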
One basic solution would be a "no-negative" option that stops --cutoff-analysis from considering negative thresholds, which seems relevant for data normalised like mine. A more flexible approach would be to let the user specify the minimum, maximum and step used to compute the cutoffs (which I assume is what --cutoff-analysis with --cutoff-analysis-steps already does, but using the min and max found in the file), allowing full control; I don't know how feasible that is.
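The requested option could amount to exposing the endpoints of the grid that --cutoff-analysis already appears to build (the ~91-point spacing between consecutive scores in the report above is consistent with an even grid over the file's min and max). A hypothetical sketch of that generation step with user-supplied bounds, where `lo` would come from an invented flag like `--cutoff-min` rather than the file:

```python
def cutoff_grid(lo, hi, steps):
    """Evenly spaced candidate cutoffs from hi down towards lo,
    mirroring an even grid over [lo, hi] with the given number of steps."""
    step = (hi - lo) / steps
    return [hi - i * step for i in range(steps)]

# With a user-supplied lower bound of 0 instead of the file minimum,
# no negative thresholds are ever tested:
grid = cutoff_grid(0.0, 1628.50, 20)
```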
Currently I'm rerunning the analysis with 500 cutoffs (--cutoff-analysis-steps 500), but more than half of them will still be unusable.
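A quick check of the range above suggests how wasteful this is: assuming the cutoffs are evenly spaced between the file's min and max, only about a quarter of them land at non-negative scores, regardless of how many steps are requested.

```python
# Observed extremes of the cutoff-analysis report above.
lo, hi = -5104.71, 1628.50

# Fraction of an even grid over [lo, hi] that falls at scores >= 0.
frac_nonneg = hi / (hi - lo)  # about 0.24
```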