MACS icon indicating copy to clipboard operation
MACS copied to clipboard

Issue: different versions gives different results

Open sklasfeld opened this issue 4 years ago • 3 comments

Use case I had 20 different bam files from different publications (so some have different read-sizes) and I ran MACS2 with them. For the code below I will just refer to the file names for all 20 as 'X'.

I first ran MACS version 2.1.1.20160309 with the following command: macs2 callpeak -t X -f BAM --keep-dup auto --nomodel -g 101274395 --broad --outdir outdir -n pooled_treatments

I then ran MACS2 version 2.2.7.1 with the following command: macs2 callpeak -t X -f BAM --keep-dup auto --nomodel -g 101274395 --broad --outdir outdir -n pooled_treatments2

I get drastically different results. The results in the new version are less broad.

Describe the problem Different versions of MACS2 give wildly different results

Describe the solution you tried I haven't found a way to get the new code to run like the old code. Ive tried extending the maxGap length, the minLength, I tried not using the broad parameter. None of these methods seem to give the results from the previous version.

Additional context Not that I can think of.

sklasfeld avatar Sep 23 '20 15:09 sklasfeld

There are many changes over years including fixing bugs related to overflow issues https://github.com/macs3-project/MACS/blob/master/ChangeLog. I am not sure exactly which one causes the difference in your data analysis. Could you share the results from both versions?

taoliu avatar Sep 30 '20 20:09 taoliu

@sklasfeld Thanks for sharing! I have checked the codes between 2.1.1.20160309 and the most recent 2.2.7 and don't see big difference on peak calling part, and I tested the output from my testing data and the two versions outputted similar results -- the differences are caused by precisions. However, there have been some issues fixed related to computing p/q-values when numbers are big in the old version. As for your data, it seems you have totally 400million reads on a 100million bp genome. My feeling is that the coverage may be too high for a single peak-calling process. By average, the 400million 50bps reads will provide 200x coverage on the 100mbp genome. The sequencing depth may be over-saturated. Do you mind 'subsample' your data (e.g. 10% or even 1%) and run peak calling again to see if the issue is still there?

taoliu avatar Oct 08 '20 22:10 taoliu