MACS
MACS copied to clipboard
Issue: different versions gives different results
Use case I had 20 different bam files from different publications (so some have different read-sizes) and I ran MACS2 with them. For the code below I will just refer to the file names for all 20 as 'X'.
I first ran MACS version 2.1.1.20160309 with the following command:
macs2 callpeak -t X -f BAM --keep-dup auto --nomodel -g 101274395 --broad --outdir outdir -n pooled_treatments
I then ran MACS2 version 2.2.7.1 with the following command:
macs2 callpeak -t X -f BAM --keep-dup auto --nomodel -g 101274395 --broad --outdir outdir -n pooled_treatments2
I get drastically different results. The results in the new version are less broad.
Describe the problem Different versions of MACS2 give wildly different results
Describe the solution you tried I haven't found a way to get the new code to run like the old code. Ive tried extending the maxGap length, the minLength, I tried not using the broad parameter. None of these methods seem to give the results from the previous version.
Additional context Not that I can think of.
There are many changes over years including fixing bugs related to overflow issues https://github.com/macs3-project/MACS/blob/master/ChangeLog. I am not sure exactly which one causes the difference in your data analysis. Could you share the results from both versions?
Here are the broadpeak and xls files
version2.1.1.20160309_peaks.xls.txt version2.2.7.1_peaks.xls.txt version2.2.7.1_peaks.broadPeak.txt version2.1.1.20160309_peaks.broadPeak.txt
@sklasfeld Thanks for sharing! I have checked the codes between 2.1.1.20160309 and the most recent 2.2.7 and don't see big difference on peak calling part, and I tested the output from my testing data and the two versions outputted similar results -- the differences are caused by precisions. However, there have been some issues fixed related to computing p/q-values when numbers are big in the old version. As for your data, it seems you have totally 400million reads on a 100million bp genome. My feeling is that the coverage may be too high for a single peak-calling process. By average, the 400million 50bps reads will provide 200x coverage on the 100mbp genome. The sequencing depth may be over-saturated. Do you mind 'subsample' your data (e.g. 10% or even 1%) and run peak calling again to see if the issue is still there?