atacseq icon indicating copy to clipboard operation
atacseq copied to clipboard

Peak calling parameters might not be ideal

Open ktrns opened this issue 2 years ago • 3 comments

Dear all,

We had started to discuss this topic in Slack, but I decided to raise this issue so the comments won't get lost.

In brief, you currently use the macs2 settings originally suggested for ATAC-seq data: -f "BAMPE" --nomodel. I have spent some time on reading, and I find this discussion very helpful: https://twitter.com/XiChenUoM/status/1336658454866325506

In brief, by using -f "BAMPE" we focus on only 1 of the 2 cut sites of the fragment, which does not seem to be the most appropriate way to handle ATAC-seq data.

Based on what I read, converting bam to bed, and then use -f bed (or bedpe, still have to figure that out) together with --shift -75 --extsize 150 is better. You would basically use both reads R1 and R2, and create 150bp reads with the mid-point on the cut-site of the transposase. Another twitter reply to the above thread said that Encode3 used --shift -37 --extsize 73.

I will likely start to try this to see if it gives an improvement.

Thanks a lot for your valuable work on the pipeline! Katrin

ktrns avatar Feb 17 '22 13:02 ktrns

Dear all,

I have had a closer look into this issue. I believe that the way macs2 is used for this pipeline is wrong. To understand my point it is very helpful to look at the link I pasted above.

For ChIP-seq, we are interested in the region between paired reads. In contrast, for ATAC-seq, we are interested in the individual reads as they represent the cut sites of the Tn5 transposase and hence open chromatin.

As can be seen in the insert size distribution of ATAC-seq data (wavy pattern), many read pairs span one or more nucleosomes. In its current state, the pipeline produces bigWig files and calls macs2 based on fragments, i.e. regions spanned by read pairs. This means that often what we detect are nucleosomes rather than open chromatin.

One reason the issue was never really obvious in IGV is that also the bigWig files are generated based on fragments in the nf-core code. This way the "wrong" peaks match the "wrong” coverage tracks.

I gave it a try and called macs2 as suggested above in my first post:

# bam to bed
# We want to treat each cut site as a single event
bedtools bamtobed -i sampleA.sorted.bam >> sampleA.sorted.bed

# macs2
macs2 \
  callpeak \
  --keep-dup all --nomodel \
  -f BED \
  --shift -75 --extsize 150 \
  --name sampleA \
  --treatment sampleA.bed

In the attached image, you see the following tracks from top to bottom:

  • consensus peaks produced by the pipeline (black; there were more samples than the one shown)
  • bigWig produced by the pipeline (dark blue)
  • narrow peaks detected by the pipeline (lighter blue)
  • coverage track generated by me based on a single-end bed file (which itself is based on the bam file by the pipeline) (salmon coloured)
  • macs peaks generated by me (pink) with --shift -75 --extsize 150
  • macs peaks generated by me (pink) with --shift -100 --extsize 200
  • loaded bam file produced by the pipeline
Screenshot 2023-05-22 at 20 29 48

I should say that also this representation is not ideal, as in theory the visualised reads are running into the nucleosome, whereas macs2 uses one end of the read and generates artificial reads around the cut site itself. Therefore reads and peaks still don’t match exactly.

The example still shows the effect of using fragments rather than the cut sites. The fragments (blue tracks) pretty much include/span (very likely) one nuclesome.

I hope all of this makes some sense to you :-). Please share your opinion on this, am I misinterpreting the results? I am curious to see what you think.

I will try a few more examples, and see if I can fix this systematically for the pipeline.

Best & many thanks Katrin

ktrns avatar May 22 '23 18:05 ktrns

@ktrns and @JoseEspinosa,

This may be a silly question but in the region that is shown above, we are assuming the depleted region between the two peaks is where the nucleosome is, right? Can't it be where TF bind instead? How do you tell the difference?

Also, if it is TF binding, having a peak that only capture the open chromatin region on either side of the TF will not include their motifs. When we do motif discovery on the peaks, then the expected TF motifs won't be found?

Also in this suggested macs callpeak parameter, we no longer call broad peaks?

TIA for your insights.

olechnwin avatar Aug 03 '23 17:08 olechnwin

Hello,

I saw that some papers in which they use macs2 for ATAC-seq they previously filter the fragment sizes to get a set corresponding to nucleosome free regions (NFR) and then they apply macs2 callpeak to them. Do you think this would solve the issue that @ktrns mentions about "visualised reads are running into the nucleosome"? Let me know what you think about this.

Best regards, Sergio

SergioRodLla avatar Jan 24 '24 08:01 SergioRodLla