Peak Calling with macs2
Hello, thanks again for the nice tool! I have been using your function for pseudobulk and macs2 peak calling and while it seems to work fine, after inspection of peak calls I found them to be weird in some cases, especially at highly transcribed genes, where large peaks spanning the entire genes are found, rather than defined peaks. I am attaching example plots where I compare 3 tracks
- Default parameters of the call_peaks_macs() function:
--format BED --call-summits --keep-dup all --shift -75 --extsize 150 --nomodel --nolambda - Remove duplicated in the call_peaks_macs() function:
--format BED --call-summits --keep-dup 1 --shift -75 --extsize 150 --nomodel --nolambda - Default peak calling with Signac CallPeaks() function:
--format BED --keep-dup 1 --nomodel --extsize 200 --shift -100 - Peak calling with Signac CallPeaks() function using parameters to match bpcell default:
--format BED --call-summits --keep-dup all --shift -75 --extsize 150 --nomodel --nolambda
The peak tracks are from Monocyte CD14 cells and I am showing an example of highly transcribed gene (HLA-DRA) and a gene that is not expressed (but that show accessible sites nevertheless and have same calls in all 4 versions)
Here I am also attaching the number of peaks and size distribution
I know that there are are a lot of different elements that go into the peak calling such as the input BED file (signac uses the fragment file coordinates), the shift/extend option, BED or BEDPE, keep dup 1 or keep dup all, so I am starting to get confused on which is the best combination of parameters to get peaks that are of reasonable size and reflecting the celltype of origin. Number 3 looks the best however I think I am missing a lot of signal. Perhaps you guys that have thought a lot about it have a solution? Or at least an explanation regarding the method you have implemented and your best practice suggestion. Thanks a lot! Paola
Hi Paola @plbngl, Thanks for using our tool!
This actually was a big point of discussion in our lab, and we get a lot of different opinions on how to use the peak caller.
To give some initial perspective on how I wrapped macs, each one of your IterableFragments can be reduced down to a long tibble of chr, start, end, and vals tuples. We then convert this into a bed by taking each start and end coordinate (can be parameterized), and turning them into length 1 insertions. Signac instead just uses the deduped frag files but still feeds the entire fragment information (paired ends).
I personally used the defaults from suggestion from Ryan Corces (ArchR), but there is intuition why these defaults were chosen. The dupes are held because identical cut positions can still be real signal. The shifts and extensions centers fragments on the tn5 insertion, by shifting each read by half the extension size. The --nolambda is ignores the dynamic background, which is good for summit centric recipes for sensitivity especially in low sample size pseudobulks, but has a higher false positive rate. The expectation is that the user will filter based on blacklist. However, I should make this a little bit more clear in the documentation.
In any case, I would be interested in seeing your results when removing --nolambda on your peak calling. I'd be happy to change the defaults given how this tool is made for larger datasets anyways.
Also just in case, we default to macs3 if it is installed in your system, and you do not provide a path executable. Was the peak calling in BPCells set to macs2?
I am busy this week, but I would be happy to chat more offline and potentially turn this into a short article on our website going forward.
Hello! Thanks for the useful discussion! All is very clear. Initially I thought that the main difference would be the keep-dup but after some tests and your suggestion it is clear that the main difference is due to the --nolambda parameter. My pseudobulks range from a minimum of 500 cells to a max of 10,000 (capped) and the monocyte example I am showing is 10k cells, so it makes sense. Below I paste the repeated analysis with the default params or changing either the keepdup, extsize or lambda options.
Given the results above it seems that the nolambda false keep peak calls short (too short maybe?) and less sensitive, which seems visually more correct in this case and resolve the issue I saw (but maybe now miss some signal?).
-
Yes I used a macs2 executable not macs3 -- what would be the main difference?
-
Extra question - how to change the trackplot_genome_annotation() to a different color scale than Purple, for example Grayscale?
-
Bonus discussion point: when merging peaks calls, I see you provide the iterative approach from ArchR, but I cannot decide if I want to use it because 1) peaks remain pretty small even after merging ( most <500 bp, and < 4kb the longest one), 2) unclear how to normalize scores and sort peaks, and how this affect the celltype from which the peak is kept (which might not be the one with the actual best signal and therefore best call? Do you have a comment on that as well?
Thank you very much again, Best Paola
Hi @plbngl, that is around what we would expect!
Just to answer your questions first:
Yes I used a macs2 executable not macs3 -- what would be the main difference?
I haven't done in-depth analysis on this but from what I understand there is a little bit better single-cell resolution for CHIP and scATAC data. I don't think it would hurt to try swapping out your macs and seeing if it gives different results.
Extra question - how to change the trackplot_genome_annotation() to a different color scale than Purple, for example Grayscale?
The colors argument! We set it as a default based off of enrichment: c(munsell::mnsl("5P 3/12"), munsell::mnsl("5P 7/12"). But you can do something really simple like colors = "grey". We also went back on forth on colour choice when choosing styling. I think if you don't give a color_by argument, everything is just set to black.
Bonus discussion point: when merging peaks calls, I see you provide the iterative approach from ArchR, but I cannot decide if I want to use it because 1) peaks remain pretty small even after merging ( most <500 bp, and < 4kb the longest one), 2) unclear how to normalize scores and sort peaks, and how this affect the celltype from which the peak is kept (which might not be the one with the actual best signal and therefore best call? Do you have a comment on that as well?
The reason why peaks remain small is because we conciously choose to keep peak sizes fixed, and only remove peaks that overlap with lower significance values. We do not merge, but you are free to merge using bedtools though!
We reimplement part of the iterative approach from ArchR. You would need to sort by a significance values normalized by cell type. (https://www.archrproject.com/bookdown/the-iterative-overlap-peak-merging-procedure.html). There is re-normalization, but unfortunately I do not have intuition of how this renormalization process works. Here is the function if you are interested though: addReproduciblePeakSet(), This is a part of BPCells that is frankly under-developed, and I will add this to my priority list. You can probably create a minimal ArchR project with with the sole purpose of peak calling.
It seems like in these cases, the peaks adequately are called on summits. I do still worry on the sensitivity side. I would still probably suggest using default params and filtering by blacklist though.
Understandably, this is not the answer you are probably looking for for the last question, and I'm sorry I can't help much further until I get a little bit more knowledge. I'll see what I can do to get this functionality up and running asap.
No Problem, thank you for taking the time to answer all my questions!! Thanks again!