CEMBA_wmb_snATAC
CEMBA_wmb_snATAC copied to clipboard
Whole mouse brain snATAC seq analysis
- CEMBA_wmb_snATAC This repository is used for the whole mouse brain (wmb) snATAC-seq data analysis of Center for Epigenomics of the Mouse Brain Atlas (CEMBA), which is now accepted by Nature 2023.
[[./repo_figures/GraphAbstract.jpg]]
- Change Log
- 2024-09-25: add explanation in Discusssion for 4D and 4E dissections.
- 2024-10-19: add google drive link for 2.3 million cell meta data.
- Important Note All the analysis and the h5ad data generated are from SnapATAC2 under <= 2.4.0 There are some [[https://kzhang.org/SnapATAC2/changelog.html][break changes]] later after SnapATAC2 >= 2.5.0
** Data
- Our dissection 4D annotated in the meta data, should be 4E; and 4E should be 4D.
- This might be a label issue during experiment record. We are not that sure.
- Check the disccusion for details info: https://github.com/beyondpie/CEMBA_wmb_snATAC/discussions/20 .
- In our repository, we keep everything now unchanged. So if you need 4D region, you should give 4E a look and vice versa.
- Demultiplexed data can be accessed via the NEMO archive (NEMO, RRID:SCR_016152) at https://assets.nemoarchive.org/dat-bej4ymm (the raw directory under Source Data URL in this archive).
- We also uploaded our demultiplexed fastq files and processed files
under the GEO accession number GSE246791:
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246791
- Processed data is also available on our web portal and can be explored here: http://www.catlas.org.
- A Google drive link for bigwig files and SnapATAC2 files just for backup:
- https://drive.google.com/drive/folders/1mule-CkxN939mxhYAqpOIxjPpq0j8qVP?usp=sharing
- A Google drive link for bigwig files and SnapATAC2 files just for backup:
- All the cellmeta information:
- Google drive link: https://drive.google.com/drive/folders/1UNeOar5UuBNrIjB_z53s25nrm81arXPW?usp=share_link
** Pipeline - We now have 234 samples and 2.3 million cells in total. So most of the analysis are depend on Snakefile to organize the pipeline and submit them to high-performance cluster (HPC) in order to use hundreds of CPUs at the same time. - R, Shell and Python (>= 3.10) are mainly used, especially R (>= 4.2). - Under the directory [[./package][package]], we put lots of common functions there. - We mainly use [[https://github.com/kaizhang/SnapATAC2][SnapATAC2]] to analyze the single-nucleus ATAC-seq data - Comparation between Scrublet and AMULET: https://github.com/yuelaiwang/CEMBA_AMULET_Scrublet - The deep learning related codes now in the repo: https://github.com/yal054/mba_dl_model - sa2 is short for SnapATAC2 in this repo.
[[./repo_figures/snATAC-seq_analysis_pipeline.jpg]]
** Codes
*** Clustering
In total, we have implemented four-round iterative clustering.
See details in [[file:01.clustering][01.clustering]]
*** Integration and annotation
We use Allen's scRNAseq data and their annotations for our data annotation.
See details in [[file:02.integration][02.integration]]
*** Peak calling
We use macs2 with multiple stage filtering, especially use SPM >= 5
for filtering peaks.
See details in [[file:03.peakcalling][03.peakcalling]]
*** Comments on some scripts:
- [[file:package/R/cembav2env.R][cembav2env.R]]: R env to store the metadata during analysis. |----------------+-------------------------------------------------------| | Enviorment | Description | |----------------+-------------------------------------------------------| | cembav2env | meta data of SnapATAC and SnapATAC2 | |----------------+-------------------------------------------------------| | cluSumBySa2 | clustering meta data, such as resolution, | | | barcode to L4 Ids, L4 major regions and so on | |----------------+-------------------------------------------------------| | Sa2Integration | Integration meta data, like Allen's data descriptions | |----------------+-------------------------------------------------------| | Sa2PeakCalling | Peak calling meta data | |----------------+-------------------------------------------------------|