CD4-csaw

Reproducible reanalysis of a combined ChIP-Seq & RNA-Seq data set

This is the code for a re-analysis of a GEO dataset that I originally analyzed for this paper. The re-analysis uses statistical methods that were not yet available at the time, such as the csaw Bioconductor package, which provides a principled way to normalize windowed counts of ChIP-Seq reads and test them for differential binding. The original paper only analyzed binding within pre-defined promoter regions. The RNA-Seq analysis has also been improved using newer features of limma, such as quality weights.
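To make the idea of "windowed counts" concrete, here is a minimal pure-Python sketch of counting reads in sliding windows along one chromosome. It is only an illustration and not how csaw does it: csaw is an R package that works on aligned reads, and the function name `count_in_windows`, the `width` and `spacing` defaults, and the example positions are all assumptions made for this sketch.

```python
from collections import Counter


def count_in_windows(read_starts, chrom_length, width=150, spacing=50):
    """Count reads whose start position falls inside each sliding window.

    Windows start at 0, spacing, 2 * spacing, ... and cover the half-open
    interval [start, start + width). Windows overlap when spacing < width,
    so a single read can contribute to several windows.
    Returns a dict mapping window start -> read count (empty windows omitted).
    """
    counts = Counter()
    for pos in read_starts:
        if not 0 <= pos < chrom_length:
            continue  # ignore positions that fall outside the chromosome
        # Window k contains pos when k * spacing <= pos < k * spacing + width.
        first = max(0, (pos - width) // spacing + 1)
        last = pos // spacing
        for k in range(first, last + 1):
            counts[k * spacing] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical read start positions on a 1 kb toy chromosome.
    print(count_in_windows([10, 150, 160], chrom_length=1000))
```

A differential binding analysis would then compare such per-window counts between conditions after normalization, which is the part csaw handles in the actual workflow.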

This workflow downloads the sequence data and sample metadata from the public GEO/SRA release, so anyone can download and run this code to reproduce the full analysis.

Workflow

Rule Graph

Completed components

  • ChIP-seq
    • Mapping with bowtie2
    • Peak calling with MACS2 and Epic
    • Fetching of blacklists from UCSC
    • Generation of greylists from ChIP-Seq input samples
    • IDR analysis of blacklist-filtered peak calls
    • Computation of cross-correlation function for ChIP-Seq samples, excluding blacklisted regions
    • Counting in windows across the genome
  • RNA-seq
    • Mapping with STAR & HISAT2
    • Counting reads aligned to genes
    • Alignment-free bias-corrected transcript quantification using Salmon & Kallisto
    • Differential gene expression
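The blacklist filtering mentioned in the ChIP-seq steps above can be sketched as a simple interval-overlap test. This is an illustrative sketch only, not the code used in this workflow: the function names and the coordinates are hypothetical, and intervals are assumed to be BED-style `(chrom, start, end)` tuples with half-open coordinates.

```python
def overlaps(a, b):
    """Return True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_blacklisted(peaks, blacklist):
    """Keep only the peaks that do not overlap any blacklisted region.

    This is a quadratic scan, which is fine for a toy example; a real
    pipeline would use an interval tree or a dedicated genomics tool.
    """
    return [p for p in peaks if not any(overlaps(p, b) for b in blacklist)]


if __name__ == "__main__":
    # Hypothetical peaks and blacklist regions, for illustration only.
    peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    blacklist = [("chr1", 150, 300)]
    print(filter_blacklisted(peaks, blacklist))
```

Only the first peak is dropped here: it shares a chromosome with the blacklisted region and their coordinate ranges intersect, while the `chr2` peak with the same coordinates is kept.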

Possible TODO components

  • Integrating RNA-seq and ChIP-seq
    • hiAnnotator: http://bioconductor.org/packages/devel/bioc/html/hiAnnotator.html
    • ChIPseeker: http://bioconductor.org/packages/devel/bioc/html/ChIPseeker.html
    • mogsa: http://bioconductor.org/packages/release/bioc/html/mogsa.html
  • Gene set tests
    • ToPASeq: http://bioconductor.org/packages/devel/bioc/html/ToPASeq.html
    • mvGST: http://bioconductor.org/packages/devel/bioc/html/mvGST.html
    • mgsa: http://bioconductor.org/packages/release/bioc/html/mgsa.html
  • QC Stuff
    • ChIPQC: http://bioconductor.org/packages/release/bioc/html/ChIPQC.html
    • MultiQC: http://multiqc.info/
    • Rqc: http://www.bioconductor.org/packages/devel/bioc/html/Rqc.html
  • mixOmics: http://mixomics.org/
  • ica: https://cran.rstudio.com/web/packages/ica/index.html
  • Motif enrichment
  • pcaExplorer: https://bioconductor.org/packages/release/bioc/html/pcaExplorer.html

TODO Code cleanup

  • Remove unnecessary library() calls
  • Put spaces around equals signs

TODO Other

  • Document how to run the pipeline
  • Provide an install script for R & Python packages

Dependencies

Command-line tools

Programming languages and packages

  • R, Bioconductor, and the following R packages:
    • From CRAN: assertthat, doParallel, dplyr, future, getopt, GGally, ggforce, ggfortify, ggplot2, ks, lazyeval, lubridate, magrittr, MASS, Matrix, openxlsx, optparse, parallel, purrr, RColorBrewer, readr, reshape2, rex, scales, stringi, stringr
    • From Bioconductor: annotate, Biobase, BiocParallel, BSgenome.Hsapiens.UCSC.hg19, BSgenome.Hsapiens.UCSC.hg38, ChIPQC, csaw, edgeR, GenomicFeatures, GenomicRanges, GEOquery, limma, org.Hs.eg.db, Rsamtools, Rsubread, rtracklayer, S4Vectors, SRAdb, SummarizedExperiment, TxDb.Hsapiens.UCSC.hg19.knownGene, tximport
    • Installed manually: sleuth, wasabi
  • Python 3 and the following Python packages: biopython, atomicwrites, numpy, pandas, plac, pysam, rpy2, snakemake
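Until an install script exists, a quick way to see which of the Python packages listed above are missing is to probe for them with the standard library. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the mapping assumes each package is imported under its own name, except biopython, which is imported as `Bio`.

```python
from importlib.util import find_spec

# Package name as listed above -> top-level module name used to import it.
PYTHON_DEPENDENCIES = {
    "biopython": "Bio",
    "atomicwrites": "atomicwrites",
    "numpy": "numpy",
    "pandas": "pandas",
    "plac": "plac",
    "pysam": "pysam",
    "rpy2": "rpy2",
    "snakemake": "snakemake",
}


def missing_packages(deps=PYTHON_DEPENDENCIES):
    """Return the sorted names of packages whose module cannot be found."""
    return sorted(name for name, module in deps.items() if find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All listed Python packages were found.")
```

This only checks that each module can be located, not that its version is compatible with the workflow.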