Design and Analysis of RNA Sequencing Data

This is the associated GitHub page for the book chapter (please cite this work when using our code from this repo):

Richa Bharti, Dominik G Grimm
Design and Analysis of RNA Sequencing Data
In: Kappelmann-Fenzl M. (eds) Next Generation Sequencing and Data Analysis. Learning Materials in Biosciences. Springer, Cham. (https://doi.org/10.1007/978-3-030-62490-3_11).

A comprehensive outline of RNASeq data analysis including quality assessment, pre-processing alignment/mapping of reads, quantification of genes and isoforms and differential expression (DE) analysis is provided. Step wise optimized codes and data formats alongside brief description and references to the corresponding sections of the book chapter are also provided. Please go through the steps below before you begin to use the workflows.

OS

Any Linux based distro should work. Out test OS is:

Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic

lsb_release -a on a Ubuntu based system.

Hardware

It is no secret that the hardware highly influences the speed of the workflows. And just one same can generate ~10-60 Million reads. The most time consuming tasks are the ones that involve reference based alignment. A modest configuration consists of 16+cores and 16 GB of RAM with TB external hard drive or 48 TB server and 3.6 GHz clock speed if one wants to run it on one’s own workstation. A majority of the diskspace is occupied by reference . Our HW configuration consists of 20 core CPU with 128 GB.

Software and packages

All software should be installed by the user directly as the workflow depends on a lot of external software. Without these the workflow will fail to run.

gcc, g++
java
python 2 python3: pip
R version 3.6.2
git

Reference based analysis:

Get the workflow from git using:

git clone https://github.com/grimmlab/BookChapter-RNA-Seq-Analyses.git

After cloning the repository, you should have the following files in your repo directory:

$ls -1
Rscripts
differential_expression.sh
download_genome_and_annotation.sh
download_install_tools.sh
download_RNASeq_reads_and_QC.sh
metadata.txt
project_config.sh
quantification.sh
reference_based_alignment.sh

Workflow organization:

The workflow centered around two variables which contain the project name and the project path. The project name is the root directory of the project under which the project relevant files are stored. The project path is the path on the system where the project root directory is created. These two variables can be set in a configuration file called as project_conf.sh. The default values of the project_conf.sh are shown here:

PROJECT_NAME=RNASeq_project
PROJECT_PATH=${PWD}

Under the defined project name root directory, four sub-directories are created. They are:

|-- RNASeq_project
|   | analysis
|   | data
|   | reads
|   | tools

analysis: This will contain the output of the analysis performed during each step of the workflow.
data: This will contain the genome, annotation and index genome files.
reads: This will contain the sequencing reads which is used as an input to perform the analysis.
tools: This will contain all the tools and software required to run the workflow.

Brief Description of the Each Step

Download and install tools and python libraries (download_install_tools.sh):

The first step for running the workflow is to download and install all the required tools used by the workflow. This can be done by running the shell script:
```
$./download_install_tools.sh
```
This will first create a working folder structure as shown in the above section. This script will download all the tools required to perform the analysis inside the tools folder as listed below:
```
~/RNASeq_project/tools$ls -1
DEXSeq
FastQC
STAR-2.7.3a
bedtools2
bowtie2
cufflinks
cutadapt
htseq
rseqqc
samtools
sratoolkit
tophat
```
Download Eukaryotic genome and its annotation (download_genome_and_annotation.sh)

Assumption here is that RNASeq data is from mouse or human. The genomes of these species are available and well annotated. The genome file is in fasta format and the annotation file is in gtf/gff format. This script will download the current gencode genome (default: mouse) and its annotation into the data folder. Currently, the latest version of gencode is used for the mouse genome. The version can be changed by modifying GENOME_VERSION parameter and the genome can be changed to human by modifying GENOME_TYPE in the project_conf.sh file:
```
GENOME_TYPE=mouse # or human
GENOME_VERSION=24 #or 33 for human
```
This can be done by running the shell script:
```
$./download_genome_and_annotation.sh
```
After the execution of the script, the data folder will contain:
```
~/RNASeq_project/data$ls -1
mouse-gencode-version-24
```
Output:
```
~/RNASeq_project/data/mouse-gencode-version-24$ls -1
GRCm38.p6.genome.fa
GRCm38.p6.genome.fa.gz
gencode.vM24.2wayconspseudos.gff3
gencode.vM24.2wayconspseudos.gff3.gz
gencode.vM24.annotation.gff3
gencode.vM24.annotation.gff3.gz
gencode.vM24.annotation.gtf
gencode.vM24.annotation.gtf.gz
gencode.vM24.long_noncoding_RNAs.gff3
gencode.vM24.long_noncoding_RNAs.gff3.gz
gencode.vM24.tRNAs.gff3
gencode.vM24.tRNAs.gff3.gz
genome_annoatation_md5sum.txt
```
Download the data and perform quality control (download_RNASeq_reads_and_QC.sh)

This step is used to download an example data to perform the RNASeq analysis. This can be done by running the shell script:
```
$./download_RNASeq_reads_and_QC.sh
```
This script contains three main sub-steps:

a. Downloading the publically available data

Publically available data from PMID: 28738885 is used for the analysis purpose. In the first step data is downloaded using fastq-dump binary from SRA toolkit in the tools folder.

The fastq-dump tool uses the following parameters:
```
fastq-dump \
   ${SRA_ID} \
   --split-files \
   --origfmt \
   --gzip
```
Description:

fastq-dump is the binary

${SRA_ID} are list of fastq files. These SRA files will be downloaded one at a time

--split-files parameter will produce two reads if the data is paired end else single file

--origfmt parameter contains only original sequence name

--gzip will provide the compressed output using gzip

Output: The fastq-dump downloads the following files into the reads/rawreads folder:
```
~/RNASeq_project/reads/rawreads$ls -1
SRR5858228_1.fastq.gz
SRR5858229_1.fastq.gz
SRR5858230_1.fastq.gz
SRR5858231_1.fastq.gz
SRR5858232_1.fastq.gz
SRR5858233_1.fastq.gz
SRR5858234_1.fastq.gz
SRR5858235_1.fastq.gz
SRR5858236_1.fastq.gz
```
NOTE! To read more about fastq-dump parameters click here.

b. Quality assessment of downloaded data:

The downloaded SRA fastq files are then assessed for sequence quality using fastqc program.

The fastqc tool uses the following parameters:
```
fastqc \
   SRR5858228_1.fastq.gz \
   -o rawreads_QA_stats
```
Description:

fastqc is the binary.

SRR5858229_1.fasta.gz is one example file as an input for quality assessment. This is run in a for loop which will run on all downloaded samples.

-o is path to output directory, here it is reads_QA_stats .

Output: The fastqc will generate the output for all downloaded files into the reads/rawreads_QA_stats folder:
```
~/RNASeq_project/reads/rawreads_QA_stats$ls -1
SRR5858228_1_fastqc.html
SRR5858228_1_fastqc.zip
SRR5858229_1_fastqc.html
SRR5858229_1_fastqc.zip
SRR5858230_1_fastqc.html
SRR5858230_1_fastqc.zip
SRR5858231_1_fastqc.html
SRR5858231_1_fastqc.zip
SRR5858232_1_fastqc.html
SRR5858232_1_fastqc.zip
SRR5858233_1_fastqc.html
SRR5858233_1_fastqc.zip
SRR5858234_1_fastqc.html
SRR5858234_1_fastqc.zip
SRR5858235_1_fastqc.html
SRR5858235_1_fastqc.zip
SRR5858236_1_fastqc.html
SRR5858236_1_fastqc.zip
```
Note! For more details about fastqc, read section XXX in book and for parameters check FastQC

c. Quality control of downloaded data:

Based on the assessment of the report quality control (QC) is performed using the cutadapt tool.

The cutadapt tool uses the following parameters:
```
cutadapt \
   -q 20 \
   --minimum-length 35 \
   -a "CTGTCTCTTATACACATCT" \
   -o reads/filtered_reads/SRR5858228_1_trimmed.gz \
   SRR5858229_1.fastq.gz \ #Input file name
   > SRR5858229_1_cutadapt_stats.txt # Output stats for given file
```
Description:

-a is the adapter sequence for trimming

--minimum-length is the minimum sequence length (35)

-q phred score of 20

-o is the output file name

Output: The cutadapt will generated trimmed files into the reads/filtered_reads folder:
```
~/RNASeq_project/reads/filtered_reads$ls -1
SRR5858228_1_trimmed.gz
SRR5858229_1_trimmed.gz
SRR5858230_1_trimmed.gz
SRR5858231_1_trimmed.gz
SRR5858232_1_trimmed.gz
SRR5858233_1_trimmed.gz
SRR5858234_1_trimmed.gz
SRR5858235_1_trimmed.gz
SRR5858236_1_trimmed.gz
```
Note! Details of quality control and how to choose the parameter are chosen is explained in chapter (XXX). To understand parameter in detail check Cutadapt.

Perform reference alignment using genome sequence (reference_based_alignment.sh)

Aligning reads to genome can be challenging due to millions of reads, small read size (2nd generation sequencing), sequencing errors and variation leads to mismatches and indels. In addition to that eukaryotic genome have introns , this means when genome is used a reference, aligner is aware of splice junction in the reference.

This step will perform reference based RNASeq alignment. This can be done by running the shell script:

$./reference_based_alignment.sh

Currently, several splice aware tools are available. In this workflow we have used the two most commonly used tools TopHat2 and STAR. The choice of ALIGNER can be set in the project_conf.sh file:

ALIGNER=star #tophat or star

The table below shows the tools used for each alignment step for both the STAR and TopHat2 aligners.

Alignment Steps	STAR Alignment	TopHat2 Alignment
Genome indexing	STAR (a.)	bowtie2 (c.)
Reference based alignment	STAR (b.)	TopHat2 (d.)
BAM file indexing	samtools (e.)	samtools
Alignment stats	RseQC (f.)	RseQC
Visualization	IGV (g.)	IGV

a. Genome indexing using STAR:

STAR \
--runThreadN 8 \
--runMode genomeGenerate \
--genomeDir data/STAR_Genome_Index \
--genomeFastaFiles data/mouse-gencode-version-24/GRCm38.p6.genome.fa \
--sjdbGTFfile data/mouse-gencode-version-24/gencode.vM24.annotation.gtf

Description:

STAR is the binary

--runMode is set to genomeGenerate which generate genome files

--genomeDir is the path to output directory name

--genomeFastaFiles is the path to genome fasta file with file name

--sjdbGTFfile is the path to the annotation file with file name

--runThreadN is number of threads (default used is 8)

Output: The STAR genome indexing will generated indexed genome files into the data/STAR_Genome_Index folder:

~/RNASeq_project/data/STAR_Genome_Index$ls -1
Genome
SA
SAindex
chrLength.txt
chrName.txt
chrNameLength.txt
chrStart.txt
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
genomeParameters.txt
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab

Note! For more detail read section XXXX in the book and for parameters check STAR

b. Alignment using STAR:

STAR \
 --runMode alignReads \
 --runThreadN 8 \
 --genomeDir data/STAR_Genome_Index \
 --readFilesIn reads/filtered_reads/SRR5858228_1_trimmed.gz \
 --readFilesCommand zcat \
 --sjdbGTFfile data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \
 --outFileNamePrefix /analysis/mapping/star/SRR5858228_1_trimmed  \
 --outSAMtype BAM SortedByCoordinate

Description:

STAR is the binary

--runMode is alignReads which means map reads to the reference genome

--runThreadN is the number of threads (default used is 8)

--genomeDir is the path to genome indexed directory

--readFilesIn is the path to trimmed reads including file name

--readFilesCommand for gzipped files (*.gz) use zcat

--sjdbGTFfile is the path annotation file with file name

--outFileNamePrefix is output prefix name with its path

--outSAMtype is the output sorted by coordinate, similar to samtools sort command

Output: The STAR genome alignment will generated in files into the analysis/mapping/star folder:

This is the output for one sample, similar files will be generated for other samples as well

~/RNASeq_project/analysis/mapping/star$ls -1
Percentage_uniquely_mapped_reads.csv
SRR5858228_1_trimmed.Aligned.sortedByCoord.out.bam
SRR5858228_1_trimmed.Aligned.sortedByCoord.out.bam.bai
SRR5858228_1_trimmed.Log.final.out
SRR5858228_1_trimmed.Log.out
SRR5858228_1_trimmed.Log.progress.out
SRR5858228_1_trimmed.SJ.out.tab
SRR5858228_1_trimmed._STARgenome

SRR5858228_1_trimmed.Aligned.sortedByCoord.out.bam is aligned bam file
SRR5858228_1_trimmed.SJ.out.tab is a tab-delimited file that provides information about alignments to splice junctions
Percentage_uniquely_mapped_reads.csv file represents the uniquely mapped reads

These three files are as names suggests provides the information about ongoing samples alignment. Out of these file with Log.final.out extension provides the complete mapping stats

SRR5858228_1_trimmed.Log.final.out
SRR5858228_1_trimmed.Log.out
SRR5858228_1_trimmed.Log.progress.out

Note! For more detail read section XXXX in the book and for parameters check STAR

c. Genome indexing using bowtie for TopHat2:

TopHat2 uses bowtie2 (which is gapped aligner) for genome indexing

bowtie2-build \
   -f data/mouse-gencode-version-24/GRCm38.p6.genome.fa \
   /data/TOPHAT_Genome_Index/ \
   -p 8

Description:

bowtie2-build is the binary

-f is the path to genome fasta file with file name

-p is to launch a specified number of parallel search threads

TOPHAT_Genome_Index is the output directory created in data folder

Output: The bowtie-build genome indexing will generated indexed genome files into the data/TOPHAT_Genome_Index folder:

~/RNASeq_project/data/TOPHAT_Genome_Index$ls -1
GRCm38.p6.genome.1.bt2
GRCm38.p6.genome.2.bt2
GRCm38.p6.genome.3.bt2
GRCm38.p6.genome.4.bt2
GRCm38.p6.genome.fa
GRCm38.p6.genome.rev.1.bt2
GRCm38.p6.genome.rev.2.bt2

Note! Bowtie2 itself is not able to perform spliced alignments. Bowtie2 is known for its speed and small-memory efficiency because it index the reference genome using an FM index which is a Burrows–Wheeler transform method. For more detail read section XXXX in the book.

d. Alignment using TopHat2:

tophat \
   -o /analysis/mapping/tophat \
   -p 8 \
   -G data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \
   /data/TOPHAT_Genome_Index/GRCm38.p6.genome \
   reads/filtered_reads/SRR5858228_1_trimmed.gz

Description:

tophat is the binary

-o is the name of the directory in which TopHat will write all of its outputs

-p is the number of threads to align reads (default used is 8)

-G is the path to the annotation file with file name

/data/TOPHAT_Genome_Index/GRCm38.p6.genome is the path and prefix name of genome fasta file

reads/filtered_reads/SRR5858228_1_trimmed.gz is the path to trimmed reads including file name

Output: The tophat genome alignment will generated in files into the analysis/mapping/tophat folder

TopHat generated many files as listed below. Here is example output for one SRA ID SRR5858228:

SRR5858228_1_trimmed_accepted_hits.bam represents the alignment file in bam format.
SRR5858228_1_trimmed_junctions.bed consists of a list of exon junctions in BED formation. An exon junction comprises two blocks where either block is as long as the longest overhang of any read present in the junction sequence. The score is the number of alignments identified for a junction sequence.
SRR5858228_1_trimmed_insertions.bed consists of a list of discovered insertions. In each case chromLeft represents the last genomic base before individual insertions.
SRR5858228_1_trimmed_deletions.bed consists of a list of discovered deletions. In each case chromLeft represents the last genomic base before individual deletions.
SRR5858228_1_trimmed_align_summary.txt contains a summary of alignment rates along with the number of reads and pairs showing multiple alignments.

Note! For more detail read section XXXX in the book and for parameters check TopHat2.

e. BAM file indexing for both STAR or TopHat2:

samtools \
   index analysis/mapping/star/SRR5858228_1_trimmed.bam

Description:

samtools is the binary

index will index a coordinate-sorted BAM file for fast random access

Output: This command will index all the bam files and create .bai files in the respective mapping folder.

Note! For more detail read section XXXX in the book and for parameters check samtools.

f. Generate stats for alignment:

python3 bam_stat.py \
   -q 30 \
   -i analysis/mapping/star/SRR5858228_1_trimmed.bam >  \
   analysis/mapping/star/SRR5858228_1_trimmed_RSeQC_alignment_stats.txt

Description:

bam_stat.py is the python script to calculate bam stats

-i input bam file with its path

-q is the mapping quality to determine uniquely mapped read

RseQC produces table in the respective mapping folder where unique reads are considered if their mapping quality is more than 30.

Output: SRR5858228_1_trimmed.Aligned.sortedByCoord.out._RSeQC_alignment_stats.txt file will be created for SRA sample ID SRR5858228.

#============== # All numbers are READ count #==============
Total records:	72432644

QC failed:	0
Optical/PCR duplicate:	0
Non primary hits	11660623
Unmapped reads:	0
mapq < mapq_cut (non-unique):	5935125

mapq >= mapq_cut (unique):	54836896
Read-1:	0
Read-2:	0
Reads map to '+':	27245688
Reads map to '-':	27591208
Non-splice reads:	42688777
Splice reads:	12148119
Reads mapped in proper pairs:	0
Proper-paired reads map to different chrom:	0

Note! For more detail read section XXXX in the book and for parameters check RseQC.

g. Visualization of alignment (BAM) files:

The indexed bam files along with genome and annotation can be loaded into several genome browsers, including the Integrative Genomics Viewer IGV IGV . Explaining this is beyond the scope of this chapter and recommend to go through https://software.broadinstitute.org/software/igv/UserGuide.

Quantification (quantification.sh)

This step will perform quantification on alignment files generated form reference based RNASeq alignment. This can be done by running the shell script:
```
$./quantification.sh
```
The three ways to perform quantification is XXXXXX aware tools are available. In this workflow we have used the two most commonly used tools XXXXX. The choice of QUANT_COUNT can be set in the project_conf.sh file:
```
QUANT_COUNT=per_gene_bedtools # per_gene_htseq or per_transcript_cufflinks or per_exon_dexseq
```
Quantification methods (Counting reads) Tools

per gene bedtools (a.)

per gene HTSeq (b.)

per transcript Cufflinks (c.)

per exon DEXSeq (d.)

a. Counting reads per gene using bedtools:
```
bedtools intersect \
   -S \
   -wa \
   -c \
   -a genes_only.gff3 \
   -b analysis/mapping/star/SRR5858228_1_trimmed.bam \
   > analysis/quantification/bedtools-count/SRR5858228_1_trimmed_counts.csv
```
Description:

bedtools is the binary

intersect allows one to screen for overlaps between two sets of genomic features

-S is require for different strandedness

-c is for each entry in genomic features A, report the number of hits in genomic features B while restricting to -f (-f is for minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp))

-a is gff annotation file where each genomic features in A is compared to genomic features B in search of overlaps

-b is the input bam file with its path

Output: Output will be generated in analysis/quantification/bedtools-count folder for each file. In the end all the bedtools generated output files are merged to created final file Read_per_features_combined.csv in same folder which will be the input for differential expression analysis.

Note! For more details about bedtools read section XXX in the book and for parameters check bedtools

b. Counting reads per gene using HTSeq:
```
htseq-count \
   -f bam \
   -a 10 \
   -m intersect-strict \
   -s no \
   -t exon \
   -i gene_id \
   analysis/mapping/star/SRR5858228_1_trimmed.bam \
   data/mouse-gencode-version-24/gencode.vM24.annotation.gtf > \
   analysis/quantification/htseq-count/SRR5858228_1_trimmed_counts.csv
```
Description:

htseq-count is the binary

-f is the format of the input data.

-a will skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10).

-m is mode to handle reads overlapping more than one feature. In this case its intersect-strict.

-s is set whether the data is from a strand-specific assay. For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature

-t is the feature type (3rd column in GFF file) to be used, all features of other type are ignored (default, suitable for RNA-Seq analysis using an GENCODE GTF file: exon)

-i is the attribute to be used as feature ID. The default, suitable for RNA-Seq analysis using an GENCODE GTF file, is gene_id.

Output: The outputs will be generated in analysis/quantification/htseq-count folder. In the end all the HTSeq generated output files are combined to created final file Read_per_features_combined.csv in same folder which will be used as an input for differential expression analysis.

Note! For more details about HTSeq read section XXX in the book and for parameters check HTSeq

c. Counting reads per transcript using cufflinks:
```
cufflinks \
   -G data/mouse-gencode-version-24/gencode.vM24.annotation.gtf  \
   -b data/mouse-gencode-version-24/GRCm38.p6.genome.fa \
   -p 8 \
   analysis/mapping/star/SRR5858228_1_trimmed.bam \
   -o analysis/quantification/cufflinks-count
```
Description:

cufflinks is the binary

-G is the path to the annotation file with file name

-b is the path to genome fasta file with file name

-p is the number threads

-o is the name of the directory containing all of its generated output

Output: The outputs will be generated in analysis/quantification/cufflinks-count folder. The output consists of transcript and gene-level FPKM-tracking files, which contain FPKM values and their confidence intervals. FPKM tracking files are also produced when a set of samples is tested for differential expression using Cuffdiff.

Note! For more details about Cufflinks read section XXX in the book and for parameters check Cufflinks.

d. Counting reads per exon using DEXSeq:

The initial steps of a DEXSeq analysis are done using two Python scripts. Importantly, these preprocessing steps can also be done using tools equivalent to these Python scripts, for example, using GenomicRanges infrastructure (10.2) or Rsubread (10.3). The following two steps describe how to do this steps using the Python scripts that are provided within DEXSeq.

Preparing the annotation
```
python2 dexseq_prepare_annotation.py \
   data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \
   analysis/quantification/dexseq-count/DEXSEQ_GTF_annotation.gff
```
The Python script dexseq_prepare_annotation.py takes an GTF file and translates it into a GFF file with collapsed exon counting bins. For more details see Figure 1 of the DEXSeq paper (Anders, Reyes, and Huber (2012)) for an illustration.

Counting reads
```
python2 dexseq_count.py \
   -p no \
   -s no \
   -r name \
   analysis/quantification/dexseq-count/DEXSEQ_GTF_annotation.gff \
   -f bam \
   -a 10 \
   analysis/mapping/star/SRR5858228_1_trimmed.bam \
   analysis/quantification/dexseq-count/SRR5858228_1_trimmed_exon_counts.csv
```
Description:

-p is to set the paired end information. If the data is from a paired-end sequencing run, you need to add the option -p yes

-s If you have used a library preparation protocol that does not preserve strand information (i.e., reads from a given gene can appear equally likely on either strand), you need to inform the script by specifying the option -s no

-r is to indicate whether your data is sorted by alignment position or by read name

-f input reads format

-a to specify the minimum alignment quality. All reads with a lower quality than specified (with default -a 10) are skipped.

Output: The outputs for both the steps will be generated in analysis/quantification/dexseq-count folder with files for all the samples.

Note! For more details about DEXSeq read section XXX in the book and for parameters check DEXSeq.

Quantification methods (Counting reads)	Tools
per gene	bedtools (a.)
per gene	HTSeq (b.)
per transcript	Cufflinks (c.)
per exon	DEXSeq (d.)

Differential Expression (differential_expression.sh)

Differential expression (DE) analysis refers to the identification of genes (or other types of genomic features, such as, transcripts or exons) that are expressed in significantly different quantities in distinct groups of samples, be it biological conditions (drug-treated vs. controls), diseased vs. healthy individuals, different tissues, different stages of development, or something else. This step will perform quantification on alignment files generated form reference based RNASeq alignment. This can be done by running the shell script:

   $./differential_expression.sh

The choice of QUANT_COUNT set in the project_conf.sh file determines the differential expression tool. The table below shows the differential expression tool used for the generated count data:

Generated count data	Differential expression tools
per gene for bedtools output	DESeq2 (a.) -> Differentially expressed genes
per gene for HTSeq output	DESeq2 -> Differentially expressed genes
per exon for DEXSeq output	DEXSeq (b.) -> Differentially expressed exon
per transcript for cufflinks output	cuffdiff (c.) -> Differentially expressed isoforms

a. Differential expression for count data generated using bedtools/HTSeq:

This analysis of differential expression (DE) will identify of genes that are expressed in significantly different quantities in distinct groups biological conditions (vehicle_treated vs. drug_treated).

Rscript DE_deseq_genes_bedtools.R \
   --genecount analysis/quantification/bedtools-count/Read_per_features_combined.csv \
   --metadata metadata.txt \
   --condition condition \
   --outputpath analysis/DE/deseq2/

Description:

--genecount is combined generated from bedtools or htseq along with its path

--metadata is the sample information file

--condition is the column in the metadata data file that will be used for pairwise comparision using DESeq2.

--outputpath is the path to the output directory where the result table and plots will be generated.

Output: Differentially expressed outfiles for bedtools and HTSeq are generated in analysis/DE/deseq2/

~/RNASeq_project/analysis/DE/deseq2$ls -1
PCA-samples_bedtools.pdf
diffexpr-maplot_bedtools.pdf
pvalue_histogram_50_bedtools.pdf
qc-dispersions_bedtools.pdf
qc-heatmap-samples_bedtools.pdf
vehicle_treated_vs_drug_treated_diffexpr-results_bedtools.csv

vehicle_treated_vs_drug_treated_diffexpr-results_bedtools.csv is the final table from deseq2 for the pairwise comparison between the two conditions with stats. The counts are normalized to rlog normalization.
PCA-samples_bedtools.pdf XXXXXXXXXXXXXXXXXX
diffexpr-maplot_bedtools.pdf XXXXXXXXXXXXXXXXXX
pvalue_histogram_50_bedtools.pdf XXXXXXXXXXXXXXXXXX
qc-dispersions_bedtools.pdf XXXXXXXXXXXXXXXXXX
qc-heatmap-samples_bedtools.pdf XXXXXXXXXXXXXXXXXX

Note! For more details about differential expression read section XXX in the book and for analysis details check DESeq2.

b. Differential expression for count data generated using DEXSeq:

Rscript DE_dexseq_exon.R \
  --exoncountpath analysis/quantification/dexseq-count/ \
  --gffFile analysis/quantification/dexseq-count/DEXSEQ_GTF_annotation.gff \
  --metadata metadata.txt \
  --outputpath analysis/DE/dexseq/

Description:

--exoncountpath is the dexseq generated exon count file path

--gffFile is the dexseq generated gff file with its path

--metadata is the sample information file

--outputpath is the path to the output directory where the result table and plots will be generated.

Output: Differentially expressed output files for bedtools and HTSeq are generated in analysis/DE/dexseq/

~/RNASeq_project/analysis/DE/dexseq2$ls -1
diffexpr-results_dexseq.csv
dispersion_plot_dexseq.pdf
ma_plot_dexseq.pdf

diffexpr-results_dexseq.csv is the final table from deseq2 for the pairwise comparison between the two conditions with stats.
dispersion_plot_dexseq.pdf
ma_plot_dexseq.pdf XXXXX

Note! For more details about differential expression read section XXX in the book and for analysis details check DEXSeq

c. Differential expression for count data generated using cufflinks:

cuffdiff \
 -o analysis/DE/cuffdiff \
 -L vt,dt \
 --FDR 0.01 \
 -u data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \
 -p 8 \
 BAMLIST

Description:

-o path to the output folder

-L lists the labels to be used as “conditions”

-FDR cutoff for false discovery rate for the DE analysis

-u path to annotation file

-p is the number threads

BAMLIST is list of all bam file in comma format

In the end these script will create a folders and sub-folders based on the choice of analysis on the current data as shown below:

|-- RNASeq_project
|   | analysis
|   |   |-- mapping
|   |   |   |-- tophat/star
|   |   |-- quantification
|   |   |   |-- bedtools-count/cufflinks-count/dexseq-count/htseq-count
|   |   |-- DE
|   |   |   |-- deseq2/dexseq/cuffdiff
|   | data
|   |   |-- mouse-gencode-version-24/
|   |   |-- STAR_Genome_Index/TOPHAT_Genome_Index/
|   | reads
|   |   |-- rawreads
|   |   |-- rawreads_QA_stats
|   |   |-- filtered_reads
|   | tools
|   |   |-- DEXSeq/FastQC/STAR-2.7.3a/bedtools2/bowtie2/cufflinks/cutadapt/htseq/rseqqc/samtools/sratoolkit/tophat

BookChapter-RNA-Seq-Analyses
BookChapter-RNA-Seq-Analyses copied to clipboard

Metadata

Design and Analysis of RNA Sequencing Data

OS

Hardware

Software and packages

Reference based analysis:

Get the workflow from git using:

Workflow organization:

Brief Description of the Each Step

Download and install tools and python libraries (download_install_tools.sh):

Download Eukaryotic genome and its annotation (download_genome_and_annotation.sh)

Download the data and perform quality control (download_RNASeq_reads_and_QC.sh)

Perform reference alignment using genome sequence (reference_based_alignment.sh)

Quantification (quantification.sh)

Differential Expression (differential_expression.sh)

← Metadata

Owner

Metadata

BookChapter-RNA-Seq-Analyses BookChapter-RNA-Seq-Analyses copied to clipboard

Metadata

Design and Analysis of RNA Sequencing Data

OS

Hardware

Software and packages

Reference based analysis:

Get the workflow from git using:

Workflow organization:

Brief Description of the Each Step

Download and install tools and python libraries (download_install_tools.sh):

Download Eukaryotic genome and its annotation (download_genome_and_annotation.sh)

Download the data and perform quality control (download_RNASeq_reads_and_QC.sh)

Perform reference alignment using genome sequence (reference_based_alignment.sh)

Quantification (quantification.sh)

Differential Expression (differential_expression.sh)

← Metadata

Owner

Metadata

BookChapter-RNA-Seq-Analyses
BookChapter-RNA-Seq-Analyses copied to clipboard