layout: default
title: Home
nav_order: 1
description: "Quick start guides for bioinformatics programs, with video demonstrations and scripts."
permalink: /

Bioinformatics Notebook

This project provides introductions to various bioinformatics tools with short guides, video demonstrations, and scripts that tie these tools together. The documents in this project can be read locally in a plain-text editor, or viewed online at https://rnnh.github.io/bioinfo-notebook/. If you are not familiar with using programs from the command line, begin with the page "Introduction to the command line". If you have any questions, or spot any mistakes, please submit an issue on GitHub.

Pipeline examples
Contents
Installation instructions
Repository structure

Pipeline examples

These bioinformatics pipelines can be carried out using the scripts and tools described in this project. Input files for some of these scripts can be specified on the command line; other scripts need to be edited to fit the given input data.

SNP analysis

1. FASTQ reads from whole genome sequencing (WGS) can be assembled using SPAdes.
2. Sequencing reads can be aligned to this assembled genome using bowtie2.
3. The script snp_calling.sh aligns sequencing reads to an assembled genome and detects single nucleotide polymorphisms (SNPs). This will produce a Variant Call Format (VCF) file.
4. The proteins in the assembled reference genome (the genome to which the reads are aligned) can be annotated using genome_annotation_SwissProt_CDS.sh.
5. The genome annotation GFF file can be cross-referenced with the VCF file using annotating_snps.R. This will produce an annotated SNP format file.
6. Annotated SNP format files can be cross-referenced using annotated_snps_filter.R. For two annotated SNP files, this script will produce a file with annotated SNPs unique to the first file, and a file with annotated SNPs unique to the second file.
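
The last two steps above amount to an interval lookup followed by a set difference. The Python sketch below illustrates that idea only; it is not the logic of annotating_snps.R or annotated_snps_filter.R. The record layouts, the function names annotate_snps and unique_snps, and the example values are all hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """One annotated region, as might be read from a GFF file (hypothetical layout)."""
    contig: str
    start: int  # 1-based, inclusive, as in GFF
    end: int    # 1-based, inclusive
    name: str


class Snp(NamedTuple):
    """One variant, as might be read from a VCF file (hypothetical layout)."""
    contig: str
    pos: int  # 1-based, as in VCF
    ref: str
    alt: str


def annotate_snps(snps, features):
    """Pair each SNP with the name of every feature whose interval contains it.

    SNPs that fall outside all features are dropped in this sketch.
    """
    annotated = set()
    for snp in snps:
        for feat in features:
            if feat.contig == snp.contig and feat.start <= snp.pos <= feat.end:
                annotated.add((snp, feat.name))
    return annotated


def unique_snps(first, second):
    """Return (only in first, only in second) for two sets of annotated SNPs."""
    return first - second, second - first


# Made-up example data, for illustration only.
features = [Feature("chrI", 100, 500, "geneA"), Feature("chrI", 800, 1200, "geneB")]
sample_1 = [Snp("chrI", 150, "A", "G"), Snp("chrI", 900, "C", "T")]
sample_2 = [Snp("chrI", 150, "A", "G"), Snp("chrI", 1000, "G", "A")]

only_1, only_2 = unique_snps(
    annotate_snps(sample_1, features), annotate_snps(sample_2, features)
)
print(sorted(only_1))
print(sorted(only_2))
```

The SNP at position 150 is shared by both samples, so it appears in neither output set. The nested loop is fine for a small example; a real genome annotation would likely call for an indexed lookup.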

RNA-seq analysis

1. fastq-dump_to_featureCounts.sh can be used to download RNA-seq reads from NCBI's Sequence Read Archive (SRA) and align them to a reference genome. This script uses fastq-dump or fasterq-dump to download the sequencing reads as FASTQ, aligns them to a reference FASTA nucleotide file, and uses featureCounts to count the reads assigned to each annotated feature.
2. Running fastq-dump_to_featureCounts.sh will produce feature count tables. These feature count tables can be combined using combining_featCount_tables.py.
3. These combined feature count tables can be used for differential expression (DE) analysis. An example DE analysis script is included in this project: DE_analysis_edgeR_script.R. This script uses the R programming language with the edgeR library.
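
Combining feature count tables means merging per-sample counts on a shared gene identifier, giving one row per gene and one column per sample. The Python sketch below shows that merge on in-memory data. It is a minimal illustration, not the code in combining_featCount_tables.py: the function names, the sample names, the counts, and the choice to fill missing genes with 0 are all assumptions.

```python
import csv
import io


def combine_count_tables(tables):
    """Merge per-sample {gene_id: count} tables into one table, one column per sample.

    Genes absent from a sample are given a count of 0 (an assumption of this sketch).
    """
    samples = sorted(tables)
    genes = sorted({gene for counts in tables.values() for gene in counts})
    header = ["Geneid"] + samples
    rows = [[gene] + [tables[s].get(gene, 0) for s in samples] for gene in genes]
    return header, rows


def to_csv(header, rows):
    """Serialise the combined table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up counts, for illustration only.
tables = {
    "sample_1": {"geneA": 10, "geneB": 0, "geneC": 7},
    "sample_2": {"geneA": 12, "geneB": 3},
}
header, rows = combine_count_tables(tables)
print(to_csv(header, rows))
```

A table in this shape (genes as rows, samples as columns) is the usual input for count-based DE analysis.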

Detecting orthologs between genomes

1. Augustus can be used to predict genes from FASTA nucleotide files.
2. Once the FASTA amino acid sequences have been extracted from the Augustus annotations, you can search for orthologs using OrthoFinder.
3. To find a specific gene of interest, search the amino acid sequences of the predicted genes using BLAST.
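
Before passing extracted amino acid sequences to OrthoFinder or BLAST, it can help to confirm that the FASTA file parses cleanly and holds the expected number of sequences. The Python sketch below is one way to do that check. The function name parse_fasta and the example records are hypothetical, and this is not part of Augustus, OrthoFinder, or BLAST.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence} for FASTA-formatted text.

    The ID is the header text up to the first whitespace; sequence lines
    belonging to one record are joined into a single string.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Made-up amino acid records, for illustration only.
example = """>g1.t1 hypothetical protein
MSTNPKPQRK
TKRNTNRRPQ
>g2.t1
MKV
"""

proteins = parse_fasta(example)
for name, seq in proteins.items():
    print(name, len(seq))
```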

Contents

1. General guides

Introduction to the command line
Windows Subsystem for Linux
Using Ubuntu through a Virtual Machine
File formats used in bioinformatics

2. Program guides

Augustus
Bcftools
BLAST
Bowtie
Bowtie2
Conda
Fasterq-dump
Fastq-dump
FeatureCounts
Htseq-count
OrthoFinder
SAMtools
sgRNAcas9
SPAdes

3. Scripts

Annotated SNPs filter
Annotating SNPs
Combining featCount tables.py
DE_analysis_edgeR_script.R
Fastq-dump to featureCounts
Genome annotation script
Linux setup script
SNP calling script
UniProt downloader

Installation instructions

After following these instructions, there will be a copy of the bioinfo-notebook GitHub repo on your system in the ~/bioinfo-notebook/ directory. This means there will be a copy of all the documents and scripts in this project on your computer. If you are using Linux and run the Linux setup script, the bioinfo-notebook virtual environment (which includes the majority of the command line programs covered in this project) will also be installed using conda.

1. This project is written for a Unix-like operating system (Linux, or macOS Mojave or later). If you are using a Windows operating system, begin with these pages on setting up Ubuntu (a Linux operating system):

Windows Subsystem for Linux
Using Ubuntu through a Virtual Machine

Once you have an Ubuntu system set up, run the following command to update the lists of available software:

$ sudo apt-get update # Updates lists of software that can be installed

2. Run the following command in your home directory (~) to download this project:

$ git clone https://github.com/rnnh/bioinfo-notebook.git

3. If you are using Linux, run the Linux setup script with this command after downloading the project:

$ bash ~/bioinfo-notebook/scripts/linux_setup.sh

Video demonstration of installation

Repository structure

bioinfo-notebook/
├── assets/
│   └── bioinfo-notebook_logo.svg
├── data/
│   ├── blastx_SwissProt_example_nucleotide_sequence.fasta.tsv
│   ├── blastx_SwissProt_S_cere.tsv
│   ├── design_table.csv
│   ├── example_genome_annotation.gtf
│   ├── example_nucleotide_sequence.fasta
│   └── featCounts_S_cere_20200331.csv
├── docs/
│   ├── annotated_snps_filter.md
│   ├── annotating_snps.md
│   ├── augustus.md
│   ├── blast.md
│   ├── bowtie2.md
│   ├── bowtie.md
│   ├── cl_intro.md
│   ├── cl_solutions.md
│   ├── combining_featCount_tables.md
│   ├── conda.md
│   ├── DE_analysis_edgeR_script.md
│   ├── DE_analysis_edgeR_script.pdf
│   ├── fasterq-dump.md
│   ├── fastq-dump.md
│   ├── fastq-dump_to_featureCounts.md
│   ├── featureCounts.md
│   ├── file_formats.md
│   ├── genome_annotation_SwissProt_CDS.md
│   ├── htseq-count.md
│   ├── linux_setup.md
│   ├── orthofinder.md
│   ├── part1.md    # Navigation page for website
│   ├── part2.md    # Navigation page for website
│   ├── part3.md    # Navigation page for website
│   ├── report_an_issue.md
│   ├── samtools.md
│   ├── sgRNAcas9.md
│   ├── snp_calling.md
│   ├── SPAdes.md
│   ├── ubuntu_virtualbox.md
│   ├── UniProt_downloader.md
│   └── wsl.md
├── envs/            # conda environment files
│   ├── augustus.yml            # environment for Augustus
│   ├── bioinfo-notebook.txt
│   ├── bioinfo-notebook.yml
│   ├── orthofinder.yml         # environment for OrthoFinder
│   └── sgRNAcas9.yml           # environment for sgRNAcas9
├── scripts/
│   ├── annotated_snps_filter.R
│   ├── annotating_snps.R
│   ├── combining_featCount_tables.py
│   ├── DE_analysis_edgeR_script.R
│   ├── fastq-dump_to_featureCounts.sh
│   ├── genome_annotation_SwissProt_CDS.sh
│   ├── linux_setup.sh
│   ├── snp_calling.sh
│   └── UniProt_downloader.sh
├── _config.yml     # Configures github.io project website
├── .gitignore
├── LICENSE
├── README.md
└── .travis.yml     # Configures Travis CI testing for GitHub repo
