human_genomics_pipeline

A Snakemake workflow to process single samples (unrelated individuals) or cohort samples (related individuals) of paired-end sequencing data (WGS or WES) using bwa and GATK4. Quality control checks are also undertaken. The fastq files can optionally be trimmed with Trim Galore, and the pipeline can run on NVIDIA GPUs where NVIDIA Clara Parabricks software is available, for significant speedups in analysis times. This workflow is designed to follow the GATK best practice workflow for germline short variant discovery (SNPs + Indels). This pipeline is designed to be followed by vcf_annotation_pipeline, with the data then ingested into scout for clinical interpretation. However, this pipeline also stands on its own, taking the data from fastq to vcf (raw sequencing data to called variants). This pipeline has been developed with human genetic data in mind, but it is designed to be species agnostic. Genetic data from other species can be analysed by setting a species-specific reference genome and variant databases in the configuration file (although not all situations have been tested).

human_genomics_pipeline
- Pipeline summary - single samples
- Pipeline summary - single samples - GPU accelerated
- Pipeline summary - cohort samples
- Pipeline summary - cohort samples - GPU accelerated
- Main output files
- Prerequisites
- Test human_genomics_pipeline
- Run human_genomics_pipeline
- Contribute back!

Pipeline summary - single samples

Raw read QC (FastQC and MultiQC)
Adapter trimming (Trim Galore) (optional)
Alignment against reference genome (Burrows-Wheeler Aligner)
Mark duplicates (GATK MarkDuplicates)
Base recalibration (GATK BaseRecalibrator and GATK ApplyBQSR)
Haplotype calling (GATK HaplotypeCaller)
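
To make the order of the steps above concrete, here is a minimal Python sketch that only builds the command lines for one sample; it does not run anything and it is not the pipeline's actual Snakemake rules. The function name, the intermediate file names, the read group string and the single known-sites file are assumptions made for illustration. QC and trimming are left out, and bwa writes SAM to stdout, so a real run would also need a sort/convert step that is not shown here.

```python
from typing import List


def single_sample_commands(sample: str, ref: str, known_sites: str,
                           fq1: str, fq2: str, threads: int = 4,
                           gvcf: bool = False) -> List[List[str]]:
    """Build (but do not execute) the per-sample commands in pipeline order."""
    # Hypothetical intermediate file names, chosen for this sketch only.
    sorted_bam = f"results/mapped/{sample}_sorted.bam"
    dedup_bam = f"results/mapped/{sample}_mkdups.bam"
    metrics = f"results/mapped/{sample}_mkdups_metrics.txt"
    recal_table = f"results/mapped/{sample}_recalibration_report.grp"
    # Final file names follow the "Main output files" section of this README.
    recal_bam = f"results/mapped/{sample}_recalibrated.bam"
    vcf = f"results/called/{sample}_raw_snps_indels.vcf"

    # Read group with literal "\t" separators, as bwa mem -R expects.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    # bwa mem writes SAM to stdout; producing sorted_bam from it is not shown.
    align = ["bwa", "mem", "-t", str(threads), "-R", read_group, ref, fq1, fq2]
    mark_dups = ["gatk", "MarkDuplicates",
                 "-I", sorted_bam, "-O", dedup_bam, "-M", metrics]
    base_recal = ["gatk", "BaseRecalibrator", "-I", dedup_bam, "-R", ref,
                  "--known-sites", known_sites, "-O", recal_table]
    apply_bqsr = ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
                  "--bqsr-recal-file", recal_table, "-O", recal_bam]
    call = ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam, "-O", vcf]
    if gvcf:
        # GVCF mode is what joint genotyping of cohorts builds on.
        call += ["-ERC", "GVCF"]
    return [align, mark_dups, base_recal, apply_bqsr, call]


if __name__ == "__main__":
    for cmd in single_sample_commands("sample1", "ref.fa", "dbsnp.vcf",
                                      "sample1_R1.fastq.gz", "sample1_R2.fastq.gz"):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps the sketch easy to inspect and avoids quoting problems if the commands are later passed to `subprocess.run`.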

Pipeline summary - single samples - GPU accelerated

Raw read QC (FastQC and MultiQC)
Adapter trimming (Trim Galore) (optional)
Alignment against reference genome, mark duplicates, base recalibration and haplotype calling (parabricks germline pipeline)
- Equivalent to Burrows-Wheeler Aligner, GATK MarkDuplicates, GATK BaseRecalibrator, GATK ApplyBQSR and GATK HaplotypeCaller

Pipeline summary - cohort samples

Raw read QC (FastQC and MultiQC)
Adapter trimming (Trim Galore) (optional)
Alignment against reference genome (Burrows-Wheeler Aligner)
Mark duplicates (GATK MarkDuplicates)
Base recalibration (GATK BaseRecalibrator and GATK ApplyBQSR)
Haplotype calling (GATK HaplotypeCaller)
Combine GVCF into multi-sample GVCF (GATK CombineGVCFs)
Genotyping (GATK GenotypeGVCFs)
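
The last two cohort steps can be sketched as follows. This is an illustrative Python helper that assembles the GATK CombineGVCFs and GenotypeGVCFs command lines for a set of per-sample GVCFs; it is a sketch under assumptions, not the pipeline's own rules. The function name, the combined GVCF file name and the input file names in the usage example are hypothetical.

```python
from typing import List, Sequence


def joint_genotyping_commands(gvcfs: Sequence[str], ref: str,
                              proband: str) -> List[List[str]]:
    """Build (but do not execute) the combine and genotype commands for a cohort."""
    if len(gvcfs) < 2:
        raise ValueError("a cohort needs at least two per-sample GVCFs")

    # Hypothetical name for the intermediate multi-sample GVCF.
    combined = f"results/called/{proband}_combined.g.vcf"
    # Final VCF name follows the "Main output files" section of this README.
    final_vcf = f"results/called/{proband}_raw_snps_indels.vcf"

    combine = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        # One --variant argument per per-sample GVCF.
        combine += ["--variant", gvcf]
    combine += ["-O", combined]

    genotype = ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined, "-O", final_vcf]
    return [combine, genotype]


if __name__ == "__main__":
    per_sample = [f"sample{i}.g.vcf" for i in (1, 2, 3)]
    for cmd in joint_genotyping_commands(per_sample, "ref.fa", "proband1"):
        print(" ".join(cmd))
```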

Pipeline summary - cohort samples - GPU accelerated

Raw read QC (FastQC and MultiQC)
Adapter trimming (Trim Galore) (optional)
Alignment against reference genome, mark duplicates, base recalibration and haplotype calling (parabricks germline pipeline)
- Equivalent to Burrows-Wheeler Aligner, GATK MarkDuplicates, GATK BaseRecalibrator, GATK ApplyBQSR and GATK HaplotypeCaller
Combine GVCF into multi-sample GVCF (parabricks trio combine gvcf)
- Equivalent to GATK CombineGVCFs
Genotyping (GATK GenotypeGVCFs)

Main output files

Single samples:

results/qc/multiqc_report.html
results/mapped/sample1_recalibrated.bam
results/called/sample1_raw_snps_indels.vcf

Cohort samples:

results/qc/multiqc_report.html
results/mapped/sample1_recalibrated.bam
results/mapped/sample2_recalibrated.bam
results/mapped/sample3_recalibrated.bam
results/called/proband1_raw_snps_indels.vcf
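
After a run, it can be handy to check that the main outputs exist. The sketch below derives the expected paths from the naming pattern in the two lists above. The function name and its parameters are hypothetical, and it assumes that a cohort run writes one VCF named after the proband, as the example above suggests.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Sequence


def expected_outputs(samples: Sequence[str], proband: Optional[str] = None,
                     results_dir: str = "results") -> List[str]:
    """List the main output files for a single-sample run or a cohort run.

    If `proband` is None, one VCF per sample is expected (single samples).
    Otherwise one cohort VCF named after the proband is expected.
    """
    root = PurePosixPath(results_dir)
    outputs = [str(root / "qc" / "multiqc_report.html")]
    # Every sample gets a recalibrated BAM in both modes.
    outputs += [str(root / "mapped" / f"{s}_recalibrated.bam") for s in samples]
    if proband is None:
        outputs += [str(root / "called" / f"{s}_raw_snps_indels.vcf") for s in samples]
    else:
        outputs.append(str(root / "called" / f"{proband}_raw_snps_indels.vcf"))
    return outputs


if __name__ == "__main__":
    for path in expected_outputs(["sample1", "sample2", "sample3"], proband="proband1"):
        print(path)
```

Passing each returned path to `os.path.exists` gives a quick completeness check for a finished run.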

Prerequisites

Prerequisite hardware: NVIDIA GPUs (for GPU accelerated runs)
Prerequisite software: NVIDIA Clara Parabricks and dependencies (for GPU accelerated runs), Git (tested with version 2.7.4), Mamba (tested with version 0.4.4) with Conda (tested with version 4.8.2), gsutil (tested with version 4.52), gunzip (tested with version 1.6)
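
A quick way to see which of the command-line prerequisites are missing on a new machine is to look them up on `PATH`. The snippet below is a minimal sketch: the executable names are assumed to match the tool names listed above, it does not check versions, and it does not check for GPUs or the Parabricks software.

```python
import shutil
from typing import Callable, List, Optional, Sequence

# Executable names assumed from the prerequisite list in this README.
REQUIRED_TOOLS = ("git", "mamba", "conda", "gsutil", "gunzip")


def missing_tools(tools: Sequence[str] = REQUIRED_TOOLS,
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All listed command-line prerequisites were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.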

Test human_genomics_pipeline

The provided test dataset can be used to test running this pipeline on a new machine, or test pipeline developments/releases.

Run human_genomics_pipeline

See the docs for a walkthrough guide for running human_genomics_pipeline on:

A single machine like a laptop or single server/computer
A high performance cluster

Contribute back!

Raise issues in the issues page
Create feature requests in the issues page
Start a discussion in the discussion page
Contribute your code! Create your own branch from the development branch and create a pull request to the development branch once the code is on point!

Contributions and feedback are always welcome! :blush:
