nPoRe
nPoRe copied to clipboard
nPoRe: n-Polymer Realigner for improved pileup-based variant calling
nPoRe: n-Polymer Realigner for improved pileup variant calling
Introduction
npore is a read realigner which recalculates each read's fine-grained alignment in order to more accurately align ''n-polymers'' such as homopolymers (n=1) and tandem repeats (2 ≤ n ≤ 6). In other words, given an input BAM, it adjusts each read's CIGAR string to more accurately model the most likely sequencing errors and actual variants. Traditional affine gap penalties are context-agnostic, and do not model the higher likelihood of INDELs in low-complexity regions (particularly n-polymers), leading to poor or inconsistent alignments. We find that npore improves pileup concordance across reads and results in slightly better variant calling performance.
Citation
Please cite the following pre-print if you use npore:
[bioRxiv] nPoRe: n-Polymer Realigner for improved pileup variant calling
@article {dunn-npore,
author = {Dunn, Tim and Blaauw, David and Das, Reetuparna and Narayanasamy, Satish},
title = {nPoRe: n-Polymer Realigner for improved pileup variant calling},
elocation-id = {2022.02.15.480561},
year = {2022},
doi = {10.1101/2022.02.15.480561},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2022/02/18/2022.02.15.480561},
eprint = {https://www.biorxiv.org/content/early/2022/02/18/2022.02.15.480561.full.pdf},
journal = {bioRxiv}
}
Contents
- Introduction
- Installation
- Usage
- Project Structure
- Data Sources
- Acknowledgements
Installation
Option 1: GitHub Source
First, clone the repository:
git clone https://github.com/timd1/npore && cd npore
Next, set up a virtual environment, activate it, and install the required packages.
python3 -m venv venv3 --prompt "npore"
source ./venv3/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
Please ensure this environment is activated when building or running npore.
Lastly, build npore and verify that it has succeeded.
make
python3 ./src/realign.py --help
Option 2: Docker Hub Image
A pre-built Docker image can be downloaded from here using:
sudo docker pull timd1/npore
sudo docker run -it timd1/npore:latest python3 realign.py --help
Option 3: Dockerfile
This may take some time to re-build the image; the previous option should be preferred in most cases.
git clone https://github.com/TimD1/npore && cd npore
sudo docker build -f ./Dockerfile -t timd1/npore:latest .
sudo docker run -it timd1/npore:latest python3 realign.py --help
Usage
Prerequisites
All input BAMs are required to:
- Have
MDtag annotations (forpysamread to ref mapping) - Be indexed (have an associated
.bam.baifile)
You can prepare your input BAM using samtools:
samtools calmd -@ `nproc` -b -Q orig_reads.bam ref.fasta > reads.bam
samtools index reads.bam
Option 1: GitHub Source
Here's an example usage of the main realign.py program, which will store results in realigned.sam.
export NPORE="$HOME/npore"
export DATA="$NPORE/test/data"
. $NPORE/venv3/bin/activate
python3 $NPORE/src/realign.py \
--bam $DATA/reads.bam \
--ref $DATA/ref.fasta \
--out_prefix $DATA/realigned \
--stats_dir $NPORE/guppy5_stats
For additional options, run python3 realign.py --help.
Option 2: Docker
Here's how to call the Docker container with the same arguments as above:
export NPORE="$HOME/npore"
export DATA="$NPORE/test/data"
sudo docker run \
-v $DATA:$DATA \
-v $NPORE/guppy5_stats:$NPORE/guppy5_stats \
timd1/npore:v0.1.0 \
python3 realign.py \
--bam $DATA/reads.bam \
--ref $DATA/ref.fasta \
--out_prefix $DATA/realigned \
--stats_dir $NPORE/guppy5_stats
Project Structure
src/ |
nPoRe source code |
|---|---|
realign.py |
Module for realigning a BAM file. |
standardize_vcf.py |
Module for standardizing a ground-truth VCF file to report variants in the same manner that nPoRe would align the reads. |
bed.py |
Module for computing n-polymer BED regions. |
purity.py |
Module for computing a BAM pileup's Gini purity, for measuring read concordance. |
filter.py |
Simple module for filtering overlapping variants. |
cfg.py |
Contains global variables and configuration. |
All other src/ files (aln.pyx, bam.pyx, cig.pyx, vcf.py, util.py) contain functions used in the above modules.
scripts/ |
Helper scripts used during evaluation |
|---|---|
realign_pipeline.sh |
Main Clair3 retraining pipeline. |
happy.sh |
Runs hap.py evaluation of all configurations/regions. |
plot_results.py |
Plots final precision/recall graphs. |
plot_sankey.py |
Generates Sankey plot of actual/error INDELs by n-polymer BED region. |
calc_beds.sh |
Calculates n-polymer BEDs, running bed.py. |
sankey.py |
Custom Sankey plot library, extended from pySankey. |
purity.sh |
Calculates Gini purity. |
align.sh |
Aligns reads to a reference, allowing multiple input formats. |
tag_unphased.py |
Tags unphased reads with HP:i:0. |
test/ |
Testing directory |
|---|---|
align.py |
Tests align() kernel. |
get_np_info.py |
Tests n-polymer info generation. |
realign.sh |
Tests full read realignment. |
test_std_vcf.sh |
Tests VCF standardization. |
profile_alignment.ipynb |
Line-by-line profiling of align() kernel. |
*stats/ |
Directory storing cached confusion matrices |
|---|
Data Sources
The Genome In A Bottle GRCh38 v4.1 ground truth VCF and benchmarking regions were downloaded from here. The GRCh38 human reference sequence and R9.4.1 reads basecalled with the Guppy 5.0.6 super-accuracy model were downloaded from ONT Open Datasets.
Acknowledgements
We would like to thank the developers of samtools, minimap2, pysam, clair3, pepper-deepvariant, pysankey, igv, and swalign. We would also like to thank GIAB and ONT for making their data available publicly.