parascopy
parascopy copied to clipboard
Copy number estimation and variant calling for duplicated genes using WGS.
Parascopy
Parascopy is designed for robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing.
Created by Timofey Prodanov timofey.prodanov[at]hhu.de and Vikas Bansal vibansal[at]ucsd.edu at the University of California San Diego.
Table of contents
- Citing Parascopy
- Installation
- General usage
- Input files
- Visualizing output
- Output files
- Precomputed data
- Test dataset
- Frequently asked questions
- Issues
- See also
Citing Parascopy
If you use Parascopy, please cite:
- Prodanov, T. & Bansal, V. Robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing. Nature Communications 13, 3221 (2022). https://doi.org/10.1038/s41467-022-30930-3
- Prodanov, T. & Bansal, V. A multi-locus approach for accurate variant calling in low-copy repeats using whole-genome sequencing. Bioinformatics 39, i279-i287 (2023). https://doi.org/10.1093/bioinformatics/btad268
Installation
Parascopy is written in Python (≥ v3.6) and is available on Linux and macOS.
You can install Parascopy using conda:
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c bioconda parascopy
In most cases, it takes 2-6 minutes to install Parascopy using conda.
If you have problems with installing Parascopy using conda, it is possible to create a new conda environment
(see more details here):
conda create --name paras_env parascopy
conda activate paras_env
Alternatively, you can install it manually using the following commands:
git clone https://github.com/tprodanov/parascopy.git
cd parascopy
python3 setup.py install
Parascopy depends on several Python modules (see here). To install Parascopy without dependencies, you can run
python3 setup.py develop --no-deps
Additionally, you can specify installation path using --prefix <path>.
In case of manual installation, Parascopy requires
In order to run variant calling (parascopy call), Parascopy requires a modified Freebayes executable.
It is installed automatically via conda, but for manual installation you need to run
# Within the main parascopy directory.
git clone --recursive https://github.com/tprodanov/freebayes
cd freebayes
./compile.sh
cd ..
General usage
See parascopy help or parascopy <command> --help for more information.
Additionally, you can find a test dataset and pipeline here.
Homology table
Parascopy uses a database of duplications - homology table. To construct a homology table you would need to run:
parascopy pretable -f genome.fa -o pretable.bed.gz
parascopy table -i pretable.bed.gz -f genome.fa -o table.bed.gz
Note, that the reference genome should be indexed with both samtools faidx and bwa index.
Alternatively, you can download a precomputed homology table.
Estimating copy number
In order to find aggregate and paralog-specific copy number profiles (agCN and psCN), you can run the following commands:
# Calculate background read depth.
parascopy depth -I input.list -g hg38 -f genome.fa -o depth
# Estimate agCN and psCN for multiple samples.
parascopy cn -I input.list -t table.bed.gz -f genome.fa -R regions.bed -d depth -o analysis1
# Estimate agCN and psCN for single/multiple samples using model parameters from a previous run.
parascopy depth -I input2.list -g hg38 -f genome.fa -o depth2
parascopy cn-using analysis1/model -I input2.list -t table.bed.gz -f genome.fa -d depth2 -o analysis2
Calling variants
Parascopy variant calling in duplicated regions is available since version v1.9.
Currently, variant calling is performed on top of an existing copy number analysis.
To run variant calling you can execute the following command:
parascopy call -p analysis1 -f genome.fa -t table.bed.gz
Input files
Input alignment files should be sorted and indexed, both .bam and .cram formats are supported.
Input filenames can be passed into Parascopy as -i in1.bam in2.bam ... or as -I input-list.txt,
where input-list.txt is a text file with a single filename on each line.
Additionally, you can provide/override sample names with
-i in1.bam::sampleA in2.bam::sampleB or, in case of -I input-list.txt, as a second entry on each line:
in1.bam sampleA
in2.bam sampleB
Modifying reference copy number
After agCN estimation, parascopy searches for reliable PSVs.
However, only samples that are consistent with the reference copy number are used (in a two-copy duplication,
only samples with agCN = 4 will be used).
Additionally, by default, sex chromosomes X and Y are treated as regular chromosomes.
To solve both issues, one can use --modify-ref argument and provide updated reference copy numbers for some samples
and some regions.
The BED file can contain the following lines:
#CHROM START END SAMPLES CN # This line is not necessary.
chr1 10000 20000 * 0 # It is known that the region is missing from all samples.
chrX 0 inf SAMPLE1,SAMPLE2,SAMPLE3 1 # Male samples have one chrX and one chrY.
chrY 0 inf SAMPLE1,SAMPLE2,SAMPLE3 1
chrY 0 inf SAMPLE4,SAMPLE5,SAMPLE6 0 # Female samples are missing chrY.
Visualizing output
It is possible to visualize agCN and psCN detection process.
To do that you need to clone this repository and run scripts in draw directory.
The scripts are written in R language and require a number of R packages:
install.packages(c('argparse', 'tidyverse', 'ggplot2', 'ComplexHeatmap', 'viridis', 'circlize', 'ggthemes', 'RColorBrewer'))
Output files
Copy number variation analysis (parascopy cn|cn-using) produces a range of files, described here.
Variant calling produces regular VCF files, additional information can be found here.
Precomputed data
Precomputed homology tables for GRCh37, GRCh38 and CHM13 reference genomes can be found here. The same link also contains precomputed model parameters for 168 disease-associated loci and 5 continental populations.
Compatible reference genomes (without the ALT contigs) can be dowloaded from
Note that if the BAM/CRAM files contain ALT contigs, you should provide modified reference copy number values
using --modify-ref argument, where all ALT contigs have the reference copy number 0.
Additionally, you should use homology table constructed on a reference file with ALT contigs.
Test dataset
You can find the full test pipeline here. Alternatively, you can follow step-by-step instructions below:
First, place reference human genome (hg38), precomputed homology tables and model parameters in data directory.
For single-sample analysis, we subsampled HG00113 human genome, which can downloaded
here (360 Mb).
Please, index the cram file using samtools index HG00113.cram.
We start by calculating background read depth, which should take 2-5 minutes:
parascopy depth -i HG00113.cram -f data/hg38.fa -g hg38 -o depth_HG00113
Next, we calculate copy number profiles for 167 duplicated loci using precomputed model parameters
obtained using 503 European samples.
This step takes 10-40 minutes depending on the number of threads (controlled by -@ N).
parascopy cn-using data/models_v1.2.5/EUR \
-i HG00113.cram -f data/hg38.fa -t data/homology_table/hg38.bed.gz \
-d depth_HG00113 -o parascopy_HG00113
You can analyze a subset of loci by specifying data/models_v1.2.5/EUR/<locus>.gz,
for example analysis of the SMN1 locus should take less than a minute.
For multi-sample analysis, we extracted reads aligned to the GBA/GBAP1 locus for 503 European samples,
can be downloaded here (195 Mb)
(extract using tar xf GBA.tar.gz).
Background read depth is already calculated and is located in GBA/1kgp_depth.csv.gz.
Calculating copy number profiles for the GBA/GBAP1 locus should take around 25-30 minutes using a single core.
parascopy cn -I GBA/input.list -f data/hg38.fa \
-t data/homology_table/hg38.bed.gz -d GBA/1kgp_depth.csv.gz \
-r chr1:155231479-155244699::GBA -o parascopy_GBA
Here, the region is supplied using the -r chr:start-end[::name] format,
alternatively you can supply regions in a BED file using the -R argument (optional: fourth column with region names).
Sample output can be found here (251 Mb), output files description can be found here.
Frequently asked questions
You can find FAQ here.
Issues
Please submit issues here or send them to timofey.prodanov[at]gmail.com.
See also
Additionally, you may be interested in these tools: