danbing-tk
danbing-tk copied to clipboard
Toolkit for VNTR genotyping and repeat-pan genome graph construction
danbing-tk
A toolkit for variable number tandem repeats (VNTRs) analysis, which enables:
- building repeat-pan genome graphs (RPGG) given haplotype-resolved assemblies for genome-wide profiling or simply VNTR alleles for targeted genotyping (referred to as
danbing-tk build
), - genotyping each VNTR as a set of (k-mer, count) given short-read sequencing (SRS) data (referred to as
danbing-tk align
), and - estimating VNTR or motif dosage from the genotype with bias correction (referred to as
danbing-tk predict
).
Our initial manuscript illustrates the key concept of this tool. The latest update details improvements on QC and bias correction, and extended applications on eQTL discoveries with motif compositions.
Download Releases
The latest release v1.3.1-manuscript comes with the lastest version of VNTR set, RPGG, and QC statistics.
File | Input of | Output of | |
---|---|---|---|
VNTR set | tr.good.bed |
danbing-tk build |
|
RPGG | pan.tr.kmers , pan.kmerDBi.umap , pan.kmerDBi.vv , pan.graph.umap |
danbing-tk align |
danbing-tk build or vntr2kmers_thread |
- Release v1.3.1-manuscript: provided QC statistics and all resources associated with the latest manuscript.
- Release v1.3: Updated RPGG built from 35 HGSVC genomes.
- Release v1.0: VNTR summary statistics and eGene discoveries are also included. Example analyses such as differential length/motif analysis, eQTL mapping, VNTR locus QC, sample QC are also included.
Building on Linux
git clone --recursive https://github.com/ChaissonLab/danbing-tk
cd danbing-tk && make -j 5
danbing-tk align
Decompress the RPGG RPGG.tar.gz
in your working directory.
An example usage to genotype SRS sample using the RPGG:
samtools fasta -@2 -n $SRS.bam |
/$PREFIX/danbing-tk/bin/danbing-tk -gc 85 3 -ae -kf 4 1 -cth 45 -o $OUT_PREF -k 21 -qs pan -fa /dev/stdin -p $THREADS | gzip >$OUT_PREF.aln.gz
danbing-tk align
takes ~12 cpu hours to genotype a 30x SRS sample. This will generate $OUT_PREF.tr.kmers
and $OUT_PREF.aln.gz
output with format specified in File Format.
Important note: If outputs of danbing-tk align
are intended to be compared across individuals e.g. association studies, please check the bias_correction notebook before running.
danbing-tk build
Install Dependencies
For users intend to use danbing-tk align
or the Scenario 1 of danbing-tk build
, this step is not required.
The danbing-tk build
pipeline and danbing-tk predict
require several external packages. It is recommended to install all requirements using conda as follows:
conda create -n $MY_ENVIRONMENT -c conda-forge -c bioconda \
python=3.11.4 snakemake=7.30.1 minimap2=2.26 samtools=1.17 bedtools=2.31.0 statsmodels=0.14.0 matplotlib=3.7.2
conda activate $MY_ENVIRONMENT
To check if everything is configured properly (tested on v1.3.2):
- Go to
/$PREFIX/danbing-tk/test/
- Replace
$PREFIX
ingoodPanGenomeGraph.json
andinput/genome.bam.tsv
with the path to danbing-tk - Run
snakemake -p -s ../pipeline/GoodPanGenomeGraph.snakefile -j 4 --forceall --output-wait 3
Running danbing-tk build
Scenario 1: building an RPGG for a single TR locus given VNTR alleles
-
Required inputs:
- VNTR alleles for each haplotype (one FASTA per haplotype)
-
Run
vntr2kmers_thread
with something like this:-
vntr2kmers_thread -g -k 21 -fs 700 -ntr 700 -on $NAME -fa $NUM_HAPLOTYPES $LIST_OF_FASTAS
- Note: at least 500 bp flanks are required for accurate mapping of pair-end reads, 700 bp was specified in the above example.
-
-
Index the graph as follows to use
danbing-tk align
later:-
/$PREFIX/danbing-tk/bin/ktools serialize $NAME
-
Scenario 2: building an RPGG for a VNTR set given assemblies
-
Required inputs:
- haplotype-resolved assemblies (FASTA)
- matched SRS data (BAM; optional)
- GRCh38 (FASTA; major chromosomes only without minor contigs)
- tandem repeat regions (BED;
tr.good.bed
from the release page or user-defined)
-
Copy
/$PREFIX/danbing-tk/pipeline/goodPanGenomeGraph.json
to your working directory and edit accordingly. -
A config file
/$INPUT_DIR/genome.bam.tsv
with two columns, one for genome name and one for bam file path, is required if SRS data is available for graph pruning, e.g.HG00514 /panfs/qcb-panasas/tsungyul/HG00514/HG00514.IL.srt.bam
Otherwise, set
pruning
ingoodPanGenomeGraph.json
toFalse
and use a single column input forgenome.bam.tsv
. -
Run the snakemake pipline with:
snakemake -p -s /$PREFIX/danbing-tk/pipeline/GoodPanGenomeGraph.snakefile -j 40\
--cluster "{params.copts} -c {resources.cores} --mem={resources.mem}G -k" \
--rerun-incomplete --restart-times 1 --output-wait 30
Submitting jobs to cluster is preferred as danbing-tk build
is compute-intensive, ~1200 cpu hours for the original dataset. Otherwise, remove --cluster
and its parameters to run jobs locally.
danbing-tk predict
Locus- and sample-specific biases are critical for normalizing the sum of k-mer counts to VNTR dosage (as a proxy for predicted length) and normalizing the average of k-mer counts to motif dosage. The bias for each locus in each sample is computed from the deviation of local read depth from the global mean given a set of invariant k-mers. Examples of this analysis can be found here
Automated bias correction has been added since v1.3.2. Invariant k-mer metadata ikmer.meta
(human readable version ikmer.meta.txt
) and example trkmers.meta.txt
can be found in Assets.
Example usage:
/$PREFIX/danbing-tk/bin/danbing-tk-pred trkmers.meta.txt ikmer.meta corrected.gt.tsv bias.tsv
Caveat: Estimated k-mer dosage could be inaccurate if the bias term is too close to zero.
Miscellaneous
Leave-one-out analysis
To evaluate the quality of custom RPGG on matching SRS dataset, copy /$PREFIX/danbing-tk/pipeline/leaveOneOut.snakefile
to your working directory and edit accordingly.
Run the snakemake pipleine with:
snakemake -p -s /$PREFIX/pipeline/LeaveOneOut.snakefile -j 40 --cluster \
"{params.copts} -c {resources.cores} --mem={resources.mem}G -k" \
--rerun-incomplete --restart-times 1 --output-wait 30
Submitting jobs to cluster is preferred as this analysis is compute-intensive; otherwise, remove --cluster
and its parameters to run jobs locally.
File Format
*.graph.kmers
>locus i
kmer0 out_edges0
kmer1 out_edges1
...
>locus i+1
...
out_edges
denotes the presence of T/G/C/A as the next nucleotide encoded with 4 bits.
*.(tr|ntr).kmers
>locus i
kmer0 kmer_count0
kmer1 kmer_count1
...
>locus i+1
...
The second field is optional.
Important Note: the output of danbing-tk align
do not contain locus info and the first field for minimal disk usage. The table can be reconstructed using the danbing_aln_output.tr_kmers.metadata.txt.gz
from metadata.tar.gz
on Zenodo
Alignment output (-a
option)
- Synopsis
src> <dest> <read_name> <read_seq/0> <read_seq/1> <ops/0> <annot/0> <ops/1> <annot/1>
-
src
: source locus of a read pair (for simulation only) -
dest
: aligned locus for the read pair -
ops
: nucleotide-level operations to align the read to the graph.size = read_len + #del
-
=
: a match -
X[A|C|G|T]
: a mismatch; letter in the graph is shown -
D[A|C|G|T]
: a deletion in the read; letter in the graph is shown -
I
: an insertion in the read -
*
: unalinged nucleotide
-
-
annot
: kmer-level VNTR annotations after applyingops
to the read.size = read_len - ksize + 1 + #del - #ins
-
=
: a match in the repeat -
.
: a match in the flank -
*
: unaligned kmer
-