strainy
Graph-based assembly phasing
Strainy
Strainy is a graph-based phasing algorithm that takes a de novo assembly graph (in GFA format) and simplifies it by combining phasing information and graph structure.
Conda Installation
The recommended way of installing is through conda:
git clone https://github.com/katerinakazantseva/stRainy
cd stRainy
git submodule update --init
make -C submodules/Flye
conda env create -f environment.yml -n strainy
Note that if you use an M1 conda installation, you should run conda config --add subdirs osx-64 before installation.
Find details here
Once installed, you will need to activate the conda environment prior to running:
conda activate strainy
./strainy.py -h
Quick usage example
After successful installation, you should be able to run:
conda activate strainy
./strainy.py -g test_set/toy.gfa -q test_set/toy.fastq.gz -o out_strainy -m hifi
Input requirements
Strainy supports PacBio HiFi and Nanopore (Guppy5+) sequencing.
The two main inputs to Strainy are:
- GFA file: A de novo metagenomic assembly that can be produced with metaFlye or minigraph. For metaFlye parameters, please see Improving de novo metagenomic assemblies below.
- FASTQ file: Reads to be aligned to the FASTA reference generated from the GFA file.
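To illustrate how the two inputs relate, the sketch below derives a FASTA reference from the segment (`S`) lines of a GFA file. It assumes GFA1 layout (segment name in the second tab-separated column, sequence in the third). The function name `gfa_to_fasta` is hypothetical, and this is not necessarily how Strainy builds its reference internally.

```python
def gfa_to_fasta(gfa_text: str, width: int = 60) -> str:
    """Convert the S (segment) lines of GFA1 text into FASTA text."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # Segment lines look like: S <name> <sequence> [optional tags...]
        if len(fields) < 3 or fields[0] != "S":
            continue
        name, seq = fields[1], fields[2]
        if seq == "*":
            # "*" means the sequence is not stored in the GFA; skip it.
            continue
        # Wrap the sequence at a fixed line width, as is common for FASTA.
        wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + name + "\n" + "\n".join(wrapped))
    return "\n".join(records) + ("\n" if records else "")
```

Header (`H`) and link (`L`) lines are ignored, so only the sequences that reads can align to end up in the reference.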
Improving de novo metagenomic assemblies
We have developed Strainy using metaFlye metagenomic assembly graphs as input. The recommended set of parameters is --meta --keep-haplotypes --no-alt-contigs -i 0.
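As a sketch of how these parameters fit into a full metaFlye invocation, the helper below assembles the command line as an argument list. The function name `metaflye_command`, the file names, and the thread count are placeholders. `--pacbio-hifi` is Flye's read-type option for HiFi reads (for Nanopore Guppy5+ reads, Flye provides `--nano-hq`).

```python
import shlex

def metaflye_command(reads, out_dir, read_type="--pacbio-hifi", threads=4):
    """Build a metaFlye command line using the parameters recommended above."""
    return [
        "flye", read_type, reads,
        "--meta", "--keep-haplotypes", "--no-alt-contigs",
        "-i", "0",              # zero polishing iterations
        "-o", out_dir,          # Flye output directory
        "-t", str(threads),
    ]

# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(metaflye_command("reads.fastq.gz", "flye_out")))
```

Flye writes its graph as assembly_graph.gfa inside the output directory, which can then be passed to Strainy with -g.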
Note that -i 0 disables metaFlye's polishing procedure; we found that skipping polishing improves read assignment to bubble branches during minimap2 realignment. --keep-haplotypes retains structural variations between strains on the assembly graph. --no-alt-contigs disables the output of "alternative" contigs, which can later confuse the read aligner.
Usage and outputs
Strainy has two stages: phase and transform. By default, Strainy performs both. See the Parameter description section for the full list of available arguments:
./strainy.py -g [gfa_file] -q [fastq_file] -m [mode] -o [output_dir]
1. The phase stage performs read clustering and produces CSV files detailing these clusters. A BAM file is also produced, which can be used to visualize the clusters.
2. The transform stage transforms and simplifies the initial assembly graph, producing the strain-resolved GFA file: strain_unitigs.gfa
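The two stages can also be run one at a time with the -s option. The sketch below builds the corresponding command lines. The helper name `strainy_command` is hypothetical, and it assumes that a separate transform run picks up the phase results from the same output directory.

```python
import shlex

STAGES = ("phase", "transform", "e2e")
MODES = ("hifi", "nano")

def strainy_command(gfa, fastq, mode, out_dir, stage="e2e", threads=4):
    """Build a strainy.py command line for a single stage."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    return [
        "./strainy.py",
        "-g", gfa, "-q", fastq,
        "-m", mode, "-o", out_dir,
        "-s", stage, "-t", str(threads),
    ]

# Run phase first, then transform, reusing the same output directory
# (assumption: transform reads what phase wrote there).
for stage in ("phase", "transform"):
    print(shlex.join(strainy_command(
        "test_set/toy.gfa", "test_set/toy.fastq.gz", "hifi", "out_strainy", stage=stage)))
```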
Parameter description
Argument | Description |
---|---|
-o, --output | Output directory |
-g, --gfa | Input assembly graph (.gfa) (may be produced with metaFlye or minigraph) |
-q, --fastq | FASTQ file containing reads (PacBio HiFi or Nanopore sequencing) |
-m, --mode | Type of the reads {hifi,nano} |
-s, --stage (Optional) | Stage to run: phase, transform or e2e (phase + transform) (default: e2e) |
--snp (Optional) | .vcf file, with variants of the desired allele frequency. If not provided, Strainy will use the built-in pileup-based caller |
-b, --bam (Optional) | .bam file generated by aligning the input reads to the input graph. If not provided, minimap2 will be used to generate one |
-a, --allele-frequency (Optional) | Allele frequency threshold for built-in pileup-based caller. Will only work if --snp is not used (default: None) |
-d, --cluster-divergence (Optional) | The maximum number of total mismatches allowed in the cluster per 1 kbp. Should be selected depending on SNP rates and their accuracy. Higher values can reduce high fragmentation at the cost of clustering accuracy (default: None) |
--unitig-split-length (Optional) | Length threshold (in kb) above which unitigs are split. Set to 0 to disable (default: 50 kb) |
--min-unitig-coverage (Optional) | The minimum coverage threshold for phasing unitigs. Unitigs with lower coverage will not be phased (default: 20) |
--max-unitig-coverage (Optional) | The maximum coverage threshold for phasing unitigs. Unitigs with higher coverage will not be phased (default: 500) |
-t, --threads (Optional) | Number of threads to use (default: 4) |
--debug (Optional) | Enables debug mode for extra logs and output |
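As an illustration of how the two coverage thresholds interact, the sketch below selects the unitigs that would be eligible for phasing under the default values. The function name `should_phase` and the coverage values are made up for the example, and treating the bounds as inclusive is an assumption based on the descriptions above rather than on Strainy's code.

```python
def should_phase(coverage, min_cov=20, max_cov=500):
    """Return True if a unitig's coverage lies within [min_cov, max_cov]."""
    return min_cov <= coverage <= max_cov

# Hypothetical per-unitig coverage values, for illustration only.
unitig_coverage = {"edge_1": 12.0, "edge_2": 85.5, "edge_3": 730.0}

# Keep only the unitigs whose coverage falls inside the allowed window.
phasable = [name for name, cov in unitig_coverage.items() if should_phase(cov)]
print(phasable)
```

Here edge_1 falls below the minimum and edge_3 exceeds the maximum, so only edge_2 would be phased.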
Output description
File | Description |
---|---|
strain_contigs.gfa | Phased graph (before simplifying links and merging contigs) |
strain_unitigs.gfa | Phased graph (after simplifying links and merging contigs) |
strain_variants.vcf | VCF produced by Strainy's built-in caller, if not provided by the user |
alignment_phased.bam | Alignment of the input reads to the input GFA, if not provided by the user |
multiplicity_stats.txt | Output statistics file (multiplicity and strain divergence info) |
phased_unitig_info_table.csv | Output statistics file (length, coverage, SNP rate) for phased unitigs |
reference_unitig_info_table.csv | Output statistics file (length, coverage, SNP rate) for reference unitigs |
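For a quick sanity check of the resulting graphs, a small summary of a GFA file can be handy. The sketch below counts segments, links, and total sequence length, assuming GFA1 `S` and `L` lines. The function name `gfa_summary` is hypothetical and not part of Strainy.

```python
def gfa_summary(gfa_text):
    """Count segments, links, and total sequence length in GFA1 text."""
    segments = 0
    links = 0
    total_bp = 0
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments += 1
            # "*" marks a segment whose sequence is not stored in the file.
            if fields[2] != "*":
                total_bp += len(fields[2])
        elif fields[0] == "L":
            links += 1
    return {"segments": segments, "links": links, "total_bp": total_bp}

# Usage (the path is a placeholder for wherever the output graph was written):
# with open("path/to/strain_unitigs.gfa") as handle:
#     print(gfa_summary(handle.read()))
```

Comparing the summaries of strain_contigs.gfa and strain_unitigs.gfa gives a rough view of how much the link simplification and contig merging step changed the graph.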
Acknowledgements
Strainy's consensus function is provided by Flye.
The community detection algorithm is provided by Karate Club.
Contributors
Strainy was originally developed at the Kolmogorov lab at NCI.
Code contributors:
- Ekaterina Kazantseva
- Ataberk Donmez
- Mikhail Kolmogorov
Citation
Ekaterina Kazantseva, Ataberk Donmez, Mihai Pop, Mikhail Kolmogorov. "Strainy: assembly-based metagenomic strain phasing using long reads" bioRxiv 2023, https://doi.org/10.1101/2023.01.31.526521
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.