Graph-based assembly phasing

CC BY-NC-SA 4.0

Strainy

Strainy is a graph-based phasing algorithm that takes a de novo assembly graph (in GFA format) and simplifies it by combining phasing information and graph structure.

Conda Installation

The recommended way of installing is through conda:

git clone https://github.com/katerinakazantseva/stRainy
cd stRainy
git submodule update --init
make -C submodules/Flye
conda env create -f environment.yml -n strainy

Note that if you use conda on an Apple M1 machine, you should run conda config --add subdirs osx-64 before installation.

Once installed, you will need to activate the conda environment prior to running:

conda activate strainy
./strainy.py -h

Quick usage example

After successful installation, you should be able to run:

conda activate strainy
./strainy.py -g test_set/toy.gfa -q test_set/toy.fastq.gz -o out_strainy -m hifi 

Input requirements

Strainy supports PacBio HiFi and Nanopore (Guppy5+) sequencing.

The two main inputs to Strainy are:

  1. GFA file: A de novo metagenomic assembly that can be produced with metaFlye or minigraph. For metaFlye parameters, please see Improving de novo metagenomic assemblies below.
  2. FASTQ file: Reads to be aligned to the FASTA reference generated from the GFA file.

Improving de novo metagenomic assemblies

We have developed Strainy using metaFlye metagenomic assembly graphs as input. The recommended set of parameters is --meta --keep-haplotypes --no-alt-contigs -i 0.

Note that -i 0 disables metaFlye's polishing procedure, which we found to improve read assignment to bubble branches during minimap2 realignment. --keep-haplotypes retains structural variations between strains on the assembly graph. --no-alt-contigs disables the output of "alternative" contigs, which can later confuse the read aligner.
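
The recommended parameters can be combined into a single metaFlye call. The following is a minimal sketch rather than a definitive pipeline: the file name reads.fastq.gz, the output directory flye_out, the thread count, and the run dry-run helper are assumptions made for illustration. --nano-hq and --pacbio-hifi are Flye's read-type options for Guppy5+ Nanopore and PacBio HiFi reads, respectively.

```shell
# Hypothetical helper: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

READS="reads.fastq.gz"   # assumed name of the input reads
ASM_DIR="flye_out"       # assumed metaFlye output directory

# For PacBio HiFi reads, replace --nano-hq with --pacbio-hifi.
run flye --nano-hq "$READS" --meta --keep-haplotypes --no-alt-contigs -i 0 \
    -o "$ASM_DIR" -t 4
```

metaFlye writes the assembly graph to assembly_graph.gfa inside its output directory (here flye_out/assembly_graph.gfa), which can then be passed to Strainy with -g.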

Usage and outputs

Strainy has two stages: phase and transform. By default, Strainy performs both. See the Parameter description section for the full list of available arguments:

./strainy.py -g [gfa_file] -q [fastq_file] -m [mode] -o [output_dir]

1. The phase stage performs read clustering and produces CSV files detailing these clusters. A BAM file is also produced, which can be used to visualize the clusters.

2. The transform stage transforms and simplifies the initial assembly graph, producing the strain-resolved GFA file: strain_unitigs.gfa
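
The two stages can also be run one at a time with -s. The sketch below uses the toy data from the quick usage example and assumes that the transform stage picks up the phase results from the same output directory; the run helper is a hypothetical dry-run wrapper, not part of Strainy.

```shell
# Hypothetical helper: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

GFA="test_set/toy.gfa"
READS="test_set/toy.fastq.gz"
OUT="out_strainy"

# Stage 1: cluster the reads (writes the cluster CSV files and a BAM file).
run ./strainy.py -g "$GFA" -q "$READS" -m hifi -o "$OUT" -s phase

# Stage 2: simplify the graph (assumed to reuse the phase results in $OUT).
run ./strainy.py -g "$GFA" -q "$READS" -m hifi -o "$OUT" -s transform
```

Splitting the run this way makes it possible to inspect the clusters before the graph is transformed.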

Parameter description

Argument                    Description
-o, --output                Output directory
-g, --gfa                   Input assembly graph (.gfa); may be produced with metaFlye or minigraph
-q, --fastq                 FASTQ file containing reads (PacBio HiFi or Nanopore sequencing)
-m, --mode                  Type of reads {hifi,nano}
-s, --stage                 (Optional) Stage to run: phase, transform, or e2e (phase + transform) (default: e2e)
--snp                       (Optional) .vcf file with variants of the desired allele frequency. If not provided, Strainy uses its built-in pileup-based caller
-b, --bam                   (Optional) .bam file generated by aligning the input reads to the input graph. If not provided, minimap2 is used to generate one
-a, --allele-frequency      (Optional) Allele frequency threshold for the built-in pileup-based caller. Only applies if --snp is not used (default: None)
-d, --cluster-divergence    (Optional) Maximum number of total mismatches allowed in a cluster per 1 kbp. Should be selected depending on SNP rates and their accuracy. Higher values can reduce fragmentation at the cost of clustering accuracy (default: None)
--unitig-split-length       (Optional) Length threshold (in kb); unitigs longer than this are split. Set to 0 to disable (default: 50 kb)
--min-unitig-coverage       (Optional) Minimum coverage threshold for phasing unitigs; unitigs with lower coverage will not be phased (default: 20)
--max-unitig-coverage       (Optional) Maximum coverage threshold for phasing unitigs; unitigs with higher coverage will not be phased (default: 500)
-t, --threads               (Optional) Number of threads to use (default: 4)
--debug                     (Optional) Enables debug mode for extra logs and output
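
As an illustration of how the optional arguments combine, the sketch below reuses a precomputed alignment and variant calls instead of letting Strainy generate them. The file names (assembly_graph.gfa, reads.fastq.gz, reads_to_graph.bam, variants.vcf), the chosen values, and the run dry-run helper are assumptions made for this example, not recommended settings.

```shell
# Hypothetical helper: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

GFA="assembly_graph.gfa"     # assumed input graph
READS="reads.fastq.gz"       # assumed Nanopore reads
BAM="reads_to_graph.bam"     # assumed existing alignment of READS to GFA
VCF="variants.vcf"           # assumed existing variant calls

# -b and --snp skip the built-in alignment and variant calling steps;
# -t and --min-unitig-coverage override the defaults (4 and 20).
run ./strainy.py -g "$GFA" -q "$READS" -m nano -o out_strainy \
    -b "$BAM" --snp "$VCF" -t 8 --min-unitig-coverage 10
```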

Output description

strain_contigs.gfa

phased graph (before simplifying links and merging contigs)

strain_unitigs.gfa

phased graph (after simplifying links and merging contigs)

strain_variants.vcf

VCF produced by Strainy's built-in caller if not provided by the user

alignment_phased.bam

alignment (input reads to the input gfa) if not provided by user

multiplicity_stats.txt

output statistics file (multiplicity and strain divergence info)

phased_unitig_info_table.csv

output statistics file (length, coverage, SNP rate) for phased unitigs

reference_unitig_info_table.csv

output statistics file (length, coverage, SNP rate) for reference unitigs
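
A quick way to see the effect of phasing is to compare the number of segment (S) and link (L) records in the input graph and in strain_unitigs.gfa. The sketch below is only an illustration: gfa_counts is a hypothetical helper, and the paths assume the toy run from the quick usage example with the output files written directly into out_strainy.

```shell
# Hypothetical helper: count segment (S) and link (L) records in a GFA file.
gfa_counts() {
  awk -F '\t' '
    $1 == "S" { segments++ }
    $1 == "L" { links++ }
    END { printf "segments=%d links=%d\n", segments, links }
  ' "$1"
}

# Assumed paths; files that do not exist are skipped.
for gfa in test_set/toy.gfa out_strainy/strain_unitigs.gfa; do
  if [ -f "$gfa" ]; then
    printf '%s\t' "$gfa"
    gfa_counts "$gfa"
  fi
done
```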

Acknowledgements

Strainy uses Flye for its consensus function.

Strainy uses Karate Club for community detection.

Contributors

Strainy was originally developed at the Kolmogorov lab at NCI.

Code contributors:

  • Ekaterina Kazantseva
  • Ataberk Donmez
  • Mikhail Kolmogorov

Citation

Ekaterina Kazantseva, Ataberk Donmez, Mihai Pop, Mikhail Kolmogorov. "Strainy: assembly-based metagenomic strain phasing using long reads" bioRxiv 2023, https://doi.org/10.1101/2023.01.31.526521

License

Shield: CC BY-NC-SA 4.0

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
