sketchy
Genomic neighbor typing of bacterial pathogens using MinHash :rat:
Genomic neighbor typing for lineage and genotype inference
Overview
v0.6.0
Sketchy is a lineage calling and genotyping tool based on the heuristic principle of genomic neighbor typing developed by Karel Břinda and colleagues (2020). It queries species-wide ('hypothesis-agnostic') reference sketches using MinHash and infers the genotypes associated with the closest match, including multi-locus sequence types, susceptibility profiles, virulence factors or other genome-associated features provided by the user. Unlike the original implementation in RASE, sketchy does not use phylogenetic trees, which has some downsides, e.g. for sublineage genotype predictions (see below).
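The principle described above (sketch the k-mers with MinHash, find the closest reference, report its genotype) can be illustrated with a minimal Python sketch. It does not reproduce sketchy's implementation: the function names, the choice of a blake2b hash, and the `reference` and `genotypes` dictionaries are assumptions made for illustration, and canonical k-mers (strand handling) are ignored for brevity.

```python
import hashlib


def kmers(seq, k=16):
    """Yield all overlapping k-mers of a sequence (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a stable hash (assumed: blake2b)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=16, s=1000):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]


def jaccard_estimate(a, b, s=1000):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:s]  # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def closest_genotype(query_seq, reference, genotypes, k=16, s=1000):
    """Return (name, genotype) of the reference sketch closest to the query.

    reference: dict mapping a genome name to its bottom-s sketch (hypothetical layout)
    genotypes: dict mapping a genome name to its genotype record (hypothetical layout)
    """
    query = bottom_sketch(query_seq, k, s)
    best = max(reference, key=lambda name: jaccard_estimate(query, reference[name], s))
    return best, genotypes[best]
```

Calling `closest_genotype(read_sequence, reference, genotypes)` returns the name of the best-matching reference together with whatever genotype record was stored for it, which is the sense in which the genotype is inferred rather than detected directly.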
See the latest docs for install, usage and database building.
Install
Cargo:
cargo install sketchy
BioConda:
conda install -c bioconda sketchy
Release binaries are available for download. Reference sketches can be constructed from local assembly and genotype collections. S. aureus reference sketches are listed in the Data availability section below.
Strengths and limitations
- Reference sketches and genotype indices can be constructed easily from large genotype collections
- Sketchy requires few resources when using small sketch sizes (s = 1000)
- Sketchy performs best on lineage predictions and lineage-wide genotypes from very few reads. We found that tens to hundreds of reads can often give a good idea of the close matches in the reference sketch (especially when inspecting the top matches using --top)
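To make the "few reads, inspect the top matches" idea concrete, here is a minimal Python sketch that accumulates shared k-mer hashes over a handful of reads and ranks the reference sketches. It is an illustration only and is not taken from sketchy's source: the scoring by summed shared hashes, the blake2b hash, the `rank_references` name and the layout of `reference` (a dictionary of hash sets) are all assumptions.

```python
import hashlib
from collections import Counter


def kmer_hashes(seq, k=16):
    """Set of 64-bit hashes of all overlapping k-mers in seq."""
    seq = seq.upper()
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def make_sketch(seq, k=16, s=1000):
    """Bottom-s sketch stored as a set: the s smallest k-mer hashes."""
    return set(sorted(kmer_hashes(seq, k))[:s])


def rank_references(reads, reference, k=16, top=5):
    """Sum shared hashes per reference over all reads; return the top matches.

    reads:     iterable of read sequences (strings)
    reference: dict mapping genome name -> set of sketch hashes (hypothetical layout)
    """
    scores = Counter({name: 0 for name in reference})
    for read in reads:
        read_hashes = kmer_hashes(read, k)
        for name, sketch in reference.items():
            scores[name] += len(read_hashes & sketch)
    # most_common sorts by accumulated score, highest first
    return scores.most_common(top)
```

Looking at several top-ranked references rather than only the first is useful because closely related genomes can score similarly when only a few reads have been seen.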
However:
- Clade-specific genotype resolution is not as good as when using phylogenetic guide trees (RASE)
- Sketch size can be increased to improve performance (s = 10000), but resource requirements scale approximately linearly with sketch size
- Sketchy genotype inference may be difficult for species with high rates of homologous recombination
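The sketch-size trade-off can be made tangible with a back-of-the-envelope calculation. The Python snippet below assumes 64-bit hashes (8 bytes each) and uses the usual binomial approximation for the standard error of a MinHash Jaccard estimate, sqrt(J(1 - J) / s); both are generic MinHash assumptions rather than measurements of sketchy, and the function names are hypothetical.

```python
import math


def jaccard_standard_error(j, s):
    """Approximate standard error of a MinHash Jaccard estimate with sketch size s."""
    return math.sqrt(j * (1.0 - j) / s)


def sketch_memory_bytes(n_genomes, s, bytes_per_hash=8):
    """Raw hash storage for n_genomes sketches (assumes 8-byte hashes, no overhead)."""
    return n_genomes * s * bytes_per_hash


if __name__ == "__main__":
    for s in (1000, 10000):
        se = jaccard_standard_error(0.5, s)
        mem = sketch_memory_bytes(10_000, s)  # hypothetical collection of 10,000 genomes
        print(f"s={s}: standard error ~{se:.4f}, raw sketch storage {mem / 1e6:.0f} MB")
```

Under these assumptions a tenfold larger sketch costs ten times the storage while the standard error shrinks only by a factor of about sqrt(10), which is one way to read the linear resource scaling noted above.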
Data availability
- Reference sketches and genotype files (s = 1000, s = 10000, k = 16) for S. aureus (full genotypes including susceptibility predictions and other genotypes), S. pneumoniae, K. pneumoniae, P. aeruginosa and Neisseria spp. (MLST) can be found in the data repository.
- Reference sketches for cross-validation on the simulated species data can be found in this data repository; genome assemblies for all species extracted from the ENA reference collection are available in this data repository.
- Scripts to extract data from the ENA collections (Grace Blackwell et al.) and to compute reference metrics can be found in the scripts directory.
- Nanopore reads for the outbreak isolates and genotype surveillance panels in Papua New Guinea (Flongle, Goroka, sequential protocol) are available for download in the data repository. Raw sequence data (Illumina / ONT) is being uploaded to NCBI (PRJNA657380).
Preprint
If you use sketchy for research and other applications, please cite:
Steinig et al. (2022) - Genomic neighbor typing for bacterial outbreak surveillance - bioRxiv 2022.02.05.479210; doi: https://doi.org/10.1101/2022.02.05.479210