sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

containment search tutorial: searching for genomes in metagenomes

Open ctb opened this issue 1 month ago • 0 comments

https://hackmd.io/tKpLr1ISR9mqHHmZbEvcow?view

containment search tutorial: mgmanysearch

Install relevant sourmash stuff

mamba create -y -n contain_tut sourmash_plugin_branchwater
mamba activate contain_tut

pip install sourmash_plugin_containment_search

Get data

Download 2.2 GB of compressed raw reads in IBD_tutorial_raw.tar.gz from the MintO tutorial data set:

curl -JLO https://zenodo.org/records/6369313/files/IBD_tutorial_raw.tar.gz?download=1
tar xzvf IBD_tutorial_raw.tar.gz

This will create a directory IBD_tutorial_raw; we're interested in the metaG/ reads underneath.

ls IBD_tutorial_raw/metaG/

Prepare two metagenome samples for search

Let's sketch two of the samples:

ls IBD_tutorial_raw/metaG/{CD136,CD237}

combine pairs, sketch with abundances:

sourmash sketch dna --name CD136 \
    -p abund,k=31 \
    IBD_tutorial_raw/metaG/CD136/*.fq.gz \
    -o CD136.sig.zip
sourmash sketch dna --name CD237 \
    -p abund,k=31 \
    IBD_tutorial_raw/metaG/CD237/*.fq.gz \
    -o CD237.sig.zip

These are now our metagenome sketches!

Grab a genome or two

curl -JL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/154/385/GCF_000154385.1_ASM15438v1/GCF_000154385.1_ASM15438v1_genomic.fna.gz \
    -o f_prausnitzii.fna.gz

curl -JL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/292/165/GCF_018292165.1_ASM1829216v1/GCF_018292165.1_ASM1829216v1_genomic.fna.gz \
    -o b_uniformis.fna.gz

Sketch genomes

Sketch them as well - no abundances needed:

sourmash sketch dna b_uniformis.fna.gz \
    --name "B. uniformis" \
    -o b_uniformis.sig.zip

sourmash sketch dna f_prausnitzii.fna.gz \
    --name "F. prausnitzii" \
    -o f_prausnitzii.sig.zip

Search!

Then run

sourmash scripts mgmanysearch \
    --queries f_prausnitzii.sig.zip b_uniformis.sig.zip \
    --against CD*.sig.zip

which should yield:

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

Loaded 2 query signatures.

query             p_genome avg_abund   p_metag   metagenome name
--------          -------- ---------   -------   ---------------
F. prausnitzii       4.3%    24.7       2.5%     CD136
B. uniformis        48.3%     2.0       3.4%     CD136
F. prausnitzii       0.0%     0.0       0.0%     CD237
B. uniformis        62.6%    14.5      26.5%     CD237

Here,

  • p_genome is the fraction of the genome covered by the metagenome ("detection");
  • avg_abund is the average abundance of that genome in the metagenome (~mapping abundance);
  • p_metag is the fraction of the metagenome that would map to that genome;

More things that could be done

We have a much faster version of this in the works; this won't scale super well with dozens to hundreds of metagenomes.

We can provide ANI numbers between genome and metagenome if that's of interest.

Happy to chat about appropriate k-mer sizes and thresholds. This uses k=31 and no particular threshold; based on work here (Antarctic metagenome search paper) and elsewhere, I would expect that a threshold of > 99% of k-mers with a k-mer size of 51 would correspond to strain-resolved identity of the kind you're interested in.

ctb avatar Jul 01 '24 22:07 ctb