sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

2024 STAMPS tutorials - "running k-mer analyses with sourmash" venn & upset diagrams, ordination, and presence/absence diagram

Open ctb opened this issue 1 year ago • 3 comments
trafficstars

https://hackmd.io/9ORFRJGaTOiOdEAY-Aih2A?view


open lab / Sat + Monday , Jul 20 & 22, 2024 / STAMPS 2024

Suggestion for open lab:

  • run the commands below on the "canned" data;
  • diagram out where the information is coming from:
    • sourmash sketch dna creates files containing DNA k-mers;
    • the various comparisons look at overlaps and so on for the provided k-mer collections;
  • skim the docs on sourmash and the betterplot plugin and think about the various options;
  • ask questions about:
    • conda for software installation
    • trying the analyses on other public (or private) data;
    • doing other analyses (overlap, containment, etc.) with the k-mer sketches;

Getting started

Start up RStudio on your instance, and click on Terminal.

Setup (do these once)

Create a working directory

mkdir ~/smash-data/

Install sourmash and various plugins with conda (also see a conda tutorial; you'll need to install miniforge if you're not running conda/mamba on your Jetstream computer.)

Then run:

conda env create -f /opt/shared/sourmash-data/env-smash.yml \
    -n smash

This installs:

Change directory and activate environment (each time you log in)

Change to sourmash working directory and activate sourmash software environment:

cd ~/smash-data/
conda activate smash

Might as well check for updates, sourmash is fast moving ;).

pip install -U sourmash_plugin_betterplot

Convert some metagenomes into k-mer signatures

Use sourmash sketch to turn metagenomes into k-mers:

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz \
    -p k=31,abund -o SRR7947178.sig.zip --name SRR7947178

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz \
    -p k=31,abund -o SRR7947179.sig.zip --name SRR7947179

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz \
    -p k=31,abund -o SRR7947181.sig.zip --name SRR7947181

This will take about 5 minutes and create three SRR*.sig.zip files.

(You can grab any metagenome you want from ENA, e.g. SRR7947181; use curl -JLO <read URL to download the FASTQ file(s) and sketch them as above!)

Sketch some genomes into k-mer signatures

Use sourmash sketch to turn genomes into k-mers:

for i in /opt/shared/sourmash-data/10sketches/*.fa
do
    sourmash sketch dna $i -o genome-$(basename $i .fa).sig.zip \
        --name-from-first
done

sourmash sig cat genome*.sig.zip -o 10sketches.sig.zip

Run various analyses on the k-mer sketches!

Genome comparisons via venn and upset

Run:

sourmash scripts venn genome-{2,47,63}.sig.zip \
    -o genomes-venn.png --ident

to produce a file genomes-venn.png:

image

Run:

sourmash scripts upset genome-{2,47,63}.sig.zip \
    --show-singletons -o genomes-upset.png

to produce a file genomes-upset.png:

image

Genome distance matrix, dendrograms, and ordination

First compare the genomes:

sourmash compare 10sketches.sig.zip -o 10sketches.cmp \
    --labels-to 10sketches.labels_to.csv

Now build a matrix+dendrogram view:

sourmash scripts plot2 10sketches.cmp 10sketches.labels_to.csv \
    -o 10sketches.mat.png

to produce 10sketches.mat.png:

image

You can produce a metric MDS plot too, colored by species:

sourmash scripts mds 10sketches.cmp 10sketches.labels_to.csv \
    -o 10sketches.mds.png \
    -C /opt/shared/sourmash-data/10sketches/10sketches-categories.csv

to produce 10sketches.mds.png:

image

Flat (no abund) metagenome comparisons via venn and upset

Run a venn comparison of metagenomes => metag-venn.png.

sourmash scripts venn SRR*.sig.zip \
    -o metag-venn.png

image

Run an upset comparison of metagenomes => metag-upset.png

sourmash scripts upset SRR*.sig.zip \
    --show-singletons -o metag-upset.png

image

Genome presence/absence

Calculate which GTDB genomes are in a metagenome:

sourmash scripts fastgather SRR7947178.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947178.x.gtdb-rs214.csv

(will take ~3 minutes)

Make the detection/abundance plot:

sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
    -o SRR7947178.detection.png

=> SRR7947178.detection.png:

image

sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
    -o SRR7947178.ani.png \
    --ani

=> SRR7947178.ani.png

image

The three columns being plotted from SRR7947178.x.gtdb-rs214.csv are:

  • f_match_orig - the detection / containment of the match in the metagenome;
  • match_containment_ani - the containment-based ANI of the match;
  • average_abund - abundance of matching k-mers in the metagenome;

Generating taxonomic classifications for metagenomes with sourmash

The basic workflow is to first run sourmash gather as above, and then run sourmash tax metagenome.

Run gather for each metagenome:

sourmash scripts fastgather SRR7947179.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947179.x.gtdb-rs214.csv
    
sourmash scripts fastgather SRR7947181.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947181.x.gtdb-rs214.csv

Then you can run taxonomic analyses like so:

sourmash tax metagenome -g *.x.gtdb-rs214.csv  \
    -t /opt/shared/sourmash-data/gtdb-rs214.lineages.sqldb \
    -r order -F human

See the sourmash tax documentations for various output options!

Getting % of metagenome covered from gather

If you've run a fastgather above, you will still need to run sourmash gather to get nice human-readable output; you can do this quickly by using the fastgather output as a picklist to limit the results to the previous calculated output:

sourmash gather --picklist SRR7947179.x.gtdb-rs214.csv::prefetch \
    SRR7947179.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip

and you should then see at the end:

found 849 matches total;
the recovered matches hit 83.1% of the abundance-weighted query.
the recovered matches hit 68.7% of the query k-mers (unweighted).

Running abundhist to generate metagenome abundance histograms

After activating the smash conda environment, you can install the abundhist plugin like so:

pip install sourmash_plugin_abundhist

and run it on any metagenome or mixture that has been sketched with -p abund like so:

sourmash scripts abundhist SRR7947178.sig.zip -M 100 \
    --figure SRR7947178.abundhist.png

Other topics we can discuss!

  • Building your own sourmash database / extending the existing databases with your own private/unpublished genomes

Appendix 1: UNIX Command Line!

Totally new to the command-line, or want to strengthen your foundation? Do this Unix Crash Course.

If you want some experience in other aspects, consider going through these:

Appendix 2: Conda tutorial

Please see my conda tutorial; you'll need to install miniforge if you're not running things on your Jetstream computer.

Appendix 3: examining assembly overlap with k-mers

See my in-class notes.

Contact info

Contact Titus at [email protected].

ctb avatar Jul 21 '24 11:07 ctb

TODO: make setup instructions / files available somewhere!

ctb avatar Jul 21 '24 11:07 ctb

installation stuff/sourmash

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

conda env create -n smash-foo -f /opt/shared/sourmash-data/env-smash.yml

Appendix: sourmash env file

cat > env-smash.yml <<EOF
name: smash
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - sourmash>=4.8.10,<5
    - seaborn
    - scikit-learn
    - seaborn
    - sourmash_plugin_branchwater==0.9.5
    - pip
    - pip:
      - sourmash_plugin_betterplot
      - sourmash_plugin_abundhist
EOF

Appendix 2

PatientHNC_06_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_2.fastq.gz

PatientHNC_03_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_2.fastq.gz

PatientHNC_10_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_2.fastq.gz

Sketching

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz -p k=31,abund -o SRR7947178.sig.zip --name SRR7947178

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz -p k=31,abund -o SRR7947179.sig.zip --name SRR7947179

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz -p k=31,abund -o SRR7947181.sig.zip --name SRR7947181

ctb avatar Jul 21 '24 11:07 ctb

thoughts for next year -

  • emphasize "coverage of metagenome by references"
  • ...and then provide alternative suggestions for what to do if coverage is poor :)

ctb avatar Jul 23 '24 14:07 ctb