sourmash
sourmash copied to clipboard
2024 STAMPS tutorials - "running k-mer analyses with sourmash" venn & upset diagrams, ordination, and presence/absence diagram
https://hackmd.io/9ORFRJGaTOiOdEAY-Aih2A?view
open lab / Sat + Monday , Jul 20 & 22, 2024 / STAMPS 2024
Suggestion for open lab:
- run the commands below on the "canned" data;
- diagram out where the information is coming from:
sourmash sketch dnacreates files containing DNA k-mers;- the various comparisons look at overlaps and so on for the provided k-mer collections;
- skim the docs on sourmash and the betterplot plugin and think about the various options;
- ask questions about:
- conda for software installation
- trying the analyses on other public (or private) data;
- doing other analyses (overlap, containment, etc.) with the k-mer sketches;
Getting started
Start up RStudio on your instance, and click on Terminal.
Setup (do these once)
Create a working directory
mkdir ~/smash-data/
Install sourmash and various plugins with conda (also see a conda tutorial; you'll need to install miniforge if you're not running conda/mamba on your Jetstream computer.)
Then run:
conda env create -f /opt/shared/sourmash-data/env-smash.yml \
-n smash
This installs:
- sourmash
- the branchwater plugin for sourmash
- the betterplot plugin for sourmash
- the containment search plugin for sourmash
Change directory and activate environment (each time you log in)
Change to sourmash working directory and activate sourmash software environment:
cd ~/smash-data/
conda activate smash
Might as well check for updates, sourmash is fast moving ;).
pip install -U sourmash_plugin_betterplot
Convert some metagenomes into k-mer signatures
Use sourmash sketch to turn metagenomes into k-mers:
sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz \
-p k=31,abund -o SRR7947178.sig.zip --name SRR7947178
sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz \
-p k=31,abund -o SRR7947179.sig.zip --name SRR7947179
sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz \
-p k=31,abund -o SRR7947181.sig.zip --name SRR7947181
This will take about 5 minutes and create three SRR*.sig.zip files.
(You can grab any metagenome you want from ENA, e.g. SRR7947181; use curl -JLO <read URL to download the FASTQ file(s) and sketch them as above!)
Sketch some genomes into k-mer signatures
Use sourmash sketch to turn genomes into k-mers:
for i in /opt/shared/sourmash-data/10sketches/*.fa
do
sourmash sketch dna $i -o genome-$(basename $i .fa).sig.zip \
--name-from-first
done
sourmash sig cat genome*.sig.zip -o 10sketches.sig.zip
Run various analyses on the k-mer sketches!
Genome comparisons via venn and upset
Run:
sourmash scripts venn genome-{2,47,63}.sig.zip \
-o genomes-venn.png --ident
to produce a file genomes-venn.png:

Run:
sourmash scripts upset genome-{2,47,63}.sig.zip \
--show-singletons -o genomes-upset.png
to produce a file genomes-upset.png:

Genome distance matrix, dendrograms, and ordination
First compare the genomes:
sourmash compare 10sketches.sig.zip -o 10sketches.cmp \
--labels-to 10sketches.labels_to.csv
Now build a matrix+dendrogram view:
sourmash scripts plot2 10sketches.cmp 10sketches.labels_to.csv \
-o 10sketches.mat.png
to produce 10sketches.mat.png:

You can produce a metric MDS plot too, colored by species:
sourmash scripts mds 10sketches.cmp 10sketches.labels_to.csv \
-o 10sketches.mds.png \
-C /opt/shared/sourmash-data/10sketches/10sketches-categories.csv
to produce 10sketches.mds.png:

Flat (no abund) metagenome comparisons via venn and upset
Run a venn comparison of metagenomes =>
metag-venn.png.
sourmash scripts venn SRR*.sig.zip \
-o metag-venn.png

Run an upset comparison of metagenomes => metag-upset.png
sourmash scripts upset SRR*.sig.zip \
--show-singletons -o metag-upset.png

Genome presence/absence
Calculate which GTDB genomes are in a metagenome:
sourmash scripts fastgather SRR7947178.sig.zip \
/opt/shared/sourmash-data/gtdb-rs214-k31.zip \
-o SRR7947178.x.gtdb-rs214.csv
(will take ~3 minutes)
Make the detection/abundance plot:
sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
-o SRR7947178.detection.png
=> SRR7947178.detection.png:

sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
-o SRR7947178.ani.png \
--ani
=> SRR7947178.ani.png

The three columns being plotted from SRR7947178.x.gtdb-rs214.csv are:
f_match_orig- the detection / containment of the match in the metagenome;match_containment_ani- the containment-based ANI of the match;average_abund- abundance of matching k-mers in the metagenome;
Generating taxonomic classifications for metagenomes with sourmash
The basic workflow is to first run sourmash gather as above, and then run sourmash tax metagenome.
Run gather for each metagenome:
sourmash scripts fastgather SRR7947179.sig.zip \
/opt/shared/sourmash-data/gtdb-rs214-k31.zip \
-o SRR7947179.x.gtdb-rs214.csv
sourmash scripts fastgather SRR7947181.sig.zip \
/opt/shared/sourmash-data/gtdb-rs214-k31.zip \
-o SRR7947181.x.gtdb-rs214.csv
Then you can run taxonomic analyses like so:
sourmash tax metagenome -g *.x.gtdb-rs214.csv \
-t /opt/shared/sourmash-data/gtdb-rs214.lineages.sqldb \
-r order -F human
See the sourmash tax documentations for various output options!
Getting % of metagenome covered from gather
If you've run a fastgather above, you will still need to run sourmash gather to get nice human-readable output; you can do this quickly by using the fastgather output as a picklist to limit the results to the previous calculated output:
sourmash gather --picklist SRR7947179.x.gtdb-rs214.csv::prefetch \
SRR7947179.sig.zip \
/opt/shared/sourmash-data/gtdb-rs214-k31.zip
and you should then see at the end:
found 849 matches total;
the recovered matches hit 83.1% of the abundance-weighted query.
the recovered matches hit 68.7% of the query k-mers (unweighted).
Running abundhist to generate metagenome abundance histograms
After activating the smash conda environment, you can install the abundhist plugin like so:
pip install sourmash_plugin_abundhist
and run it on any metagenome or mixture that has been sketched with -p abund like so:
sourmash scripts abundhist SRR7947178.sig.zip -M 100 \
--figure SRR7947178.abundhist.png
Other topics we can discuss!
- Building your own sourmash database / extending the existing databases with your own private/unpublished genomes
Appendix 1: UNIX Command Line!
Totally new to the command-line, or want to strengthen your foundation? Do this Unix Crash Course.
If you want some experience in other aspects, consider going through these:
Appendix 2: Conda tutorial
Please see my conda tutorial; you'll need to install miniforge if you're not running things on your Jetstream computer.
Appendix 3: examining assembly overlap with k-mers
See my in-class notes.
Contact info
Contact Titus at [email protected].
TODO: make setup instructions / files available somewhere!
installation stuff/sourmash
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
conda env create -n smash-foo -f /opt/shared/sourmash-data/env-smash.yml
Appendix: sourmash env file
cat > env-smash.yml <<EOF
name: smash
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- sourmash>=4.8.10,<5
- seaborn
- scikit-learn
- seaborn
- sourmash_plugin_branchwater==0.9.5
- pip
- pip:
- sourmash_plugin_betterplot
- sourmash_plugin_abundhist
EOF
Appendix 2
PatientHNC_06_illumina
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_2.fastq.gz
PatientHNC_03_illumina
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_2.fastq.gz
PatientHNC_10_illumina
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_2.fastq.gz
Sketching
sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz -p k=31,abund -o SRR7947178.sig.zip --name SRR7947178
sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz -p k=31,abund -o SRR7947179.sig.zip --name SRR7947179
sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz -p k=31,abund -o SRR7947181.sig.zip --name SRR7947181
thoughts for next year -
- emphasize "coverage of metagenome by references"
- ...and then provide alternative suggestions for what to do if coverage is poor :)