sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

Splitting table with strain-level resolution (R code)

Open jorondo1 opened this issue 2 years ago • 2 comments

Hi @taylorreiter,

I recently tried out your tutorial to run Sourmash Gather and import the output into a phyloseq object (discovered during STAMPS!). It worked out super well on the first try!

However, my second try was using Genebank db, which seems to allow strain-level identifications, contrary to GTDB. This resulted in an eight "piece" which the R script wouldn't handle when separating the lineages.

Turns out simply adding the fill="right" option to separate allows adding a "strains" to the list without getting an error, if you ever want to update the tutorial:

tax_table <- sourmash_taxonomy_results %>%
  select(name, lineage) %>%
  distinct() %>%
  separate(lineage, into = c("domain", "phylum", "class", "order", "family", "genus", "species","strain"), sep = ";", fill="right) %>%
  column_to_rownames("name") %>%
  select_if(~sum(!is.na(.)) > 0)

Also, I noticed the barplot code is not there, the ps object creation command is simply copy-pasted there!

Thanks again for this awesome tutorial, it saved me hours of work.

EDIT : added a line to drop the strain column if it's only NAs (so the code works with and without strain-level resolution)

jorondo1 avatar Sep 22 '22 15:09 jorondo1

I realised we can aggregate to species level using sourmash tax metagenome , so my first remark might not be as relevant as I thought.

jorondo1 avatar Sep 22 '22 19:09 jorondo1

thank you reporting this!

taylorreiter avatar Oct 05 '22 16:10 taylorreiter