sourmash
expanding database selection methods: metadata
Discussion in https://github.com/sourmash-bio/sourmash/pull/2178 reminded me of some things @ctb and I talked about a while back that seem like far less of a leap now.
With new selection and subsetting functionality (`sig grep`, `tax grep`, `sig extract`, `tax extract`, etc.) being increasingly fleshed out and useful, we could generally enable a manifest-style file with metadata (`METADATA.csv`/`sql`?) for signatures and support (generating picklists for) subsetting across it.
Current use case example:
When we run MAGsearch, we postprocess the results to link matches with their SRA metadata. We could instead (or in addition) build a lineages-style sqldb for SRA runinfo metadata as a complementary manifest.
This would allow us to do:
- metadata selection, e.g. "seawater metagenome" to enable SRA search/MAGsearch on just samples with that metadata. This could be really handy for times where we don't want to search the entire database -- assuming picklists make it into SRA search, I guess. It would be extra neat if metadata categories were hierarchical so that we could use extract to scale up, but afaik that's not how the info is organized, so this is more of a dream than a concrete use case.
- `tax annotate`-style annotation (or perhaps `metadata annotate`?)
As with the current functions, we use the metadata to select the identifiers we want, which we then use to select signatures for output/search/etc.
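To make the select-identifiers-then-select-signatures flow concrete, here is a minimal sketch using only the Python standard library. It is not existing sourmash code: the table layout, the `Run`/`ScientificName` column names, the helper functions, and the accessions are all illustrative assumptions.

```python
import csv
import io
import sqlite3

# Made-up runinfo-style rows; column names and accessions are placeholders.
RUNINFO_CSV = """\
Run,ScientificName
SRR0000001,seawater metagenome
SRR0000002,gut metagenome
SRR0000003,seawater metagenome
"""


def build_metadata_db(csv_text):
    """Load metadata rows into an in-memory SQLite table keyed by identifier."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metadata (ident TEXT PRIMARY KEY, organism TEXT)")
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO metadata (ident, organism) VALUES (?, ?)",
        ((r["Run"], r["ScientificName"]) for r in rows),
    )
    conn.commit()
    return conn


def select_idents(conn, organism):
    """Return the identifiers whose metadata matches the requested value."""
    cur = conn.execute(
        "SELECT ident FROM metadata WHERE organism = ? ORDER BY ident",
        (organism,),
    )
    return [row[0] for row in cur]


def write_picklist(idents, handle):
    """Write a one-column CSV of identifiers, usable as a picklist file."""
    writer = csv.writer(handle)
    writer.writerow(["ident"])
    for ident in idents:
        writer.writerow([ident])


conn = build_metadata_db(RUNINFO_CSV)
idents = select_idents(conn, "seawater metagenome")
out = io.StringIO()
write_picklist(idents, out)
print(out.getvalue())
```

Written to disk, a file like this could presumably be handed to commands that already accept picklists, e.g. `--picklist picklist.csv:ident:ident`, assuming the identifiers match the signature names.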
The most proximal use case is for MAGsearch, but I think it could also be really useful for reference databases if there was additional metadata that would be useful to subset on -- e.g. quality, completeness, contamination, database source.
Ok this part is far less well-defined: Thinking a bit about LIN groups and taxonomy that does not fit our current standard hierarchy. I wonder if we could allow these in the metadata file, with a corresponding json or similar that defines any (optional) hierarchical nature of the categories.
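One way the companion json could look, as a rough sketch: each category maps to its parent, and selecting a category expands to everything beneath it. The file layout, the `descendants` helper, and the example hierarchy are all hypothetical; nothing here reflects how SRA or sourmash actually organize these terms.

```python
import json

# Hypothetical companion file: each category maps to its parent (null = top level).
HIERARCHY_JSON = """
{
  "metagenome": null,
  "aquatic metagenome": "metagenome",
  "seawater metagenome": "aquatic metagenome",
  "freshwater metagenome": "aquatic metagenome",
  "gut metagenome": "metagenome"
}
"""


def descendants(parents, category):
    """Return `category` plus every category beneath it, sorted by name."""
    # Invert the child -> parent mapping into parent -> [children].
    children = {}
    for child, parent in parents.items():
        children.setdefault(parent, []).append(child)

    found = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current in found:  # guard against accidental cycles in the file
            continue
        found.add(current)
        stack.extend(children.get(current, []))
    return sorted(found)


parents = json.loads(HIERARCHY_JSON)
expanded = descendants(parents, "aquatic metagenome")
print(expanded)
```

The expanded list could then feed the same metadata selection as a flat category would, so the hierarchy stays optional: a metadata file with no json alongside it just behaves as flat tags.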
I guess the way I'm thinking about it is that `taxonomy` is a specific case, but `metadata` could be more flexible. @ctb there was a specific sort of tagging you suggested we could tie into when we talked about this (...last year??), but I can't remember the details.
See folksonomies in particular, mentioned in https://github.com/sourmash-bio/sourmash/issues/1916 and https://github.com/sourmash-bio/sourmash/issues/268#issuecomment-305990650
Continuing that thought - `sig grep` seems like the place to do this, or perhaps something specific to manifests where we can link signature identifiers/names to generic metadata.
I think expanding standalone manifests to support this kind of thing is the way to go - explicit shoutout to https://github.com/sourmash-bio/sourmash/issues/1916.
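As a rough sketch of what linking manifest entries to generic metadata could mean in practice: take the identifier from each signature name, look it up in a metadata table, and subset on the attached columns. The toy manifest below shows only `name` and `md5` (a real manifest has more columns), and the metadata columns, accessions, and helper names are made up for illustration.

```python
import csv
import io

# Toy manifest rows; values are placeholders, and only two columns are shown.
MANIFEST_CSV = """\
name,md5
GCA_000000001.1 Example organism A,aaaa
GCA_000000002.1 Example organism B,bbbb
"""

# Hypothetical metadata file keyed by identifier.
METADATA_CSV = """\
ident,completeness,source
GCA_000000001.1,98.5,genbank
GCA_000000002.1,71.0,genbank
"""


def ident_from_name(name):
    """Assume the identifier is the first space-delimited token of the name."""
    return name.split(" ", 1)[0]


def annotate_manifest(manifest_text, metadata_text):
    """Attach metadata columns to each manifest row, matching on identifier."""
    metadata = {
        row["ident"]: row for row in csv.DictReader(io.StringIO(metadata_text))
    }
    annotated = []
    for row in csv.DictReader(io.StringIO(manifest_text)):
        extra = metadata.get(ident_from_name(row["name"]), {})
        merged = dict(row)
        merged.update({k: v for k, v in extra.items() if k != "ident"})
        annotated.append(merged)
    return annotated


rows = annotate_manifest(MANIFEST_CSV, METADATA_CSV)
# Subset on a metadata column, keeping the md5s of signatures that pass.
high_quality = [r["md5"] for r in rows if float(r.get("completeness", 0)) >= 90]
print(high_quality)
```

Rows without a metadata match simply pass through unannotated here; whether that should be an error or a warning is an open design choice.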