`multigather` CSV output uses signature `filename` as basename.
In #2321 and https://github.com/sourmash-bio/sourmash/pull/2322 we delve back into multigather... and I remembered how annoying the CSV output is, in that it is written to a file named after the signature filename for each query.
At the very least it would be good to have an option to name the output something else, e.g. after the md5sum. For 4.x this would be an option, and we could make it the default for v5.
An alternative is to deprecate multigather per https://github.com/sourmash-bio/sourmash/issues/1614.
should support `ident`-based output, as well as `md5short`-based output.
This is tackled over in #2065 by @olgabot.
A few observations and opinions -
- code in #2065 breaks tests & semantic versioning because it adds md5sum willy nilly. Working on that in https://github.com/sourmash-bio/sourmash/pull/2722 where I add `-U/--output-add-query-md5sum`.
- it doesn't address issues with 'dumb' filenames like '-' either (which is encoded in the tests, but ...seriously, should be changed/checked for).
- I feel like we should be detecting/flagging file output overwrites anyway?? Some experimental code shows that it actually happens in two of our tests 😱
  - specifically, in `test_multigather_metagenome_sbt_query_from_file_with_addl_query` and `test_multigather_metagenome_query_with_sbt_addl_query`, output is overwritten, because the query `GCF_000195995.1_ASM19599v1_genomic.fna.gz` is in `gcf_all.sbt.zip` as well.
Provisional resolution per #2722 would be -
- fail loudly and clearly when overwrites are happening!!
- support `-U/--output-add-query-md5sum`
- handle `filename == '-'`
- this would be a change in behavior.
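
For concreteness, here is a minimal Python sketch of the three points above. It is not the code in #2722: `output_base`, `OutputTracker`, and the 8-character md5 prefix are hypothetical names and choices made only for illustration.

```python
import os


def output_base(query_filename, query_md5, add_md5=False):
    """Hypothetical helper: choose a basename for per-query output files."""
    # A missing filename or '-' is not a usable basename, so fall back
    # to a short md5 prefix (the length 8 is an assumption).
    if not query_filename or query_filename == "-":
        return query_md5[:8]
    base = os.path.basename(query_filename)
    if add_md5:
        # Mimics the idea behind -U/--output-add-query-md5sum.
        base = f"{base}.{query_md5[:8]}"
    return base


class OutputTracker:
    """Hypothetical guard that fails loudly on a repeated output path."""

    def __init__(self):
        self._seen = set()

    def claim(self, path):
        if path in self._seen:
            raise ValueError(f"refusing to overwrite output file '{path}'")
        self._seen.add(path)
        return path


# Usage: two queries loaded from the same filename collide unless the
# md5 prefix is added to the basename.
tracker = OutputTracker()
tracker.claim(output_base("q.fna.gz", "aaaaaaaa1111", add_md5=True) + ".csv")
tracker.claim(output_base("q.fna.gz", "bbbbbbbb2222", add_md5=True) + ".csv")
```

Note that the tracker only catches collisions within a single run; files already on disk from an earlier run would need a separate existence check.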
Yes to these!
A few more thoughts on https://github.com/sourmash-bio/sourmash/pull/2722 -
- we could also support alternative output formats for `*.matches.sig` and `*.unassigned.sig` with `-E/--extension` (see https://github.com/sourmash-bio/sourmash/issues/2703, https://github.com/sourmash-bio/sourmash/pull/2712).
- we could/should allow overwrites to skip, either with `-f/--force` or with a new flag. Here my concern is that for large enough query databases, there will be sketches with identical md5sum (in which case the output will be the same!) Or... perhaps it would be enough to simply say, if the md5sum is identical, the results are identical, so we're not going to run the gather?
Taking a step back - what do we want to be able to do with multigather?
- analyze large input collections, including singleton collections;
- be assured we have all of the (distinct?) results somewhere and be able to load them!
- identify results for a specific query and separate them out from the rest of the results;
- (maybe) use results from multigather as a picklist?
Things to confirm:
- [ ] as of 4.8.3 (before any of these changes) multigather is "up to date" with gather output
- [ ] query filename matches what's in the nascent docs for multigather (should it be name of file sketched? or name of file from which sketch was loaded?)
- [ ] the filename naming scheme as proposed works for glob-style downstream loading and concatenation
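
One way to check the last item is to actually try glob-style loading. The sketch below uses only the Python standard library; `concat_gather_csvs` is a hypothetical helper, and it assumes nothing about the columns beyond each file having a header row.

```python
import csv
import glob
import os


def concat_gather_csvs(pattern, out_path):
    """Concatenate every CSV matching `pattern` into `out_path`.

    Returns the number of data rows written.
    """
    out_abs = os.path.abspath(out_path)
    # Skip the output file itself in case it matches the pattern on a re-run.
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]

    rows = []
    fieldnames = []
    for path in paths:
        with open(path, newline="") as fp:
            reader = csv.DictReader(fp)
            # Take the union of columns, preserving first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)

    with open(out_path, "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

If the naming scheme works, a single pattern such as `*.csv` in the output directory should pick up every per-query file and nothing else.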
Things to resolve:
- [ ] are we ok with breaking multigather CLI backwards compat? maybe yes: It wasn't documented in any way as of 4.8.3. anyway 🤷
- [ ] what do we do with identical queries (based on md5sum)? do we complain about overwriting content, do we not run them since they've already been run, or do we do something else? and how does this impact downstream parsing?
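
To make the "do not run them again" option concrete, here is a small sketch. It illustrates only one of the choices listed above, not a decision, and `plan_queries` is a hypothetical function operating on plain `(name, md5)` pairs rather than on sourmash signature objects.

```python
def plan_queries(queries):
    """Split (name, md5) pairs into queries to run and duplicates to skip.

    A query is skipped when an earlier query had the same md5; the skipped
    entry records which earlier query its results can be found under.
    """
    first_seen = {}
    to_run = []
    skipped = []
    for name, md5 in queries:
        if md5 in first_seen:
            skipped.append((name, first_seen[md5]))
        else:
            first_seen[md5] = name
            to_run.append((name, md5))
    return to_run, skipped


to_run, skipped = plan_queries([
    ("genomeA", "aaaa"),
    ("genomeB", "bbbb"),
    ("genomeA-copy", "aaaa"),
])
# to_run  -> [("genomeA", "aaaa"), ("genomeB", "bbbb")]
# skipped -> [("genomeA-copy", "genomeA")]
```

Reporting the skipped list matters for downstream parsing: anything that expects one output per input query needs to know where the duplicate's results live.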
Just adding a vote here for allowing `multigather` to output single `csv` and `zip` files containing information from all query sigs.
- downstream gather `csv` summarization now uses the query information (name, md5sum, etc) to ensure that summarization is only done for the same query.
- For `matches` and `unassigned`, we could output each to a zipfile, where individual sigs could then be accessed downstream via picklists or split via `sig split`. Sigs within would still need to be named appropriately.
This would likely be especially useful when dealing with extremely large numbers of queries and/or for contig-level gather.
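
As a rough illustration of how a single combined CSV could still be separated per query downstream, the sketch below groups rows by a query column. The default column name `query_md5` is an assumption about the CSV layout, and `group_rows_by_query` is a hypothetical helper.

```python
import csv
from collections import defaultdict


def group_rows_by_query(csv_path, key="query_md5"):
    """Group the rows of one combined CSV by the value in column `key`."""
    groups = defaultdict(list)
    with open(csv_path, newline="") as fp:
        for row in csv.DictReader(fp):
            groups[row[key]].append(row)
    return dict(groups)
```

Each value in the returned dict is the list of result rows for one query, so results for a specific query can be pulled out without relying on per-query filenames.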
note also connection with contig gather https://github.com/sourmash-bio/sourmash/issues/2564 - sketch genome with `--singleton` and then multigather => contig gather.
https://github.com/sourmash-bio/sourmash/pull/2722 has been merged!
I will look through this issue and extract undone things and useful ruminations into a new issue.