`multigather` CSV output uses signature `filename` as basename.

Open ctb opened this issue 2 years ago • 8 comments

In #2321 and https://github.com/sourmash-bio/sourmash/pull/2322 we delve back into multigather... and I remembered how annoying the CSV output is, in that each query's results are written to a file named after the signature filename.

At the very least it would be good to have an option to name the output after something else, like the md5sum. For 4.x this would be an option, and we could make it the default for v5.

An alternative is to deprecate multigather per https://github.com/sourmash-bio/sourmash/issues/1614.

ctb avatar Oct 13 '22 13:10 ctb

should support ident-based output, as well as md5short based output.

ctb avatar Oct 15 '22 13:10 ctb

This is tackled over in #2065 by @olgabot.

A few observations and opinions -

  • code in #2065 breaks tests & semantic versioning because it adds md5sum willy nilly. Working on that in https://github.com/sourmash-bio/sourmash/pull/2722 where I add -U/--output-add-query-md5sum
  • it doesn't address issues with 'dumb' filenames like '-' either (which is encoded in the tests, but ...seriously, should be changed/checked for).
  • I feel like we should be detecting/flagging file output overwrites anyway?? Some experimental code shows that it actually happens in two of our tests 😱
    • specifically, in test_multigather_metagenome_sbt_query_from_file_with_addl_query and test_multigather_metagenome_query_with_sbt_addl_query, output is overwritten, because the query GCF_000195995.1_ASM19599v1_genomic.fna.gz is in gcf_all.sbt.zip as well.

Provisional resolution per #2722 would be -

  • fail loudly and clearly when overwrites are happening!!
  • support -U/--output-add-query-md5sum
  • handle filename == '-' - this would be a change in behavior.
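The three points above could be sketched roughly as follows. This is only an illustration, not the code in #2722: the helper name `multigather_output_base`, the 8-character md5 prefix, and the choice to fall back to the md5 prefix when the filename is `-` are all assumptions made for the example.

```python
import os


def multigather_output_base(query_filename, query_md5, seen, add_md5=False):
    """Hypothetical helper: pick an output basename for one query.

    `seen` is a set of basenames already used in this run.
    """
    if query_filename in ("", "-"):
        # Assumption: '-' (stdin) or an empty filename is not a usable
        # basename, so fall back to a short md5 prefix.
        base = query_md5[:8]
    else:
        base = os.path.basename(query_filename)
        if add_md5:
            # Rough equivalent of the proposed -U/--output-add-query-md5sum.
            base = f"{base}.{query_md5[:8]}"

    # Fail loudly instead of silently overwriting earlier output.
    if base in seen:
        raise ValueError(f"refusing to overwrite output for basename '{base}'")
    seen.add(base)
    return base
```

With a check like this, a second query with the same filename raises an error instead of silently replacing the first query's CSV.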

ctb avatar Aug 19 '23 16:08 ctb

Yes to these!

olgabot avatar Aug 21 '23 01:08 olgabot

A few more thoughts on https://github.com/sourmash-bio/sourmash/pull/2722 -

  • we could also support alternative output formats for *.matches.sig and *.unassigned.sig with -E/--extension (see https://github.com/sourmash-bio/sourmash/issues/2703, https://github.com/sourmash-bio/sourmash/pull/2712).
  • we could/should allow the overwrite check to be skipped, either with -f/--force or with a new flag. Here my concern is that for large enough query databases, there will be sketches with identical md5sum (in which case the output will be the same!). Or... perhaps it would be enough to simply say: if the md5sum is identical, the results are identical, so we're not going to run the gather again?
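The "identical md5sum means identical results, so don't rerun" idea could look something like the sketch below. It is not sourmash code: the function name `unique_queries` and the `(name, md5)` tuple representation of a query are assumptions for illustration.

```python
def unique_queries(queries):
    """Yield (name, md5) pairs, skipping any query whose md5 was already seen.

    Hypothetical sketch: `queries` is any iterable of (name, md5) tuples.
    """
    seen = set()
    for name, md5 in queries:
        if md5 in seen:
            # Same sketch as an earlier query: results would be identical,
            # so there is nothing new to compute or write.
            continue
        seen.add(md5)
        yield name, md5
```

Only the first query with a given md5sum would be run, which also means nothing is left to overwrite for the later ones.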

ctb avatar Aug 21 '23 15:08 ctb

Taking a step back - what do we want to be able to do with multigather?

  • analyze large input collections, including singleton collections;
  • be assured we have all of the (distinct?) results somewhere and be able to load them!
  • identify results for a specific query and separate them out from the rest of the results;
  • (maybe) use results from multigather as a picklist?

Things to confirm:

  • [ ] as of 4.8.3 (before any of these changes) multigather is "up to date" with gather output
  • [ ] query filename matches what's in the nascent docs for multigather (should it be name of file sketched? or name of file from which sketch was loaded?)
  • [ ] the filename naming scheme as proposed works for glob-style downstream loading and concatenation
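For the last item, glob-style loading and concatenation downstream might look like the following. This is a minimal sketch using only the standard library. The function name `load_gather_csvs` and the added `source_csv` column are assumptions, and it presumes every matched file is a CSV with a header row.

```python
import csv
import glob


def load_gather_csvs(pattern):
    """Load every CSV matching a glob pattern into one list of row dicts."""
    rows = []
    # Sort so the concatenation order is deterministic.
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as fp:
            for row in csv.DictReader(fp):
                # Record which file each row came from.
                row["source_csv"] = path
                rows.append(row)
    return rows
```

For this to work cleanly, the naming scheme has to produce names that one glob pattern can select without also picking up unrelated files.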

Things to resolve:

  • [ ] are we ok with breaking multigather CLI backwards compat? maybe yes: it wasn't documented in any way as of 4.8.3 anyway 🤷
  • [ ] what do we do with identical queries (based on md5sum)? do we complain about overwriting content, do we not run them since they've already been run, or do we do something else? and how does this impact downstream parsing?

ctb avatar Aug 22 '23 16:08 ctb

Just adding a vote here for allowing multigather to output single csv and zip files containing information from all query sigs.

  1. downstream gather csv summarization now uses the query information (name, md5sum, etc) to ensure that summarization is only done for the same query.
  2. For matches and unassigned, we could output each to a zipfile, where individual sigs could then be accessed downstream via picklists or split via sig split. Sigs within would still need to be named appropriately.

This would likely be especially useful when dealing with extremely large numbers of queries and/or for contig-level gather.
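As a rough illustration of point 1, rows from a single combined CSV could be separated per query downstream like this. This is a sketch, not existing sourmash functionality: the function name `group_rows_by_query` is made up, and it assumes the combined file carries a `query_md5` column identifying each row's query.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_query(fp, key="query_md5"):
    """Group rows of a combined gather CSV by a query-identifying column."""
    groups = defaultdict(list)
    for row in csv.DictReader(fp):
        groups[row[key]].append(row)
    return dict(groups)


# Toy data standing in for a combined multigather CSV.
combined = io.StringIO(
    "query_md5,name\n"
    "aaa,match1\n"
    "bbb,match2\n"
    "aaa,match3\n"
)
by_query = group_rows_by_query(combined)
```

Here `by_query["aaa"]` holds the two rows for the first query and `by_query["bbb"]` the single row for the second, so summarization can proceed one query at a time.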

bluegenes avatar Aug 23 '23 15:08 bluegenes

note also connection with contig gather https://github.com/sourmash-bio/sourmash/issues/2564 - sketch genome with --singleton and then multigather => contig gather.

ctb avatar Aug 23 '23 16:08 ctb

https://github.com/sourmash-bio/sourmash/pull/2722 has been merged!

I will look through this issue and extract undone things and useful ruminations into a new issue.

ctb avatar Feb 29 '24 21:02 ctb