`sourmash tax genome` may fail silently with `--force` and incorrect gather inputs

Open bluegenes opened this issue 1 year ago • 1 comments

`sourmash tax genome --force` fails silently or yields a confusing error if you pass in multiple gather results for the same query

I accidentally passed in both k7 and k10 gather results for a set of queries. Without `--force`, this fails nicely, saying more than one gather file was found for a particular query. With `--force`, we read both files in and then aggregate gather results across them. You then MIGHT get an error that the summarized percentage was > 100% (more than 100% of the query was matched), which should never happen. If the summed percentages had come out below 100%, the run would have completed without any error and given incorrect results.

I think maybe we should never allow seeing a query in multiple gather files (disallow this force behavior).

  • Now that ksize, moltype, and scaled are parameters in the gather csv, we should also check these and only allow summarization over the same params, for SAFETY! This would also have fixed the above issue.
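A minimal sketch of the two checks proposed above, for illustration only; this is not sourmash's actual implementation. It assumes each gather CSV row carries a `query_name` column alongside `ksize`, `moltype`, and `scaled`, and the helper names `check_gather_rows` and `load_gather_csvs` are hypothetical.

```python
import csv
from collections import defaultdict

# Assumed column names in the gather CSV; adjust if the real layout differs.
QUERY_COL = "query_name"
PARAM_COLS = ("ksize", "moltype", "scaled")


def check_gather_rows(rows_by_file):
    """Validate gather rows before summarizing (hypothetical helper).

    rows_by_file maps a file name to a list of row dicts.
    Raises ValueError if a query appears in more than one file, or if
    the rows disagree on ksize/moltype/scaled.
    """
    files_for_query = defaultdict(set)
    params_seen = set()
    for filename, rows in rows_by_file.items():
        for row in rows:
            files_for_query[row[QUERY_COL]].add(filename)
            params_seen.add(tuple(row[col] for col in PARAM_COLS))

    duplicated = {q: sorted(f) for q, f in files_for_query.items() if len(f) > 1}
    if duplicated:
        raise ValueError(f"query found in multiple gather files: {duplicated}")
    if len(params_seen) > 1:
        raise ValueError(f"mixed gather parameters: {sorted(params_seen)}")


def load_gather_csvs(paths):
    """Read each gather CSV into a list of row dicts, keyed by path."""
    rows_by_file = {}
    for path in paths:
        with open(path, newline="") as fp:
            # An empty CSV simply yields an empty list of rows.
            rows_by_file[path] = list(csv.DictReader(fp))
    return rows_by_file
```

Running `check_gather_rows(load_gather_csvs(paths))` before aggregation would reject the k7 + k10 mix-up regardless of whether the summed percentage happens to exceed 100%, while still letting empty files through.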

bluegenes, Jul 26 '22 19:07

Note that we need `--force` to continue past empty gather csvs, so fixing this is important (nudges self)

bluegenes, Aug 10 '22 17:08

`--force` now works properly for empty taxonomies, too; fixed in #2218.

ctb, Aug 29 '22 16:08