sourmash
sourmash copied to clipboard
simplify/refactor `load_*` sig/db functions in `sourmash_args.py`
There's a confusing mess of loading functions in sourmash_args.py
that's slowly converging as we simplify and refactor our Index handling code.
In no particular order,
there's query loading code, load_query_signature
. This is responsible for loading a single signature. I think it can probably stay.
there's generic file loading code, load_file_as_signatures
and load_file_as_index
. I think these can be combined pretty easily now, since they are both simple wrappers around another function.
there's subject loading code, load_many_signatures
and load_dbs_and_sigs
, which I bet could be combined. The main difference seems to be in diagnostic output.
there's utility functions, traverse_find_sigs
and load_pathlist_from_file
, which can probably be moved under sourmash.index
now, or otherwise refactored out of sourmash_args.
and finally the SignatureLoadingProgress
class can probably go away, since CLI functions are mostly moving away from loading long lists of signatures.
more generally, per throwaway comment in https://github.com/sourmash-bio/sourmash/pull/1871, would be good to take a holistic look at all the crud in the argument loading/parsing code for commands.py
and sig/__main__.py
and figure out how much of it can be put in one place. I'm guessing a lot of the signature loading compatibility code can be refactored into a single function.
also relevant: picking query ksize automatically to match provided databases https://github.com/sourmash-bio/sourmash/issues/809
ref https://github.com/sourmash-bio/sourmash/issues/1894 - remove is_database
?
ref https://github.com/sourmash-bio/sourmash/issues/1312
ref https://github.com/sourmash-bio/sourmash/issues/1062 and https://github.com/sourmash-bio/sourmash/issues/1060
wow, this has become a tangled mess of interconnected issues with no clear path forward. yay!
the main comment from the issues that resonates most is from https://github.com/sourmash-bio/sourmash/issues/1426:
I'm wondering if the right answer is to track the total number of signatures in a collection (using e.g. manifests) and when doing a search of some kind, provide a generic indicator of what fraction of the collection is actually being searched? This should be straightforward.
PR https://github.com/sourmash-bio/sourmash/pull/2204 cleans up signature load/selection reporting for load_dbs_and_sigs
.