sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

simplify/refactor `load_*` sig/db functions in `sourmash_args.py`

Open ctb opened this issue 2 years ago • 8 comments

There's a confusing mess of loading functions in sourmash_args.py that's slowly converging as we simplify and refactor our Index handling code.

In no particular order,

there's query loading code, load_query_signature. This is responsible for loading a single signature. I think it can probably stay.

there's generic file loading code, load_file_as_signatures and load_file_as_index. I think these can be combined pretty easily now, since they are both simple wrappers around another function.

there's subject loading code, load_many_signatures and load_dbs_and_sigs, which I bet could be combined. The main difference seems to be in diagnostic output.

there's utility functions, traverse_find_sigs and load_pathlist_from_file, which can probably be moved under sourmash.index now, or otherwise refactored out of sourmash_args.

and finally the SignatureLoadingProgress class can probably go away, since CLI functions are mostly moving away from loading long lists of signatures.

ctb avatar Mar 10 '22 15:03 ctb

more generally, per throwaway comment in https://github.com/sourmash-bio/sourmash/pull/1871, would be good to take a holistic look at all the crud in the argument loading/parsing code for commands.py and sig/__main__.py and figure out how much of it can be put in one place. I'm guessing a lot of the signature loading compatibility code can be refactored into a single function.

ctb avatar Mar 10 '22 16:03 ctb

also relevant: picking query ksize automatically to match provided databases https://github.com/sourmash-bio/sourmash/issues/809

ctb avatar Mar 12 '22 16:03 ctb

ref https://github.com/sourmash-bio/sourmash/issues/1894 - remove is_database?

ctb avatar Mar 25 '22 14:03 ctb

ref https://github.com/sourmash-bio/sourmash/issues/1312

ctb avatar May 04 '22 14:05 ctb

ref https://github.com/sourmash-bio/sourmash/issues/1062 and https://github.com/sourmash-bio/sourmash/issues/1060

ctb avatar May 04 '22 14:05 ctb

wow, this has become a tangled mess of interconnected issues with no clear path forward. yay!

the main comment from the issues that resonates most is from https://github.com/sourmash-bio/sourmash/issues/1426:

I'm wondering if the right answer is to track the total number of signatures in a collection (using e.g. manifests) and when doing a search of some kind, provide a generic indicator of what fraction of the collection is actually being searched? This should be straightforward.

ctb avatar Aug 03 '22 10:08 ctb

PR https://github.com/sourmash-bio/sourmash/pull/2204 cleans up signature load/selection reporting for load_dbs_and_sigs.

ctb avatar Aug 14 '22 16:08 ctb