C. Titus Brown
some benchmarking:
```
/usr/bin/time -v sourmash sig cat \
    /group/ctbrowngrp5/sourmash-db/gtdb-rs226/gtdb-rs226-reps.k31.sig.zip \
    /group/ctbrowngrp5/sourmash-db/gtdb-rs226/gtdb-rs226-reps.k21.sig.zip \
    -o /scratch/ctbrown/foo.sig.zip
```
yielded:
```
loaded 143384 signatures from '/group/ctbrowngrp5/sourmash-db/gtdb-rs226/gtdb-rs
loaded 143384 signatures from '/group/ctbrowngrp5/sourmash-db/gtdb-rs226/gtdb-rs
loaded 286768 signatures total, from 2 files...
```
oh, I think I see the problem:
```
--from-file genbank.20250408.missing.sig.zip.batchlist-with-manual.txt
```
This is using `MultiIndex` which loads everything into memory per https://github.com/sourmash-bio/sourmash/issues/1899 **INCORRECT SEE BELOW**
No, that wasn't it, either -- `--from-file` does the right thing and adds the file contents onto the command line, while using a straight up file w/o `--from-file` does the...
Specifically:
```
Command being timed: "sourmash sig cat --from-file list.txt --o /scratch/ctbrown/foo.sig.zip"
User time (seconds): 1166.80
System time (seconds): 29.80
Percent of CPU this job got: 100%
Elapsed (wall clock)...
```
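
To compare runs like these without eyeballing the logs, here is a rough sketch that parses the `key: value` lines printed by GNU `/usr/bin/time -v`. The helper names (`parse_time_v`, `cpu_seconds`, `max_rss_kb`) are made up for illustration and are not part of sourmash. The sample only contains the fields quoted above; `Maximum resident set size (kbytes)` is the standard GNU time field to watch for memory, but it is cut off in the log above, so the sketch just reports `None` for it.

```python
def parse_time_v(report: str) -> dict[str, str]:
    """Parse the 'key: value' lines of GNU `/usr/bin/time -v` output."""
    fields = {}
    for line in report.splitlines():
        # Split on the LAST ': ' so labels that contain colons, such as
        # 'Elapsed (wall clock) time (h:mm:ss or m:ss)', stay in one piece.
        # Assumption: the timed command itself does not contain ': '.
        key, sep, value = line.strip().rpartition(": ")
        if sep:
            fields[key] = value
    return fields


def cpu_seconds(fields: dict[str, str]) -> float:
    """User plus system CPU time, in seconds."""
    return float(fields["User time (seconds)"]) + float(fields["System time (seconds)"])


def max_rss_kb(fields: dict[str, str]):
    """Peak resident memory in kB, or None if the field is not in the log."""
    value = fields.get("Maximum resident set size (kbytes)")
    return int(value) if value is not None else None


# Fields copied from the run above; the rest of that log is truncated.
sample = """\
Command being timed: "sourmash sig cat --from-file list.txt --o /scratch/ctbrown/foo.sig.zip"
User time (seconds): 1166.80
System time (seconds): 29.80
Percent of CPU this job got: 100%
"""

fields = parse_time_v(sample)
print(f"CPU seconds: {cpu_seconds(fields):.2f}")
print(f"Peak RSS (kB): {max_rss_kb(fields)}")
```

Running the same parser over the log of each variant and comparing `max_rss_kb` would show the memory difference directly.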
ok, I verified that just using a pathlist directly on the command line is problematic; it loads everything into memory.
```
Command being timed: "sourmash sig cat list.txt --o /scratch/ctbrown/foo.sig.zip"...
```
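
A possible stopgap, sketched under assumptions: expand the pathlist ourselves and pass each entry as its own argument, which mirrors the behavior attributed to `--from-file` above. This is not how sourmash implements `--from-file`; `expand_pathlist` and `build_sig_cat_command` are hypothetical helpers, and the file names in the demo are placeholders. Very long lists may run into the OS limit on command-line length.

```python
import shlex
import tempfile
from pathlib import Path


def expand_pathlist(pathlist) -> list[str]:
    """Return the non-empty, whitespace-stripped lines of a pathlist file."""
    lines = Path(pathlist).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_sig_cat_command(pathlist, output: str) -> list[str]:
    """Build a `sourmash sig cat` argv with every listed file as its own argument."""
    return ["sourmash", "sig", "cat", *expand_pathlist(pathlist), "-o", output]


# Demo with a throwaway pathlist; the entries are placeholder names.
with tempfile.TemporaryDirectory() as tmp:
    listfile = Path(tmp) / "list.txt"
    listfile.write_text("a.sig.zip\n\nb.sig.zip\n")
    cmd = build_sig_cat_command(listfile, "out.sig.zip")
    print(shlex.join(cmd))
    # To actually run it (requires sourmash on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Prefixing the resulting command with `/usr/bin/time -v` gives a run that can be compared against the two timings above.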
ref https://github.com/sourmash-bio/sourmash/issues/3683
https://github.com/sourmash-bio/sourmash/issues/3420
See also [intro to our computing infrastructure](https://hackmd.io/zEpAkqJGTP2DNUmso8jNcw?view)
could/should/might check file contents - e.g. lineage.csv headers: https://github.com/sourmash-bio/sourmash/issues/3628#issuecomment-2872790920
ref https://github.com/sourmash-bio/sourmash/issues/3517 also.