
tutorial: combine results from multiple `fastgather` database searches using picklists

Open ctb opened this issue 1 month ago • 0 comments

https://hackmd.io/ph9Q4nGHTb6T1mVJSV5yqw?view

ref https://github.com/sourmash-bio/sourmash/issues/3233#issuecomment-2203433558, cc @musquita

combining results from multiple different gathers

Suppose you are using fastgather, and you want to combine the results from searching against multiple zip files.

Or, more generally, you have a bunch of gather results from a bunch of different databases, and you want to combine them all and do one final gather, to get the one gather result to rule them all.

You have two options:

Option 1: merge the gather results and build a new (smaller) database

In brief,

  • merge the gather results;
  • use them as a picklist with sourmash sig cat to build a new database;
  • run gather/fastgather against that database.

This has the advantage of being pretty easy, and because you end up with a single .zip file, you can use fastgather or fastmultigather to do the final search.

Option 2: merge the gather results and use them as a picklist with sourmash gather

In brief,

  • merge the gather results;
  • run sourmash gather with them as a picklist, against multiple databases.

This has the advantage of involving less duplication of data, since you're not creating a new database, but it might be slower because the final search runs through sourmash gather rather than fastgather.

Example

Construct some fake DBs:

sourmash sig grep Akkermansia gtdb-rs207.genomic.k31.zip -o akker.db.zip
sourmash sig grep Shewanella gtdb-rs207.genomic.k31.zip -o shew.db.zip

Make a fake metagenome:

sourmash sig merge podar-ref/{2,47,63}.fa.sig -o fake-metag.sig.zip -k 31

Build two fastgather results:

sourmash scripts fastgather fake-metag.sig.zip shew.db.zip -o fake.x.shew.fastgather.csv

sourmash scripts fastgather fake-metag.sig.zip akker.db.zip -o fake.x.akker.fastgather.csv

Combine the fastgather results:

# copy first file
cp fake.x.shew.fastgather.csv combined.fastgather.csv

# append everything after the header line of the second file; repeat for each remaining file
tail -n +2 fake.x.akker.fastgather.csv >> combined.fastgather.csv
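If you have more than two result files, the copy-then-append steps above can be collapsed into a single awk call. This is a minimal sketch, not part of sourmash: it assumes every CSV has the same header line, and it runs in a hypothetical scratch directory (picklist-demo) on tiny stand-in CSVs whose columns and values are made up for illustration rather than taken from real fastgather output.

```shell
# Scratch directory with stand-in CSVs (hypothetical contents, for illustration only).
mkdir -p picklist-demo
printf 'match_name,match_md5\ngenomeA,aaa111\n' \
    > picklist-demo/fake.x.shew.fastgather.csv
printf 'match_name,match_md5\ngenomeB,bbb222\ngenomeC,ccc333\n' \
    > picklist-demo/fake.x.akker.fastgather.csv

# Keep the header from the first file only (NR == 1 is the first line overall),
# and keep every non-header line (FNR > 1) from all files.
# The output name must not match the input glob, or it would be read as input.
awk 'FNR > 1 || NR == 1' picklist-demo/fake.x.*.fastgather.csv \
    > picklist-demo/combined.fastgather.csv

# Sanity check: exactly one header line should remain.
grep -c '^match_name,' picklist-demo/combined.fastgather.csv
```

For the real files in this example, the equivalent would be to point the glob at fake.x.*.fastgather.csv in your working directory and write to combined.fastgather.csv.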

Note: When using fastgather output in picklists, use picklist file type prefetch because the fastgather and fastmultigather commands output CSVs with match_name instead of name. If you're using gather output as a picklist, use picklist file type gather. Apologies for the confusion; this will be resolved in sourmash v5.
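If you're not sure which tool produced a given CSV, you can look at its header line before choosing the picklist type. The helper below, picklist_type, is a hypothetical convenience function and not part of sourmash. It is a rough sketch that assumes an unquoted, comma-separated header, and it only applies the rule from the note above: a match_name column means prefetch, a name column means gather. The demo CSVs are stand-ins with made-up columns and values.

```shell
# Stand-in CSVs (hypothetical contents, for illustration only).
mkdir -p picklist-demo
printf 'query_name,match_name,match_md5\nmetag,genomeA,aaa111\n' \
    > picklist-demo/from-fastgather.csv
printf 'intersect_bp,name,md5\n1000,genomeA,aaa111\n' \
    > picklist-demo/from-gather.csv

# Hypothetical helper: print the picklist type suggested by the CSV header.
picklist_type() {
    # Split the header into one column name per line (dropping any CR from CRLF files).
    header=$(head -n 1 "$1" | tr -d '\r' | tr ',' '\n')
    if printf '%s\n' "$header" | grep -qx 'match_name'; then
        echo prefetch
    elif printf '%s\n' "$header" | grep -qx 'name'; then
        echo gather
    else
        echo "no name or match_name column found in $1" >&2
        return 1
    fi
}

picklist_type picklist-demo/from-fastgather.csv
picklist_type picklist-demo/from-gather.csv
```

The printed value is what goes after the double colon in the --picklist argument, for example combined.fastgather.csv::prefetch.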

Example of option 1: build a new database

Now, use the fastgather results as a picklist to build a new subset database:

sourmash sig cat --picklist combined.fastgather.csv::prefetch \
    akker.db.zip shew.db.zip \
    -o combined.db.zip

Now rerun fastgather against the combined database:

sourmash scripts fastgather fake-metag.sig.zip combined.db.zip \
    -o fake.x.combined.csv

Fun, profit! Your final gather results are in fake.x.combined.csv.

Example of option 2: use a picklist with sourmash gather

You can generate the same result as above, but without building a new database, by running sourmash gather with a picklist and multiple databases:

sourmash gather --picklist combined.fastgather.csv::prefetch \
    fake-metag.sig.zip \
    akker.db.zip shew.db.zip \
    -o fake.x.combined.2.csv

ctb avatar Jul 03 '24 04:07 ctb