[WIP] switch protein annotation stuff over to sqlite, for memory

Open ctb opened this issue 3 years ago • 3 comments

Note: PR into https://github.com/spacegraphcats/spacegraphcats/pull/449

The question at hand: can we do terrible things by (a) shoving sourmash signatures for PFAM into SQLite and then (b) using those signatures to annotate graphs?

The answer seems to be: yes, yes we can.
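For readers who want a picture of what "signatures in SQLite" could look like, here is a minimal sketch using only Python's standard `sqlite3` module. The two-table schema, the function names `build_db` and `count_overlaps`, and the sketch names in the usage note are all hypothetical and chosen for illustration; this is not the schema used by the PR or by sourmash. The point is only that a hash-to-sketch table with an index lets the overlap query run on disk instead of holding every signature in memory.

```python
import sqlite3


def build_db(sketches, path=":memory:"):
    """Store {name: set_of_hashes} in SQLite (hypothetical schema).

    Note: SQLite integers are signed 64-bit, so real unsigned 64-bit
    hash values would need converting first; this sketch uses small ints.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE sketches (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE hashes ("
        " hashval INTEGER NOT NULL,"
        " sketch_id INTEGER NOT NULL REFERENCES sketches(id))"
    )
    # The index is what makes lookups by hash value cheap.
    conn.execute("CREATE INDEX hashes_by_val ON hashes (hashval)")
    for name, hashvals in sketches.items():
        cur = conn.execute("INSERT INTO sketches (name) VALUES (?)", (name,))
        conn.executemany(
            "INSERT INTO hashes VALUES (?, ?)",
            [(h, cur.lastrowid) for h in hashvals],
        )
    conn.commit()
    return conn


def count_overlaps(conn, query_hashes):
    """Return {sketch name: number of query hashes found in that sketch}."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS query (hashval INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM query")
    conn.executemany(
        "INSERT OR IGNORE INTO query VALUES (?)", [(h,) for h in query_hashes]
    )
    rows = conn.execute(
        "SELECT s.name, COUNT(*) FROM query q"
        " JOIN hashes h ON h.hashval = q.hashval"
        " JOIN sketches s ON s.id = h.sketch_id"
        " GROUP BY s.name"
    )
    return dict(rows)
```

Usage would look something like `conn = build_db({"pfamA": {1, 2, 3}, "pfamB": {3, 4}})` followed by `count_overlaps(conn, [2, 3, 99])`. Sketches that share no hashes with the query simply do not appear in the result.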

  • [ ] enable picklists
  • [ ] clean up config documentation, etc.
  • [ ] update this initial comment
  • [ ] explore parallelization of PFAM-scale search via picklists and snakemake or parallel or ...
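On the parallelization item, one possible shape for it is to split the list of signature names into fixed-size picklist files and hand each file to a separate job. The sketch below only writes the CSV chunks; the function name `write_picklist_chunks`, the file naming, and the single `name` column are assumptions made for illustration, and whether the spacegraphcats side accepts picklists is exactly what the first checkbox is about.

```python
import csv
from pathlib import Path


def write_picklist_chunks(names, outdir, chunk_size):
    """Write `names` into CSV files of at most `chunk_size` rows each.

    Each file has a single header column called "name" (an assumed
    layout) and is returned as a Path so a workflow tool can fan out
    one job per file.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(names), chunk_size):
        path = outdir / f"picklist-{start // chunk_size:04d}.csv"
        with open(path, "w", newline="") as fp:
            writer = csv.writer(fp)
            writer.writerow(["name"])
            writer.writerows([n] for n in names[start:start + chunk_size])
        paths.append(path)
    return paths
```

As far as I understand sourmash's picklist syntax, each chunk could then be selected with something like `--picklist picklist-0000.csv:name:name`, with snakemake or parallel running one job per chunk; treat that invocation as an assumption to check against the sourmash docs.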

cc @taylorreiter

ctb, Jan 22 '22 19:01

Commands that are no longer obviously failing:

../2022-sourmash-sqlite/sketch-fasta-to-sqlite.py shew/GCF_000017325.1_ASM1732v1_protein.faa.gz \
    shew/GCF_000021665.1_ASM2166v1_protein.faa.gz -o shew-prots.db --scaled=20 -k 10
python -m spacegraphcats run twofoo-sqlite.yaml twofoo_k31_r1_multifasta_x/multifasta_x.cdbg_annot.csv -p

with config file twofoo-sqlite.yaml:

catlas_base: twofoo
input_sequences:
- twofoo.fq.gz
ksize: 31
radius: 1
search:
- data/2.fa.gz
- data/47.fa.gz
- data/63.fa.gz
shadow_ratio_maxsize: 1000

hashval_ksize: 51
hashval_queries: data/twofoo-k51-hashval-queries.txt

multifasta_reference:
- data/twofoo-genes.fa.gz
multifasta_scaled: 1000
multifasta_query_sig: data/63-os223.sig

multifasta_x_reference:
- shew-prots.db
multifasta_x_protein_ksize: 10
multifasta_x_query_is_dna: false
#multifasta_x_search_mode: search+nbhd
multifasta_x_search_mode: gather+cdbg
#multifasta_x_search_mode: search+nbhd

ctb, Jan 22 '22 19:01

Codecov Report

Merging #454 (718a8a1) into prot_gather (6a6e017) will increase coverage by 0.10%. The diff coverage is 91.66%.

❗ Current head 718a8a1 differs from pull request most recent head 04c6978. Consider uploading reports for the commit 04c6978 to get more accurate results.

@@               Coverage Diff               @@
##           prot_gather     #454      +/-   ##
===============================================
+ Coverage        82.83%   82.93%   +0.10%     
===============================================
  Files               61       61              
  Lines             5260     5304      +44     
===============================================
+ Hits              4357     4399      +42     
- Misses             903      905       +2     
Impacted Files                                          Coverage Δ
...pacegraphcats/search/index_cdbg_by_multifasta_x.py   93.63% <90.90%> (+0.45%) ⬆
tests/test_dory_workflow.py                             97.03% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6a6e017...04c6978.

codecov-commenter, Jan 22 '22 20:01

So, this seems to be painfully slow for scaled=1, but it works great for scaled=10.
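To make the scaled=1 versus scaled=10 difference concrete: a scaled sketch keeps only hash values below roughly 2^64 / scaled, so scaled=1 keeps every k-mer hash while scaled=10 keeps about a tenth of them. The sketch below illustrates that subsampling on a protein string. It is an illustration only: it uses `hashlib.blake2b` as a stand-in hash (sourmash uses MurmurHash3 and its own protein handling), and the names `kmer_hash` and `frac_minhash` are hypothetical.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmer_hash(kmer):
    """Hash a k-mer to an unsigned 64-bit integer (stand-in hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, ksize, scaled):
    """Keep only the k-mer hashes that fall below MAX_HASH // scaled.

    With scaled=1 the threshold covers the whole hash space, so every
    distinct k-mer hash is retained; larger values keep a smaller
    fraction.
    """
    threshold = MAX_HASH // scaled
    kmers = (seq[i:i + ksize] for i in range(len(seq) - ksize + 1))
    return {h for h in map(kmer_hash, kmers) if h < threshold}
```

Because the thresholds are nested, a sketch at a larger scaled value is always a subset of the sketch at a smaller one, which is why the number of rows to store and search grows as scaled shrinks.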

Thinking on't, this overall approach may also solve some of @bluegenes' concerns over how to do large-scale annotation. I mean, that's actually obvious in hindsight, but it's nice to write one batch of code and come up with solutions for multiple long-standing problems 🤔

ctb, Jan 22 '22 20:01