[WIP] switch protein annotation stuff over to sqlite, for memory
Note: PR into https://github.com/spacegraphcats/spacegraphcats/pull/449
The question at hand: can we do terrible things by (a) shoving sourmash signatures for PFAM into SQLite and then (b) using those signatures to annotate graphs?
The answer seems to be: yes, yes we can.
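For anyone skimming, here is a minimal sketch of the general idea: put sketch hashes in an indexed SQLite table, then look up graph-node hashes against it. The schema, the table names, and the `build_db` / `annotate` helpers are all made up for illustration. This is not the schema sourmash uses and not the code in this PR.

```python
import sqlite3


def build_db(sketches):
    """Store {name: set_of_hashes} in an in-memory SQLite database.

    Hypothetical schema for illustration only. SQLite INTEGER is a signed
    64-bit type, so full-range unsigned 64-bit hashes would need converting
    before insertion; the demo values below are small enough not to matter.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sketches (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE hashes ("
        "hashval INTEGER NOT NULL, "
        "sketch_id INTEGER NOT NULL REFERENCES sketches(id))"
    )
    # The index is what keeps per-hash lookups cheap without loading
    # every sketch into memory.
    conn.execute("CREATE INDEX hashes_by_val ON hashes (hashval)")
    for name, hashvals in sketches.items():
        cur = conn.execute("INSERT INTO sketches (name) VALUES (?)", (name,))
        conn.executemany(
            "INSERT INTO hashes (hashval, sketch_id) VALUES (?, ?)",
            [(h, cur.lastrowid) for h in sorted(hashvals)],
        )
    conn.commit()
    return conn


def annotate(conn, node_hashes):
    """Map each graph node to the sketch names sharing at least one hash."""
    query = (
        "SELECT s.name FROM hashes AS h "
        "JOIN sketches AS s ON s.id = h.sketch_id "
        "WHERE h.hashval = ?"
    )
    annotations = {}
    for node_id, hashvals in node_hashes.items():
        names = set()
        for hashval in hashvals:
            names.update(name for (name,) in conn.execute(query, (hashval,)))
        annotations[node_id] = sorted(names)
    return annotations


# Toy data: two "protein" sketches and three graph nodes.
conn = build_db({"protA": {11, 22, 33}, "protB": {33, 44}})
result = annotate(conn, {1: {11, 99}, 2: {33}, 3: {99}})
print(result)
```

Only the rows matching a queried hash are read, which is the memory argument for doing it this way.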
- [ ] enable picklists
- [ ] clean up config documentation, etc.
- [ ] update this initial comment
- [ ] explore parallelization of PFAM-scale search via picklists and snakemake or parallel or ...
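
One possible shape for the parallelization item, as a rough sketch: split a list of signature identifiers into several single-column CSV files, so that each file could drive its own snakemake or parallel job. The `split_rows` and `write_picklists` helpers, the file naming, and the `md5` column name are assumptions for illustration; how spacegraphcats would consume these files is not settled in this PR.

```python
import csv
import tempfile
from pathlib import Path


def split_rows(rows, n_chunks):
    """Deal rows round-robin into n_chunks lists of near-equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, row in enumerate(rows):
        chunks[i % n_chunks].append(row)
    return chunks


def write_picklists(rows, n_chunks, outdir, column="md5"):
    """Write one single-column CSV per non-empty chunk; return the paths."""
    outdir = Path(outdir)
    paths = []
    for i, chunk in enumerate(split_rows(rows, n_chunks)):
        if not chunk:
            continue
        path = outdir / f"picklist.{i}.csv"
        with open(path, "w", newline="") as fp:
            writer = csv.writer(fp)
            writer.writerow([column])  # header row naming the column
            writer.writerows([value] for value in chunk)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder identifiers, not real md5sums.
    ids = [f"{i:032x}" for i in range(10)]
    paths = write_picklists(ids, 3, tmp)
    sizes = []
    for path in paths:
        with open(path, newline="") as fp:
            sizes.append(sum(1 for _ in csv.reader(fp)) - 1)  # minus header
    print(sizes)
```

Round-robin dealing keeps the chunks within one row of each other in size, which should help balance jobs if per-signature cost is roughly uniform.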
cc @taylorreiter
Commands that are no longer obviously failing:
../2022-sourmash-sqlite/sketch-fasta-to-sqlite.py shew/GCF_000017325.1_ASM1732v1_protein.faa.gz \
shew/GCF_000021665.1_ASM2166v1_protein.faa.gz -o shew-prots.db --scaled=20 -k 10
python -m spacegraphcats run twofoo-sqlite.yaml twofoo_k31_r1_multifasta_x/multifasta_x.cdbg_annot.csv -p
with config file twofoo-sqlite.yaml:
catlas_base: twofoo
input_sequences:
- twofoo.fq.gz
ksize: 31
radius: 1
search:
- data/2.fa.gz
- data/47.fa.gz
- data/63.fa.gz
shadow_ratio_maxsize: 1000
hashval_ksize: 51
hashval_queries: data/twofoo-k51-hashval-queries.txt
multifasta_reference:
- data/twofoo-genes.fa.gz
multifasta_scaled: 1000
multifasta_query_sig: data/63-os223.sig
multifasta_x_reference:
- shew-prots.db
multifasta_x_protein_ksize: 10
multifasta_x_query_is_dna: false
#multifasta_x_search_mode: search+nbhd
multifasta_x_search_mode: gather+cdbg
#multifasta_x_search_mode: search+nbhd
Codecov Report
Merging #454 (718a8a1) into prot_gather (6a6e017) will increase coverage by 0.10%. The diff coverage is 91.66%.
:exclamation: Current head 718a8a1 differs from pull request most recent head 04c6978. Consider uploading reports for the commit 04c6978 to get more accurate results
@@ Coverage Diff @@
## prot_gather #454 +/- ##
===============================================
+ Coverage 82.83% 82.93% +0.10%
===============================================
Files 61 61
Lines 5260 5304 +44
===============================================
+ Hits 4357 4399 +42
- Misses 903 905 +2
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...pacegraphcats/search/index_cdbg_by_multifasta_x.py | 93.63% <90.90%> (+0.45%) | :arrow_up: |
| tests/test_dory_workflow.py | 97.03% <100.00%> (ø) | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6a6e017...04c6978.
So, this seems to be painfully slow for scaled=1, but it works great for scaled=10.
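A rough illustration of why the scaled value matters so much: with scaled sketches, only hashes at or below roughly 1/scaled of the 64-bit hash space are kept, so scaled=1 keeps every k-mer hash and scaled=10 keeps about a tenth of them. The sketch below uses BLAKE2 as a stand-in hash (sourmash itself uses MurmurHash3) and made-up k-mer strings, so the counts only show the proportions, not real numbers from this dataset.

```python
import hashlib

MAX_HASH = 2**64 - 1


def hash_kmer(kmer):
    """Stand-in 64-bit hash; an assumption, not the hash sourmash uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def retained(kmers, scaled):
    """Keep hashes at or below MAX_HASH // scaled (about 1/scaled of them)."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers) if h <= threshold}


# Made-up "k-mers"; only their hashes matter for the illustration.
kmers = [f"kmer{i}" for i in range(100_000)]
counts = {scaled: len(retained(kmers, scaled)) for scaled in (1, 10, 20)}
print(counts)
```

So going from scaled=10 to scaled=1 means roughly ten times as many hashes to store and look up.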
Thinking on't, this overall approach may also solve some of @bluegenes's concerns over how to do large-scale annotation. I mean, that's actually obvious in hindsight, but it's nice to write one batch of code and come up with solutions for multiple long-standing problems 🤔