bacass
kmerfinder update and optimization
Description of feature
I very much appreciate the functionality implemented with kmerfinder (that is, the automatic search for a close genome and running Quast with it). However, I'm running into several issues with the current implementation in bacass:
- The kmerfinder database must be provided as `tar.gz`. This leads to excessive storage usage and the need to unpack the archive on each run of the pipeline (without `-resume`). I suggest allowing the db directory to be passed in unpacked form.
- The only kmerdb I've found to work is exactly the one stated in the `bacass` documentation (dated 2019/01/08, https://zenodo.org/records/10458361/files/20190108_kmerfinder_stable_dirs.tar.gz). However, according to Zenodo, this archive is malformed and an updated version of the db is deposited at Zenodo, which in turn appears not to work with the pipeline. Moreover, this database is quite old; newer versions of `kmerfinder` dbs are deposited at ftp://ftp.cbs.dtu.dk/public/CGE/databases/KmerFinder/version/ where the latest appears to be from 10/2021 (also oldish). An even more recent one, from 2022, is accessible at https://cge.food.dtu.dk/services/KmerFinder/ (haven't tested it yet, 63GB download).
- The need to provide `--ncbi_assembly_metadata` (which is updated by NCBI) leads to inconsistencies between the metadata and the kmerfinder db when an assembly is made obsolete (check the Venn diagram of the database and the current NCBI RefSeq assembly metadata). I can see that it would be problematic to have a 100% 1:1 correspondence, as the updates to NCBI are frequent, but currently the pipeline fails when the best-match assembly is not present in the metadata (I've encountered this with my data and that's why I've started digging around). Besides updating the database, I have a few ideas on how to obtain the assembly without needing to refer to the metadata file:
  a) In the Zenodo db, `bacteria.name` contains the complete assembly id (col 3), which can be used to construct the download link directly. (This will fix some cases, as `suppressed` records are still available, albeit not present in the metadata table.)
  b) In newer kmerfinder dbs there is `bacteria.tax`, which contains the `assembly id` (it can also be extracted from `bacteria.name` col 3). This id can be used with:
    i) `NCBI datasets` (online API docs https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/rest-api/#get-/genome/accession/-accessions-/download)
    ii) the nf-core module https://nf-co.re/modules/ncbigenomedownload/ (no experience here)
    iii) rsync/wget/aspera download of the GCF/xxx/yyy/zzz directory
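To illustrate the first point (accepting the database either as a `tar.gz` or as an already unpacked directory), here is a rough Python sketch of the logic I have in mind. The helper name `resolve_kmerfinder_db`, the `cache_dir` argument and the `.unpacked` marker file are my own inventions for illustration; none of this exists in bacass, and in the pipeline the same check would probably live in the Nextflow workflow instead.

```python
import tarfile
from pathlib import Path


def resolve_kmerfinder_db(db_path, cache_dir):
    """Return a directory holding the unpacked database.

    Hypothetical helper (not part of bacass): a directory is used as-is,
    a tar archive is unpacked into cache_dir only once.
    """
    db_path = Path(db_path)
    if db_path.is_dir():
        # Already unpacked: nothing to copy or extract.
        return db_path
    if not tarfile.is_tarfile(db_path):
        raise ValueError(f"{db_path} is neither a directory nor a tar archive")

    # "20190108_kmerfinder_stable_dirs.tar.gz" -> "20190108_kmerfinder_stable_dirs"
    target = Path(cache_dir) / db_path.name.split(".")[0]
    marker = target / ".unpacked"  # assumed marker: written after a full extraction
    if not marker.exists():
        target.mkdir(parents=True, exist_ok=True)
        # Use the safer "data" extraction filter where the interpreter supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(db_path) as archive:
            archive.extractall(target, **kwargs)
        marker.touch()
    return target
```

With something like this, repeated runs would reuse the extracted copy instead of unpacking the archive every time, and users with an unpacked db would not need to keep the archive at all.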
I'm also wondering if similar functionality could be implemented with kraken2 (and its database), so one could have a single (possibly larger) database and use it both for contamination screening and for identifying the most similar genome...
I do not have experience in writing `nextflow` pipelines, but I'm willing to write some Python scripts, e.g. for interacting with the NCBI API.
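As a starting point, this is a minimal sketch of how download locations could be derived from an assembly id alone, without the metadata file. It only builds URLs and does not download anything. The function names are hypothetical, and I'm assuming that the id taken from `bacteria.name` col 3 starts with a versioned accession such as `GCF_000005845.2`. The Datasets host and the `include_annotation_type` query parameter are taken from my reading of the NCBI Datasets v2 REST docs linked above and should be double-checked against them.

```python
import re

# Versioned assembly accession, optionally followed by "_<assembly name>".
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)")

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"
DATASETS_ROOT = "https://api.ncbi.nlm.nih.gov/datasets/v2"  # assumed base URL


def parse_accession(assembly_id):
    """Return (prefix, nine_digits, accession) for an id such as
    'GCF_000005845.2' or 'GCF_000005845.2_ASM584v2'."""
    match = ACCESSION_RE.match(assembly_id.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {assembly_id!r}")
    return match.group(1), match.group(2), match.group(0)


def ftp_parent_dir(assembly_id):
    """URL of the GCF/xxx/yyy/zzz directory that holds the assembly folder."""
    prefix, digits, _ = parse_accession(assembly_id)
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    return f"{FTP_ROOT}/{prefix}/{'/'.join(chunks)}/"


def datasets_download_url(assembly_id):
    """URL of the Datasets download endpoint for the genome FASTA (zip)."""
    _, _, accession = parse_accession(assembly_id)
    return (f"{DATASETS_ROOT}/genome/accession/{accession}/download"
            "?include_annotation_type=GENOME_FASTA")
```

For example, `ftp_parent_dir("GCF_000005845.2_ASM584v2")` gives `https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/`, which can then be listed or fetched with rsync/wget to get the actual assembly folder.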