kmerfinder update and optimization

Open SchwarzMarek opened this issue 5 months ago • 1 comments

Description of feature

I very much appreciate the functionality implemented with kmerfinder (that is, automatically searching for a close genome and running QUAST with it). However, I'm running into several issues with the current implementation in bacass:

  1. The kmerfinder database must be provided as a tar.gz archive. This leads to excessive storage usage and requires unpacking the archive on every run of the pipeline (without -resume). I suggest allowing the database directory to be passed in unpacked form.

  2. The only kmerfinder database I've found to work is exactly the one stated in the bacass documentation (dated 2019/01/08): https://zenodo.org/records/10458361/files/20190108_kmerfinder_stable_dirs.tar.gz. However, according to Zenodo, this archive is malformed, and an updated version of the database is deposited at Zenodo, which, however, appears not to work with the pipeline. Moreover, this database is quite old. Newer versions of the kmerfinder databases are deposited at ftp://ftp.cbs.dtu.dk/public/CGE/databases/KmerFinder/version/; the latest there appears to be from 10/2021 (also oldish). An even more recent one, from 2022, is accessible at https://cge.food.dtu.dk/services/KmerFinder/ (I haven't tested it yet; 63 GB download).

  3. The need to provide --ncbi_assembly_metadata (which is updated by NCBI) leads to inconsistencies between the metadata and the kmerfinder database when an assembly is made obsolete (see the Venn diagram of the database versus the current NCBI RefSeq assembly metadata). I can see that a 100% 1:1 correspondence would be problematic, as updates to NCBI are frequent, but currently the pipeline fails when the best-match assembly is not present in the metadata (I encountered this with my data, which is why I started digging around). Besides updating the database, I have a few ideas on how to obtain the assembly without referring to the metadata file:

     a) In the Zenodo db, bacteria.name contains the complete assembly id (col 3), which can be used to construct the download link directly. (This will fix some cases, as suppressed records are still available although they are not present in the metadata table.)

     b) Newer kmerfinder dbs contain bacteria.tax, which holds the assembly id (it can also be extracted from bacteria.name col 3). This id can be used with:

     i) `NCBI datasets` (online API docs https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/rest-api/#get-/genome/accession/-accessions-/download)
     ii) the nf-core module https://nf-co.re/modules/ncbigenomedownload/ (no experience here)
     iii) rsync/wget/aspera download of the GCF/xxx/yyy/zzz directory
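
To make ideas (a) and (b) more concrete, here is a minimal Python sketch of how the download locations could be derived from an assembly id alone, without the metadata file. It rests on several assumptions that would need checking: that col 3 of bacteria.name holds ids of the form `GCF_000005845.2_ASM584v2` (accession plus assembly name), that the NCBI genomes site keeps its `GCF/xxx/yyy/zzz` directory layout, and that the Datasets v2 endpoint and its `include_annotation_type=GENOME_FASTA` parameter match the API docs linked above. All function names are hypothetical; nothing here is existing bacass code.

```python
import re
from urllib.parse import urlencode

# Assumed base locations; verify against the current NCBI documentation.
GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"
DATASETS_ROOT = "https://api.ncbi.nlm.nih.gov/datasets/v2"

# Full assembly id: prefix, 9 digits, version, assembly name.
# Example: GCF_000005845.2_ASM584v2
_ASSEMBLY_ID = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)_(.+)$")


def _parse(assembly_id: str) -> re.Match:
    """Validate a full assembly id and return the regex match."""
    match = _ASSEMBLY_ID.match(assembly_id.strip())
    if match is None:
        raise ValueError(f"not a full assembly id: {assembly_id!r}")
    return match


def accession_of(assembly_id: str) -> str:
    """Strip the assembly name, keeping e.g. 'GCF_000005845.2'."""
    prefix, digits, version, _name = _parse(assembly_id).groups()
    return f"{prefix}_{digits}.{version}"


def ftp_directory_url(assembly_id: str) -> str:
    """Build the GCF/xxx/yyy/zzz/<assembly id> directory URL (idea a, option iii)."""
    prefix, digits, _version, _name = _parse(assembly_id).groups()
    # The 9 accession digits are split into three groups of three.
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([GENOMES_ROOT, prefix, *triplets, assembly_id.strip()])


def genomic_fasta_url(assembly_id: str) -> str:
    """URL of the gzipped genomic FASTA inside the assembly directory."""
    return f"{ftp_directory_url(assembly_id)}/{assembly_id.strip()}_genomic.fna.gz"


def datasets_download_url(assembly_id: str) -> str:
    """Build an NCBI Datasets download URL (option i); the result is a zip archive."""
    query = urlencode({"include_annotation_type": "GENOME_FASTA"})
    return f"{DATASETS_ROOT}/genome/accession/{accession_of(assembly_id)}/download?{query}"


if __name__ == "__main__":
    example = "GCF_000005845.2_ASM584v2"
    print(genomic_fasta_url(example))
    print(datasets_download_url(example))
```

The sketch only builds URLs and performs no network access, so the actual download (wget, rsync, or an HTTP client) and the handling of suppressed records would still have to be tested against NCBI.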
    

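[Image: Venn diagram of assemblies in the kmerfinder database versus the current NCBI RefSeq assembly metadata]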
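
The overlap shown in the Venn diagram can be recomputed with a few lines of Python, which may help when checking a new database or metadata file before a run. This is a rough sketch under assumptions: bacteria.name is taken to be tab-separated with the assembly id in col 3, and the metadata file is taken to be an NCBI assembly summary table with `#`-prefixed header lines and the accession in the first column. The function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def read_column(lines: Iterable[str], col: int, sep: str = "\t") -> Set[str]:
    """Collect the values of one 0-based column, skipping blank and '#' lines."""
    values = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split(sep)
        if len(fields) > col and fields[col].strip():
            values.add(fields[col].strip())
    return values


def accession_of(assembly_id: str) -> str:
    """Reduce 'GCF_000005845.2_ASM584v2' to the accession 'GCF_000005845.2'."""
    return "_".join(assembly_id.split("_")[:2])


def compare(db_ids: Iterable[str], metadata_accessions: Iterable[str]) -> Dict[str, Set[str]]:
    """Return the three regions of the Venn diagram as sets of accessions."""
    db = {accession_of(i) for i in db_ids}
    meta = set(metadata_accessions)
    return {
        "db_only": db - meta,        # in the kmerfinder db, missing from the metadata
        "both": db & meta,
        "metadata_only": meta - db,  # in the metadata, absent from the kmerfinder db
    }


if __name__ == "__main__":
    # Tiny in-memory stand-ins for bacteria.name and the metadata file.
    db_lines = [
        "k1\tdesc\tGCF_000005845.2_ASM584v2\n",
        "k2\tdesc\tGCF_000006945.2_ASM694v2\n",
    ]
    meta_lines = [
        "#assembly_accession\tbioproject\n",
        "GCF_000005845.2\tPRJNA000\n",
    ]
    regions = compare(read_column(db_lines, 2), read_column(meta_lines, 0))
    for name, accessions in regions.items():
        print(name, len(accessions))
```

The `db_only` set is the interesting one here: those are the assemblies for which a best match would currently make the pipeline fail.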

I'm also wondering whether similar functionality could be implemented with kraken2 (and its database), so that a single (possibly larger) database could be used for both the contamination screen and the identification of the most similar genome.

I do not have experience writing Nextflow pipelines, but I'm willing to write some Python scripts, e.g. for interacting with the NCBI API.

SchwarzMarek, Sep 19 '24 14:09