ampliseq
Add GBIF adapted CO1 database
Description of feature
@johnne is working on making a version of the CO1 database better adapted to the GBIF backbone taxonomy. When that is available, make sure it's selectable in Ampliseq.
The database is up now at https://scilifelab.figshare.com/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192
@erikrikarddaniel
How was it now with the taxonomic ranks used in ampliseq? I see the finished assignTaxonomy.fna file has e.g. Bacteria;Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus as the header, so it's using the taxlevels "Domain", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species". The custom coidb has "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species" as taxlevels. So should we aim to create an assignTaxonomy.fna file that duplicates the Kingdom value into Domain, or instead use the database with the --dada_assign_taxlevels option that I see in the documentation?
Cheers, John
I think it would be best to add a taxlevels entry in the conf/ref_databases.config; see e.g. the midori2-co1 entry.
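For reference, such an entry might look roughly like this (a hypothetical sketch loosely modeled on the general pattern of existing entries; the keys shown, the URL placeholder, and the script name are illustrative, not the actual pipeline code):

```groovy
// Hypothetical sketch of a coidb entry in conf/ref_databases.config.
// The file URL and fmtscript name are placeholders, not real values.
params {
    dada_ref_databases {
        'coidb' {
            title     = "COIDB - COI reference sequences from BOLD DB"
            file      = [ "<figshare download URL>" ]
            // Seven ranks, no Domain, so DADA2 gets matching taxlevels:
            taxlevels = "Kingdom,Phylum,Class,Order,Family,Genus,Species"
            fmtscript = "taxref_reformat_coidb.sh"
        }
    }
}
```

With a taxlevels string like that, there should be no need to duplicate Kingdom into Domain in the fasta headers.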
@erikrikarddaniel @jtangrot
I'm a little unsure on how to use the fmtscript part of the database entries in the workflow. Does it have to be a shell script, or can it be a python script in the bin/ folder? And do you have suggestion for how to test if/how such a script works within the workflow?
It can, AFAIK, be any kind of script, but it will be executed inside the current container, so the interpreter needs to be present there and findable via whatever you have in the #! line.
The best test is to add a test for the database you're adding: add a config file, include it in nextflow.config, and add it to .github/workflows/ci.yml.
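Concretely, that could look something like this (hypothetical file and profile names, following the pattern of the pipeline's existing test profiles; parameter names are illustrative):

```groovy
// Hypothetical conf/test_coidb.config
params {
    config_profile_name        = 'Test coidb profile'
    config_profile_description = 'Minimal test of the coidb reference taxonomy'
    dada_ref_taxonomy          = 'coidb'
    // ...plus the usual small test input and resource limits
}

// and in nextflow.config:
profiles {
    test_coidb { includeConfig 'conf/test_coidb.config' }
}
```

The new profile name would then also be added to the test matrix in .github/workflows/ci.yml so it runs in CI.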
I wasn't able to use a python script because python is not included in the ubuntu:20.04 image used for the FORMAT_TAXONOMY process.
I suggest adding a container keyword to the ref_databases.config file that allows any container to be used for the formatting script. See this commit
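The idea would be roughly the following (a sketch with illustrative names; see the linked commit for the actual change):

```groovy
// Hypothetical: per-database container override in conf/ref_databases.config,
// so the FORMAT_TAXONOMY process can run a Python fmtscript in a Python image
// instead of the default ubuntu:20.04.
'coidb' {
    fmtscript = "taxref_reformat_coidb.py"
    container = "quay.io/biocontainers/python:3.8.3"
}
```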
If using that container keyword works fine so far, it seems like a valid option to me. However, it would also need to work for conda & singularity, which might complicate things. Alternatively, the ubuntu container isn't set in stone and you could look for another lightweight container that includes both Python and bash. That way one container (per conda, docker & singularity) would serve it all.
I agree with my namesake; I was perhaps leaning towards relying on one image that includes everything, meaning both Python and bash, but also sed. Not working with singularity would of course be a show stopper.
It's probably not too difficult to make it work for singularity as well (it already works for conda); it's more that my Groovy/Nextflow skills are still limited. But having a universal image sounds better. I'll see what I can find. Is there a specific reason you're pulling singularity images from https://depot.galaxyproject.org/singularity/?
@d4straub How lightweight does the container have to be? It has to contain bash, of course, so the slimmest Python containers are out. But I see Nextflow also relies on the ps command, which isn't included in some of the slightly bigger (~110 MB) containers that do contain bash. The python:3.7 container seems to work, but it's 900 MB.
Is there a specific reason you're pulling singularity images from https://depot.galaxyproject.org/singularity/?
I heard (never tried myself) that pulling docker images with singularity is more resource intensive and might be less reproducible than using singularity images directly (which are at https://depot.galaxyproject.org/singularity/).
How lightweight does the container have to be?
No specific threshold, I just wanted to make you aware that some images are really unnecessarily large for the task at hand. But just take what works. I use singularity and the container size varies from 11 MB (vsearch) to 2.5 GB (QIIME2). For example, CUTADAPT_SUMMARY with depot.galaxyproject.org-singularity-python-3.8.3.img is only 91 MB. The processes ASSIGNSH, TRUNCLEN, FILTER_STATS & FORMAT_TAXRESULTS use pandas:1.1.5 for Python scripts, the singularity container being depot.galaxyproject.org-singularity-pandas-1.1.5.img at 97 MB; that choice also seems somewhat arbitrary to me, and we probably just picked a small image there.
Edit: it's also an advantage to use the same container for several processes instead of a different container in each process, because then just one container needs to be downloaded. In practice that might actually have low impact.
It works with quay.io/biocontainers/python:3.8.3 for docker. However, I don't feel confident about getting all checks to pass, see https://github.com/NBISweden/ampliseq/pull/2.
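For what it's worth, a common nf-core pattern for covering docker and singularity with one process definition is a ternary on the container engine, roughly like this (a sketch; the exact condition varies between pipeline versions, and the process body is elided):

```groovy
// Hypothetical process snippet: pick the Galaxy depot image when running
// under singularity, the biocontainers image otherwise (docker/podman).
process FORMAT_TAXONOMY {
    container { workflow.containerEngine == 'singularity'
        ? 'https://depot.galaxyproject.org/singularity/python:3.8.3'
        : 'quay.io/biocontainers/python:3.8.3' }
    // ...
}
```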
In the meantime
- nf-core tools had an update --> nf-core linting fails in your fork (nf-core/ampliseq dev branch had its update and is passing)
- nextflow had an update --> ugly warnings about process names
- nf-core/ampliseq dev branch had updates that you seem not to have pulled yet
but nextflow run nf-core/ampliseq -r dev -profile test,singularity still works locally for me with the newest Nextflow version (despite warnings), and nextflow run nf-core/ampliseq -r dev -profile test_iontorrent,singularity is also fine. Just confirmed. Not sure that helps though...
That seems to be completed by #518 and #534?
I would say so, yes.
Then I'll close it for now.