ampliseq
Add GBIF adapted CO1 database
Description of feature
@johnne is working on making a version of the CO1 database better adapted to the GBIF backbone taxonomy. When that is available, make sure it's selectable in Ampliseq.
The database is up now at https://scilifelab.figshare.com/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192
@erikrikarddaniel
How was it now with the taxonomic ranks used in ampliseq? I see the finished assignTaxonomy.fna file has e.g. Bacteria;Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus as the header, so it's using the taxlevels "Domain", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species". The custom coidb has "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species" as taxlevels. So should we aim to create an assignTaxonomy.fna file that duplicates the Kingdom value into Domain, or instead use the database with the --dada_assign_taxlevels option that I see in the documentation?
Cheers, John
I think it would be best to add a taxlevels entry in the conf/ref_databases.config; see e.g. the midori2-co1 entry.
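For reference, such an entry might look roughly like this (a hypothetical sketch loosely modeled on the general pattern of existing entries; the keys shown, the URL placeholder, and the script name are illustrative, not the actual pipeline code):

```groovy
// Hypothetical sketch of a coidb entry in conf/ref_databases.config.
// The file URL and fmtscript name are placeholders, not real values.
params {
    dada_ref_databases {
        'coidb' {
            title     = "COIDB - COI reference sequences from BOLD DB"
            file      = [ "<figshare download URL>" ]
            // Seven ranks, no Domain, so DADA2 gets matching taxlevels:
            taxlevels = "Kingdom,Phylum,Class,Order,Family,Genus,Species"
            fmtscript = "taxref_reformat_coidb.sh"
        }
    }
}
```

With a taxlevels string like that, there should be no need to duplicate Kingdom into Domain in the fasta headers.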
@erikrikarddaniel @jtangrot
I'm a little unsure on how to use the fmtscript part of the database entries in the workflow. Does it have to be a shell script, or can it be a python script in the bin/ folder? And do you have suggestion for how to test if/how such a script works within the workflow?
It can, AFAIK, be any kind of script, but it will be executed inside the current container, so the interpreter needs to be present there and findable via whatever you have in the #! line.
The best test is to add a test for the database you're adding: add a config file, include it in nextflow.config, and add it to .github/workflows/ci.yml.
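Concretely, that could look something like this (hypothetical file and profile names, following the pattern of the pipeline's existing test profiles; parameter names are illustrative):

```groovy
// Hypothetical conf/test_coidb.config
params {
    config_profile_name        = 'Test coidb profile'
    config_profile_description = 'Minimal test of the coidb reference taxonomy'
    dada_ref_taxonomy          = 'coidb'
    // ...plus the usual small test input and resource limits
}

// and in nextflow.config:
profiles {
    test_coidb { includeConfig 'conf/test_coidb.config' }
}
```

The new profile name would then also be added to the test matrix in .github/workflows/ci.yml so it runs in CI.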
I wasn't able to use a python script because python is not included in the ubuntu:20.04 image used for the FORMAT_TAXONOMY process.
I suggest adding a container keyword to the ref_databases.config file that allows any container to be used for the formatting script. See this commit
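The idea would be roughly the following (a sketch with illustrative names; see the linked commit for the actual change):

```groovy
// Hypothetical: per-database container override in conf/ref_databases.config,
// so the FORMAT_TAXONOMY process can run a Python fmtscript in a Python image
// instead of the default ubuntu:20.04.
'coidb' {
    fmtscript = "taxref_reformat_coidb.py"
    container = "quay.io/biocontainers/python:3.8.3"
}
```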
If using that container keyword works fine so far, it seems like a valid option to me. However, it would also need to work for conda & singularity, which might complicate things. Alternatively, the ubuntu container isn't set in stone and you could look for another lightweight container that includes both Python and bash. That way one container (per conda, docker & singularity) would serve it all.
I agree with my namesake; I was perhaps leaning towards relying on one image that includes everything, meaning both Python and bash, but also sed. Not working with singularity would of course be a show stopper.
It's probably not too difficult to make it work for singularity as well (it already works for conda); it's more that my Groovy/Nextflow skills are still limited. But having a universal image sounds better. I'll see what I can find. Is there a specific reason you're pulling singularity images from https://depot.galaxyproject.org/singularity/?
@d4straub How lightweight does the container have to be? It has to contain bash, of course, so the slimmest Python containers are out. But I see Nextflow also relies on the ps command, which isn't included in some of the slightly bigger (~110 MB) containers that do contain bash. The python:3.7 container seems to work, but it's 900 MB.
Is there a specific reason you're pulling singularity images from https://depot.galaxyproject.org/singularity/?
I heard (never tried myself) that pulling docker images with singularity is more resource intensive and might be less reproducible than using singularity images directly (which are at https://depot.galaxyproject.org/singularity/).
How lightweight does the container have to be?
No specific threshold, I just wanted to make you aware that some images are really unnecessarily large for the task at hand. But just take what works. I use singularity and the container size varies from 11 MB (vsearch) to 2.5 GB (QIIME2). For example, CUTADAPT_SUMMARY with depot.galaxyproject.org-singularity-python-3.8.3.img is only 91 MB. The processes ASSIGNSH, TRUNCLEN, FILTER_STATS & FORMAT_TAXRESULTS use pandas:1.1.5 for Python scripts, the singularity container being depot.galaxyproject.org-singularity-pandas-1.1.5.img at 97 MB; that choice also seems somewhat arbitrary to me, and we probably just picked a small image there.
Edit: it's also an advantage to use the same container for several processes instead of a different container in each process, because then just one container needs to be downloaded. In practice that might actually have low impact.
It works with quay.io/biocontainers/python:3.8.3 for docker. However, I don't feel confident about getting all checks to pass, see https://github.com/NBISweden/ampliseq/pull/2.
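For what it's worth, a common nf-core pattern for covering docker and singularity with one process definition is a ternary on the container engine, roughly like this (a sketch; the exact condition varies between pipeline versions, and the process body is elided):

```groovy
// Hypothetical process snippet: pick the Galaxy depot image when running
// under singularity, the biocontainers image otherwise (docker/podman).
process FORMAT_TAXONOMY {
    container { workflow.containerEngine == 'singularity'
        ? 'https://depot.galaxyproject.org/singularity/python:3.8.3'
        : 'quay.io/biocontainers/python:3.8.3' }
    // ...
}
```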
In the meantime
- nf-core tools had an update --> nf-core linting fails in your fork (nf-core/ampliseq dev branch had its update and is passing)
- nextflow had an update --> ugly warnings about process names
- nf-core/ampliseq dev branch had updates that you seem not to have pulled yet
but nextflow run nf-core/ampliseq -r dev -profile test,singularity still works locally for me with the newest Nextflow version (despite warnings), and nextflow run nf-core/ampliseq -r dev -profile test_iontorrent,singularity is also fine. Just confirmed. Not sure that helps though...
That seems to be completed by #518 and #534?
I would say so, yes.
Then I'll close it for now.