Adding the new Greengenes2 database for classification
Description of feature
Greengenes2 recently came out. It is a new release of the Greengenes database that has been redesigned from the ground up and is backed by whole genomes, with a focus on harmonizing 16S rRNA and shotgun metagenomic datasets. Its phylogenetic coverage is also much larger than that of past resources such as SILVA, Greengenes, and GTDB. It would be great to add this database as an optional feature for classifying sequences. Usage instructions are linked below. It has a QIIME 2 plugin. Note that the approach to classifying sequences differs between V4 and non-V4 sequences.
Paper: https://www.nature.com/articles/s41587-023-01845-1 How to use it: https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291
Hi there, yes, that is an interesting database indeed. I dislike, however, that it is very much centered on QIIME2 and the V4 region. GTDB also allows harmonizing between 16S and shotgun metagenomics, and it is already available in ampliseq & mag.
Greengenes2 was discussed in https://nfcore.slack.com/archives/CEA7TBJGJ/p1690539708378009 & https://nfcore.slack.com/archives/CEA7TBJGJ/p1678204777328909. Using --skip_dada_taxonomy --classifier http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.nb.qza might do the job (not tested!). Feedback would be appreciated.
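An untested sketch of what such a run could look like (input and output paths are placeholders, and the primers are just the common 515F/806R pair as an example):
nextflow run nf-core/ampliseq \
    -profile singularity \
    --input samplesheet.tsv \
    --FW_primer GTGYCAGCMGCCGCGGTAA \
    --RV_primer GGACTACNVGGGTWTCTAAT \
    --skip_dada_taxonomy \
    --classifier http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.nb.qza \
    --outdir results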
Otherwise preprocessing the database with QIIME2 v2023.7 (that is used in ampliseq v2.7.0) and providing the classifier to the pipeline with --classifier should work currently.
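A rough sketch of that preprocessing (the backbone sequence and taxonomy file names are assumptions following the naming of the release area above and should be verified; primers are again the 515F/806R pair as an example):
# Extract the primer-targeted region from the Greengenes2 backbone sequences
qiime feature-classifier extract-reads \
    --i-sequences 2022.10.backbone.full-length.fna.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --o-reads gg2-extracted-seqs.qza
# Train a naive Bayes classifier on the extracted region
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gg2-extracted-seqs.qza \
    --i-reference-taxonomy 2022.10.backbone.tax.qza \
    --o-classifier gg2-classifier.qza
The resulting gg2-classifier.qza would then be passed to the pipeline via --classifier.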
I am hoping for Greengenes2 to be integrated for DADA2 classification; that should take care of all preprocessing and make the database relatively easy to add here, including an upload to Zenodo, which is much preferred over a university-hosted DB. Greengenes2 was said to be provided "soon-ish" as a DADA2 database on Zenodo, see https://github.com/benjjneb/dada2/issues/1680 and https://github.com/benjjneb/dada2/issues/1829.
Greengenes2 support for QIIME2 is now available in the dev branch and will be in the next release. I won't close this issue though, because there is still no news for DADA2 (or I missed it).
Hi Daniel.
Thank you for the update. I do see greengenes2 on the GitHub page, but it doesn't show up as one of the parameter options on your Nextflow page. Just informing you. I haven't used it yet, but I plan to very soon.
Best regards, Ali
Hi @aimirza ,
it seems that greengenes2 is an option for --qiime_ref_taxonomy, as in https://nf-co.re/ampliseq/2.11.0/parameters/#qiime_ref_taxonomy... Where would you expect "greengenes2" to appear as an option where it doesn't?
My mistake, I was looking at --dada_ref_taxonomy .
How are you using qiime2 to classify ASVs with the greengenes2 database? Are you following the 'How to use it' guidelines from the link you shared or are you using a pre-trained classifier?
How are you using qiime2 to classify ASVs with the greengenes2 database?
The files listed at https://github.com/nf-core/ampliseq/blob/0473e157ac9a7b1d36845ff9f8fa7ac843d3b234/conf/ref_databases.config#L373-L378 are used to extract sequences with the primers and train the classifier.
Wow, extracting reads (QIIME2_EXTRACT) takes a long time. It ran for a day and got cancelled because of the default 1-day limit. I increased the limit and am now waiting. Since it takes so long, it would be nice to have the option to use QIIME2's simple and very quick classification method for V4 regions, which takes the set intersection between the ASVs and the sequences in the database. No training or classifiers needed. The drawback of this approach is that ASVs not found in the database won't be classified, but most ASVs should get classified, they say.
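For reference, the V4 shortcut described in the "How to use it" forum post looks roughly like this (a sketch only; the greengenes2 action names, options, and reference file names are taken from that post and have not been verified here):
# Keep only ASVs that exist in the Greengenes2 reference
qiime greengenes2 filter-features \
    --i-feature-table feature-table.qza \
    --i-reference 2022.10.taxonomy.asv.nwk.qza \
    --o-filtered-feature-table feature-table.gg2.qza
# Read the taxonomy directly off the reference for the retained ASVs
qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
    --i-table feature-table.gg2.qza \
    --o-classification taxonomy.gg2.qza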
QIIME2_EXTRACT has been running for 8h 58m on our HPC.
Yes, I tested it and it takes a long time, see https://github.com/nf-core/ampliseq/pull/666#issuecomment-1888609190. It was implemented that way because it caters to every use case, not just V4. If you want to implement the super quick classification method and open a PR, that would of course be nice.
Changing the time limit doesn't seem to work properly. I supplied new config rules via the -c parameter, such as:
process {
    withName:QIIME2_EXTRACT {
        cpus = 2
        memory = 42.GB
        time = 500.h
    }
}
I also tried the config below, but it still failed after 1 day:
process {
    cpus = 2
    memory = 42.GB
    time = 500.h
}
What about the cpus and memory, are they applied successfully? If yes, check your --max_time setting; maybe another config is overriding it?
I also had set --max_time to 500h.
Below is my sbatch script:
#!/bin/bash -l
#SBATCH --time=3-12:00:00
#SBATCH --nodes=4
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --error=%x-%j.error
#SBATCH --output=%x-%j.out
nextflow run main.nf \
-profile singularity \
-c /home/amirza/projects/def-sponsor01/data_share/ampliseq/gg2.config \
--input_fasta ./results_full2/dada2/ASV_seqs.fasta \
--FW_primer GTGYCAGCMGCCGCGGTAA \
--RV_primer GGACTACNVGGGTWTCTAAT \
--metadata "Metadata_rename_with_batch_info.tsv" \
--outdir ./test_gg2 \
--ignore_empty_input_files \
--ignore_failed_trimming \
--qiime_ref_taxonomy greengenes2 \
--skip_dada_taxonomy \
--skip_qiime_downstream \
--validate_params \
--max_cpus 8 \
--max_memory 84.GB \
--max_time 500h \
--skip_barrnap \
--skip_fastqc \
-resume
I also don't see multiple jobs running at the same time. The only related parameters I see listed in the log file are --max_cpus, --max_memory, and --max_time.
The supplied config file (gg2.config) is:
process {
    withName:QIIME2_EXTRACT {
        cpus = 8
        memory = 12.GB
        time = 500.h
    }
}
N E X T F L O W ~ version 23.04.3 nf-core/ampliseq v2.10.0
I think I got it to work after increasing the number of CPUs, but now I have another problem. Apparently it is running out of space ("[Errno 28] No space left on device") when running QIIME2_PREPTAX:QIIME2_TRAIN, even though I have 3 TB left on my device. Any idea what the issue is?
Hi there, this is going way out of the scope of this issue (adding the gg2 database). Your problems are not related to gg2 but to executing a large job on your HPC. The error about insufficient space is most likely related to your HPC's settings for tmp/scratch data; please contact your sysadmin.
I need to know a couple of things about using the gg2 database. When using the process QIIME2_TRAIN on the gg2 database, which is a resource-heavy job running on 1 CPU, what is the minimum memory it requires? Second, where are the tmp files stored when running QIIME2_TRAIN? The log says "Debug info has been saved to /tmp/qiime2-q2cli-err-cegyux3s.log", but no such file exists in that directory, nor is it in the tmp directory TMPDIR I defined before running the pipeline.
To reduce memory usage, I will add the parameter --p-classify--chunk-size 10000 (default 20000) to the qiime feature-classifier fit-classifier-naive-bayes call in the modules/local/qiime2_train.nf module. I'll let you know if it works.
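For reference, the adjusted training call would look roughly like this (a sketch; the file names are placeholders rather than the actual variables used in qiime2_train.nf):
# Train the classifier with a smaller chunk size to lower peak memory
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gg2-extracted-seqs.qza \
    --i-reference-taxonomy 2022.10.backbone.tax.qza \
    --p-classify--chunk-size 10000 \
    --o-classifier gg2-classifier.qza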
QIIME2_TRAIN is dumping data into the /tmp directory instead of my specified /scratch directory. The issue is likely the /tmp folder rather than CPU memory (which was set to 86.GB). Each compute node has limited local space since they are primarily used for computation; most of the storage is located in our separate directories under /scratch. It's strange, because we've already set the tmp folder for Nextflow to /scratch/group_share/tmp/. I've also set the following tmp directories in the script before running the pipeline:
export TMPDIR="/scratch/path/to/directory/"
export TEMP="/scratch/path/to/directory/"
export TMP="/scratch/path/to/directory/"
export QIIME2_TMPDIR="/scratch/group_share/tmp/amirza/data/"
export JOBLIB_TEMP_FOLDER="/scratch/group_share/tmp/amirza/data/"
export NXF_WORK="/scratch/group_share/nextflow_workdir/amirza"
export NXF_TEMP="/scratch/group_share/tmp/amirza/data/"
export SINGULARITY_TMPDIR="/scratch/group_share/tmp/amirza/"
export NXF_SINGULARITY_CACHEDIR="/scratch/group_share/singularity_imgs/"
export APPTAINER_TMPDIR="/scratch/group_share/tmp/amirza/"
export APPTAINERENV_TMPDIR="/scratch/group_share/tmp/amirza/"
export SINGULARITYENV_TMPDIR="/scratch/group_share/tmp/amirza/"
export SINGULARITY_CACHEDIR="/scratch/group_share/singularity_imgs/"
None of that worked, but... I finally got it to work after weeks of trying, hurray!
To address the issue, I created an additional configuration file with the following adjustments and passed it to the -c parameter:
Binding the /scratch Directory in Singularity:
singularity {
    runOptions = '--bind /scratch/group_share/tmp/:/scratch/group_share/tmp/'
}
This option explicitly binds the /scratch/group_share/tmp/ directory to the same path within the Singularity container. By binding this directory, any temporary files created by the QIIME2 processes/plugins are directed to the larger storage area in /scratch rather than the limited local /tmp directory on the compute nodes.
Setting Environment Variables for Temporary Directories:
process {
    withName: 'QIIME2_TRAIN' {
        scratch = true
        // Set environment variables explicitly
        env.TMPDIR = '/scratch/group_share/tmp/amirza/data/'
        env.TEMP = '/scratch/group_share/tmp/amirza/data/'
        env.TMP = '/scratch/group_share/tmp/amirza/data/'
        env.QIIME2_TMPDIR = '/scratch/group_share/tmp/amirza/data/'
        env.JOBLIB_TEMP_FOLDER = '/scratch/group_share/tmp/amirza/data/'
        env.SINGULARITY_CACHEDIR = '/scratch/group_share/singularity_imgs/'
        env.APPTAINER_CACHEDIR = '/scratch/group_share/singularity_imgs/'
    }
}
These environment variables (TMPDIR, TEMP, TMP, QIIME2_TMPDIR, and JOBLIB_TEMP_FOLDER) might be used by various tools (such as qiime2 plugins) and processes to define where temporary files are stored. By setting these explicitly to /scratch/group_share/tmp/amirza/data/, I redirected the storage of temporary files from the limited /tmp directory to a designated area with sufficient space.
Additionally, setting SINGULARITY_CACHEDIR and APPTAINER_CACHEDIR ensures that the container caching mechanisms also use the allocated /scratch space, avoiding the use of local directories that might be space-constrained.
Would you know which specific changes likely fixed the problem?
With 8 CPUs with 10 GB of memory each, I finished classifying my ASVs in 20 hours.
Thanks for detailing the solution!
Did you figure out whether singularity runOptions was needed in addition to all the TMP and CACHEDIR settings, or just the latter?
Actually, binding Singularity to the specified directory using singularity { runOptions... was essential; specifying TMP and TMPDIR alone was not enough. I discovered that the variables (QIIME2_TMPDIR, JOBLIB_TEMP_FOLDER, etc.) were not defined inside the container by adding a check to the QIIME2_EXTRACT script that printed whether each variable was set.
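For illustration, a check along these lines (not necessarily the exact snippet I used) prints whether each variable is visible inside the container:
# Print each tmp-related variable, or note that it is unset
for v in TMPDIR TEMP TMP QIIME2_TMPDIR JOBLIB_TEMP_FOLDER; do
    echo "$v=${!v:-<unset>}"
done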