
Failed to run learnErrors

Open xuan13hao opened this issue 1 year ago • 6 comments

Hi Drs,

I tried to use this software to run learnErrors. It has been running for three or four days with no result. How can I solve this?

```
                        reads.in reads.out
Gr1m1-1_R1_001.fastq.gz 19090787  19085183
Gr1m2-1_R1_001.fastq.gz 24668545  24661012
Gr1m3-1_R1_001.fastq.gz 22130791  22124033
Gr1m4-1_R1_001.fastq.gz 22009228  22002957
Gr1m5-1_R1_001.fastq.gz 22695964  22689470
Gr2m1-1_R1_001.fastq.gz 26100066  26091969
```

```r
set.seed(100)
errF <- learnErrors(filtFs, nbases = 2e8, multithread = TRUE, randomize = TRUE)
```

```
1905560561 total bases in 32002488 reads from 1 samples will be used for learning the error rates.
```

xuan13hao avatar Jul 21 '22 03:07 xuan13hao

These are very large amplicon sequencing samples, about 20M reads per sample. Can you clarify that this is expected? What amplicon are you sequencing? Are primers still on the reads? What environment are you sampling?

benjjneb avatar Jul 21 '22 13:07 benjjneb

Hello Drs

Thanks for your reply. Environment: 64 Intel Xeon CPU cores @ 2.1 GHz, 1 TB memory, and 18 TB disk space.

FastQC: Gr1m1-1_F_filt.fastq.gz FastQC Report.pdf

The project is a VEGF mouse microbiome analysis.

Thanks

xuan13hao avatar Jul 21 '22 15:07 xuan13hao

> These are very large amplicon sequencing samples, about 20M reads per sample. Can you clarify that this is expected? What amplicon are you sequencing? Are primers still on the reads? What environment are you sampling?

Hi Dr. Benjamin Callahan,

Could you take a look at this issue again?

xuan13hao avatar Jul 26 '22 04:07 xuan13hao

Currently on vacation. Will be back next week.

benjjneb avatar Jul 26 '22 20:07 benjjneb

So I think the issue you are running into is simply that the size of your samples is straining the computational complexity of the DADA2 algorithm. Roughly, the running time for processing a single sample is (a bit under) quadratic in the number of sequencing reads, so a 20M-read sample can take on the order of 400x longer than a 1M-read sample. Since your individual samples have 20M+ reads, even a single sample takes a long time to process.

A couple of ways to speed things up:

- Make sure there is no artefactual diversity in the data, e.g. unremoved primers, heterogeneity spacers, or length variation.
- Run learnErrors on a subsetted sample limited to, say, 1M reads.
- Filter more aggressively to remove variation introduced by sequencing errors.
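The subsetting suggestion could be sketched in R roughly as follows. This is illustrative, not from the thread: the filenames are made up, `filtFs` is assumed to be the vector of filtered fastq paths from the user's earlier code, and `ShortRead::FastqSampler` is just one convenient way to draw a random subsample without loading the whole file into memory.

```r
# Sketch (assumed filenames): learn the error model on a ~1M-read
# subsample, then denoise the full samples with that model.
library(ShortRead)
library(dada2)

set.seed(100)

# Draw ~1M random reads from one filtered sample
sampler <- FastqSampler("Gr1m1-1_F_filt.fastq.gz", n = 1e6)
subset_reads <- yield(sampler)
close(sampler)
writeFastq(subset_reads, "Gr1m1-1_F_filt_sub.fastq.gz", compress = TRUE)

# Learn error rates on the subset only
errF <- learnErrors("Gr1m1-1_F_filt_sub.fastq.gz", multithread = TRUE)

# Apply the learned error model to all full-size samples
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
```

Because the error model generalizes across samples from the same sequencing run, learning it on a subsample and then running `dada` on the full data keeps the expensive quadratic step on a much smaller input.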

benjjneb avatar Aug 01 '22 18:08 benjjneb

> So I think the issue you are running into is just the size of your samples is challenging the computational complexity of the DADA2 algorithm. Roughly, running time for processing a single sample will be (a bit under) quadratic in the number of sequencing reads. Since your individual samples have 20M+ reads, even a single sample is taking a lot of time to process.
>
> A couple ways to speed things up: Make sure that there is not artefactual diversity in the data, e.g. unremoved primers, heterogeneity spacers, length variation. Run learnErrors on a subset sample that is limited to say 1M reads. Filter more aggressively to remove variation introduced by sequencing errors.

Hi Dr. Benjamin Callahan,

Really appreciate your patience for this.

xuan13hao avatar Aug 02 '22 02:08 xuan13hao