
Failed to run learnErrors

Open xuan13hao opened this issue 1 year ago • 6 comments

Hi Drs,

I tried to use this software to run learnErrors. It has been running for three or four days with no result. How can I solve this?

```
                        reads.in reads.out
Gr1m1-1_R1_001.fastq.gz 19090787  19085183
Gr1m2-1_R1_001.fastq.gz 24668545  24661012
Gr1m3-1_R1_001.fastq.gz 22130791  22124033
Gr1m4-1_R1_001.fastq.gz 22009228  22002957
Gr1m5-1_R1_001.fastq.gz 22695964  22689470
Gr2m1-1_R1_001.fastq.gz 26100066  26091969
```

```r
set.seed(100)
errF <- learnErrors(filtFs, nbases = 2e8, multithread = TRUE, randomize = TRUE)
```

```
1905560561 total bases in 32002488 reads from 1 samples will be used for learning the error rates.
```

xuan13hao avatar Jul 21 '22 03:07 xuan13hao

These are very large amplicon sequencing samples, about 20M reads per sample. Can you clarify that this is expected? What amplicon are you sequencing? Are primers still on the reads? What environment are you sampling?

benjjneb avatar Jul 21 '22 13:07 benjjneb

Hello Drs

Thanks for your reply. Environment: 64 Intel Xeon CPU cores @ 2.1 GHz, 1 TB memory, and 18 TB disk space.

FastQC: Gr1m1-1_F_filt.fastq.gz FastQC Report.pdf

The project is a VEGF mouse microbiome analysis.

Thanks

xuan13hao avatar Jul 21 '22 15:07 xuan13hao

> These are very large amplicon sequencing samples, about 20M reads per sample. Can you clarify that this is expected? What amplicon are you sequencing? Are primers still on the reads? What environment are you sampling?

Hi Dr. Benjamin Callahan,

Could you take a look at this issue again?

xuan13hao avatar Jul 26 '22 04:07 xuan13hao

Currently on vacation. Will be back next week.

benjjneb avatar Jul 26 '22 20:07 benjjneb

So I think the issue you are running into is simply that the size of your samples is straining the computational complexity of the DADA2 algorithm. Roughly, the running time for processing a single sample is (a bit under) quadratic in the number of sequencing reads, so a 20M-read sample can take on the order of 400x longer than a 1M-read sample. Since your individual samples have 20M+ reads, even a single sample takes a long time to process.

A couple of ways to speed things up:

- Make sure there is no artefactual diversity in the data, e.g. unremoved primers, heterogeneity spacers, or length variation.
- Run learnErrors on a subsetted sample limited to, say, 1M reads.
- Filter more aggressively to remove variation introduced by sequencing errors.
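The subsetting suggestion could be sketched in R roughly as follows. This is illustrative, not from the thread: the filenames are made up, `filtFs` is assumed to be the vector of filtered fastq paths from the user's earlier code, and `ShortRead::FastqSampler` is just one convenient way to draw a random subsample without loading the whole file into memory.

```r
# Sketch (assumed filenames): learn the error model on a ~1M-read
# subsample, then denoise the full samples with that model.
library(ShortRead)
library(dada2)

set.seed(100)

# Draw ~1M random reads from one filtered sample
sampler <- FastqSampler("Gr1m1-1_F_filt.fastq.gz", n = 1e6)
subset_reads <- yield(sampler)
close(sampler)
writeFastq(subset_reads, "Gr1m1-1_F_filt_sub.fastq.gz", compress = TRUE)

# Learn error rates on the subset only
errF <- learnErrors("Gr1m1-1_F_filt_sub.fastq.gz", multithread = TRUE)

# Apply the learned error model to all full-size samples
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
```

Because the error model generalizes across samples from the same sequencing run, learning it on a subsample and then running `dada` on the full data keeps the expensive quadratic step on a much smaller input.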

benjjneb avatar Aug 01 '22 18:08 benjjneb

> So I think the issue you are running into is just the size of your samples is challenging the computational complexity of the DADA2 algorithm. Roughly, running time for processing a single sample will be (a bit under) quadratic in the number of sequencing reads. Since your individual samples have 20M+ reads, even a single sample is taking a lot of time to process.
>
> A couple ways to speed things up: Make sure that there is not artefactual diversity in the data, e.g. unremoved primers, heterogeneity spacers, length variation. Run learnErrors on a subset sample that is limited to say 1M reads. Filter more aggressively to remove variation introduced by sequencing errors.

Hi Dr. Benjamin Callahan,

Really appreciate your patience for this.

xuan13hao avatar Aug 02 '22 02:08 xuan13hao