dada2
Failed to run learnErrors
Hi Drs,
I tried to use this software to run learnErrors. It has been running for three or four days with no results. How can I solve this?
                        reads.in   reads.out
Gr1m1-1_R1_001.fastq.gz 19090787   19085183
Gr1m2-1_R1_001.fastq.gz 24668545   24661012
Gr1m3-1_R1_001.fastq.gz 22130791   22124033
Gr1m4-1_R1_001.fastq.gz 22009228   22002957
Gr1m5-1_R1_001.fastq.gz 22695964   22689470
Gr2m1-1_R1_001.fastq.gz 26100066   26091969
set.seed(100)
errF <- learnErrors(filtFs, nbases = 2e8, multithread = TRUE, randomize = TRUE)
1905560561 total bases in 32002488 reads from 1 samples will be used for learning the error rates.
These are very large amplicon sequencing samples, about 20M reads per sample. Can you clarify that this is expected? What amplicon are you sequencing? Are primers still on the reads? What environment are you sampling?
Hello Drs
Thanks for your reply. Environment: 64 Intel Xeon CPU cores @ 2.1 GHz, 1 TB memory, and 18 TB disk space.
FastQC report for Gr1m1-1_F_filt.fastq.gz attached: FastQC Report.pdf
This is a VEGF mouse microbiome analysis.
Thanks
Hi Dr. Benjamin Callahan,
Could you take a look at this issue again?
Currently on vacation. Will be back next week.
So I think the issue you are running into is just that the size of your samples is challenging the computational complexity of the DADA2 algorithm. Roughly, the running time for processing a single sample is (a bit under) quadratic in the number of sequencing reads. Since your individual samples have 20M+ reads, even a single sample takes a long time to process.
A couple of ways to speed things up:
1. Make sure there is no artefactual diversity in the data, e.g. unremoved primers, heterogeneity spacers, or length variation.
2. Run learnErrors on a subsample limited to, say, 1M reads.
3. Filter more aggressively to remove variation introduced by sequencing errors.
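A minimal R sketch of the subsampling suggestion above, assuming the dada2 and ShortRead Bioconductor packages are installed. The file names are placeholders taken from earlier in the thread, and the 1e6-read cutoff is illustrative, not a recommendation from the DADA2 authors:

```r
library(dada2)
library(ShortRead)

set.seed(100)

# Draw ~1M reads at random from one filtered forward-read file
# (file name is a placeholder from this thread).
sampler <- FastqSampler("Gr1m1-1_F_filt.fastq.gz", n = 1e6)
writeFastq(yield(sampler), "Gr1m1-1_F_subsample.fastq.gz")
close(sampler)

# Learn error rates from the subsample only; nbases further caps
# the total number of bases used for fitting the error model.
errF <- learnErrors("Gr1m1-1_F_subsample.fastq.gz",
                    nbases = 1e8, multithread = TRUE, randomize = TRUE)

# Sanity-check the fitted error model against the nominal Q-score model.
plotErrors(errF, nominalQ = TRUE)
```

The learned error rates in errF can then be passed to dada() for the full samples, since the error model depends on the sequencing run rather than on sample size.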
Hi Dr. Benjamin Callahan,
Really appreciate your patience for this.