
learnErrors memory full

Open ColdySnow opened this issue 1 year ago • 5 comments

Hello!

I have tried several times now to run "learnErrors" from DADA2 in R. Loading the package worked fine, as did every step described up to that point. learnErrors runs for about an hour and prints "5485896480 total bases in 22857902 reads from 1 samples will be used for learning the error rates.". A little while later it says "KILLED", and both the process and R are shut down. I found out that this happens because our server does not have enough memory (it has 62.5 GB).

We don't have any more memory/RAM available. So what could we do? Is there any solution you can think of?

I appreciate any kind of help!

Best, Christin - Master's student at the University of Cologne

ColdySnow avatar Sep 07 '22 12:09 ColdySnow

Your samples are very deep for amplicon sequencing (~23M reads). Is this expected?

That sample depth is at the edge of what we've targeted, and it will require substantial memory and running time (both scale super-linearly with single-sample depth).

My best practical suggestion is to enforce more stringent filtering, as that is fairly effective at reducing the number of unique sequences in the data and therefore the memory/time requirements.
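For example, a minimal sketch of more stringent filtering with `filterAndTrim` (the file paths and parameter values here are placeholders; `truncLen` and `maxEE` should be tuned to your own quality profiles):

```r
library(dada2)

# Hypothetical paths; point these at your own raw and filtered files.
out <- filterAndTrim(
  fwd = "raw/sample_R1.fastq.gz",  filt = "filtered/sample_R1.fastq.gz",
  rev = "raw/sample_R2.fastq.gz",  filt.rev = "filtered/sample_R2.fastq.gz",
  truncLen = c(240, 180),  # truncate reads where quality drops off
  maxEE = c(1, 1),         # stricter expected-error cutoff than the default (Inf)
  maxN = 0, truncQ = 2, rm.phix = TRUE,
  compress = TRUE, multithread = TRUE
)
```

Trimming low-quality tails reduces the number of spurious unique sequences, which is what drives the memory and time requirements.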

benjjneb avatar Sep 07 '22 17:09 benjjneb

Ah okay, thank you! I'll try it out.

The thing is that our institution normally uses OTUs. They pool all samples into one run, with every read marked with a sample-specific barcode so it can later be assigned to the sample it came from. That's why we have only one, but therefore very large, amplicon sequencing file.

ColdySnow avatar Sep 08 '22 07:09 ColdySnow

They pool all samples into one run, with every read marked with a sample-specific barcode so it can later be assigned to the sample it came from. That's why we have only one, but therefore very large, amplicon sequencing file.

Your best solution here is to use that barcode to separate the sequences into per-sample fastq files. There are a variety of tools for this sort of "demultiplexing", although their applicability depends on the specific barcoding scheme that was used.
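As a rough illustration only: if the barcodes are in-line at the 5' end of each read, a simple R sketch with the ShortRead package could split the pooled file. The barcode sequences and file names below are hypothetical, barcode/primer trimming is omitted, and for a file this deep a streaming approach (e.g. `ShortRead::FastqStreamer`) would be much easier on memory than loading everything at once; dedicated demultiplexing tools handle all of this for you.

```r
library(ShortRead)

# Hypothetical sample-to-barcode map and input file.
barcodes <- c(sample01 = "ACGTACGT", sample02 = "TGCATGCA")
reads <- readFastq("pooled_reads.fastq.gz")

for (smp in names(barcodes)) {
  bc <- barcodes[[smp]]
  # Keep reads whose first nchar(bc) bases match this sample's barcode exactly.
  hits <- as.character(subseq(sread(reads), 1, nchar(bc))) == bc
  writeFastq(reads[hits], paste0(smp, ".fastq.gz"), compress = TRUE)
}
```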

benjjneb avatar Sep 08 '22 12:09 benjjneb

Hello again, I did what you suggested and demultiplexed our data. That worked, but all of the output files are fastq.gz. Just to be sure: the quality plotting and the error-rate learning only work with fastq files, right? So I have to unzip all the fastq.gz files first, correct?

Thank you for all your support, it really helps me!

ColdySnow avatar Sep 12 '22 06:09 ColdySnow

All functions in the dada2 R package will read gzipped fastq files natively. No need to gunzip them first.
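For example (hypothetical file name), the gzipped output from demultiplexing can be passed straight to the plotting and error-learning functions:

```r
library(dada2)

# .fastq.gz files are read directly; no gunzip step is needed.
plotQualityProfile("filtered/sample01_R1.fastq.gz")
err <- learnErrors("filtered/sample01_R1.fastq.gz", multithread = TRUE)
```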

benjjneb avatar Sep 13 '22 16:09 benjjneb