
Faster collapseNoMismatch()

masumistadler opened this issue 5 years ago · 29 comments

Hello Benjamin,

I've merged sequence tables of several dada() runs (around 680 samples in total) and I am running collapseNoMismatch() right now, and it takes quite a while (it's been running for at least a day).

Now, I have noticed that this function does not have a multithread = TRUE option as most of your computationally more expensive functions do. I was wondering whether there is a way to enable multithreading. It would be great to speed up the process.

masumistadler avatar Nov 28 '18 14:11 masumistadler

There currently is not, but this could be an enhancement request. We have not really run into cases where this is the rate limiting step before, so haven't made it a priority, but we can consider it in the future.

benjjneb avatar Nov 28 '18 20:11 benjjneb

That would be great for the future. Thank you!

masumistadler avatar Nov 28 '18 20:11 masumistadler

I am using a set of primers that are very 'slippy' so I generate many different lengths of merged reads for the same biological sequence. collapseNoMismatch seems like just the thing to sort this out but it takes a while!

A multithreaded version would be really useful.

leholman avatar May 08 '19 17:05 leholman

For your information, the collapseNoMismatch() step took around 4 days for 630 16S amplicon sequencing samples with approx. 199000 ASVs.

masumistadler avatar May 08 '19 19:05 masumistadler

Thanks for the reports. I've added this issue to the 1.14 milestone, which is our next release in ~6 months.

No promises, but it's at least on the formal agenda now!

benjjneb avatar May 08 '19 19:05 benjjneb

🥺 4 days. Thanks for the info Masumi!

I'll try and benchmark the function while I run it on the various datasets and post the results up here.

leholman avatar May 08 '19 20:05 leholman

I have finished running the function on some datasets. See benchmarks below.

Dataset 1: marine eDNA metabarcoding of COI of ~70 samples: 37674 OTUs - 7476 seconds
Dataset 2: marine eDNA metabarcoding of 18S of ~70 samples: 10260 OTUs - 12029 seconds
Dataset 3: marine eDNA metabarcoding of 16S of ~70 samples: 19228 OTUs - 6210 seconds

Roughly 1% of sequences for the COI dataset were merged, while only 2 were merged for the 18S dataset. I pulled the code for the function and added a counter to monitor progress. The code seems to slow down progressively the further it gets through the dataset.

I also ran the RStudio code profiler on the function (http://rpubs.com/lholman/codeprofile1). Looks like grepl, substr and an if statement use most of the resources. I run into problems like this with my own code so I'm interested in different approaches to speed up these functions.

leholman avatar May 09 '19 14:05 leholman
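The profile above points at per-pair string operations applied across all sequences, which scales roughly quadratically. A toy sketch of one way such comparisons can be bucketed to cut down the pool (this is not the dada2 implementation; the fixed `key_len` and the prefix-only containment check are simplifications, since the real function also handles shifted overlaps via `minOverlap`):

```r
# Toy sketch: bucket sequences by a fixed-length prefix so only sequences
# sharing that prefix are compared pairwise. Handles only exact prefix
# containment, unlike the real collapseNoMismatch().
collapse_prefix_buckets <- function(seqs, key_len = 20) {
  buckets <- split(seqs, substr(seqs, 1, key_len))
  out <- character(0)
  for (b in buckets) {
    b <- b[order(nchar(b), decreasing = TRUE)]  # longest first
    kept <- character(0)
    for (s in b) {
      # drop s if it is a prefix of an already-kept (longer) sequence
      if (!any(startsWith(kept, s))) kept <- c(kept, s)
    }
    out <- c(out, kept)
  }
  out
}
```

With bucketing, each sequence is compared only against the (usually small) set sharing its prefix, instead of against everything.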

Hi, curious if there is now a beta version of the parallelized collapseNoMismatch command? This seems to be a bottleneck when collapsing dozens of sequencing runs.

ptpell77 avatar Aug 17 '20 20:08 ptpell77

> Curious if there is now a beta version of the parallelized collapseNoMismatch command? This seems to be a bottleneck in collapsing dozens of sequencing runs

Nope there is not. But more reports that this is a relevant bottleneck will help push this further up the queue, so thank you.

benjjneb avatar Aug 17 '20 21:08 benjjneb

Hi Benjamin, I agree that this step really slows down the processing of multiple plates. Along with that, I am curious: if processing, say, 60 plates, wouldn't it be better for the collapseNoMismatch step to come at the end of the process? Otherwise, when you combine multiple plates you could reintroduce variants. Say seq A and seq B differ by 1 nucleotide of length at the end of the sequence, and in plate 1 A is more abundant than B, but in plate 2 B is more abundant than A. Then the collapsed representative differs between plates, and when combined you will once again have both A and B, even though they differ only in length without mismatches...?

ARBramucci avatar Aug 24 '20 06:08 ARBramucci

Hi, I always run the collapseNoMismatch() step on the final sequence table that includes all plates; I don't really see the point of doing it on each plate's sequence table separately. To my knowledge, if you collapse ASVs on a per-plate basis, a collapsed ASV from plate A and a collapsed ASV from plate B will only be merged in the combined sequence table if they have the exact same sequence. And they may differ: the function's help page says the sequence kept for a collapsed ASV is that of the most abundant variant. So if the most abundant sequence for a collapsed ASV is the same across all plates, they will be merged when merging sequence tables; if not, they will appear as two separate ASVs. You could run the collapse on a plate-to-plate basis, but I would suggest running another collapse on the final sequence table. Hope this answers your concern.

masumistadler avatar Aug 24 '20 13:08 masumistadler

So masumistadler, in order to do as you suggest above, do you export your ASV tables as csv files, compile all the plates together into one file, and feed it back into dada2 for the collapseNoMismatch step? And you don't run into formatting issues?

ARBramucci avatar Aug 24 '20 23:08 ARBramucci

No, I save all seqtab files from the individual plates as .rds files. I read them in and join them with mergeSequenceTables(), a function provided by dada2 to merge several seqtabs.

Here is some example code:

# read in all sequence tables from plates (there are 24)
merger <- list() # create empty list
# read in data with a loop
for (i in 1:24) {
  merger[[i]] <- readRDS(paste0("./Objects/", i, "_seqtab.rds"))
}

# merge all tables (mergeSequenceTables accepts a list via its `tables` argument)
st.all <- mergeSequenceTables(tables = merger)
saveRDS(st.all, "./Objects/all_seqtab.rds") # save

# collapse ASVs with identical sequences
col.st <- collapseNoMismatch(st.all, minOverlap = 20, verbose = TRUE)
saveRDS(col.st, "./Objects/collapsed_seqtab.rds")

Hope this helps!

masumistadler avatar Aug 24 '20 23:08 masumistadler

Thank you so much!!! I will try it now but that looks like it will work!!

ARBramucci avatar Aug 25 '20 00:08 ARBramucci

Hi! Just a +1 that this is something we are noticing may be a bottleneck in our pipeline. We can report actual compute times if that would be helpful too.

zanieb avatar Sep 17 '20 18:09 zanieb

Quick question about collapseNoMismatch and running multiple plates with dada2. I have 2 plates and they have slightly different optimal truncLengths (optimal defined as the value that lets the highest % of reads make it all the way through the pipeline). Should I optimise them both to use the same truncLengths, or does that not matter if I am running collapseNoMismatch on them after the fact, because any different lengths will be collapsed together? Thanks heaps! Anna

ARBramucci avatar Apr 18 '21 23:04 ARBramucci

@ARBramucci Is this paired-end data? If so, the merged reads should cover the same genetic region (i.e. spanning from primer to primer) even if the truncated forward/reverse read lengths were slightly different.

If this is single-end data, I think the best option is to use the same truncation length across the runs.

benjjneb avatar Apr 19 '21 14:04 benjjneb

It is paired-end, 300 bp reads. So yes I agree with you that slightly different lengths should not matter. Thanks a lot for the confirmation! Anna

ARBramucci avatar Apr 19 '21 22:04 ARBramucci

Hi! Is there any update on the multithreading for collapseNoMismatch issue? And/or is there a recommended workaround? I am experiencing an issue where collapseNoMismatch is getting hung up; I'm running my Dada2 workflow Rscript on an AWS instance with 18.04.1-Ubuntu (48 cores & 375G of mem).

I have 34 samples. After chimera removal there are 9,568 sequences remaining. They are from PacBio so they are full-length 16S. Here are the length stats:

Min. 1st Qu. Median Mean 3rd Qu. Max.
1320 1413 1438 1436 1451 1598

I let the collapseNoMismatch step run for ~12 hours and it's still hung up. I've never experienced this length of time before, though I get that this is a LOT of LONG ASVs.

For a workaround, since these are PacBio HiFi reads I was considering just skipping this step. Would you agree? Or since these all should have the same ends following the universal primer removal, could I assume that any sequences that might be mergeable are within a few bps of each other? That way, I could subset the seqtab.nochim by sequence length and do collapseNoMismatch on the subsets.

alexandrasimas avatar Sep 21 '21 14:09 alexandrasimas
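For anyone considering the subset-by-length workaround described above, a minimal sketch of the binning idea (hypothetical, not a dada2 function; the bin width is illustrative, and pairs whose lengths straddle a bin boundary would be missed, so overlapping bins or a final full pass would be needed):

```r
# Hypothetical sketch: split sequence-table columns into length bins so
# collapseNoMismatch can be run on each smaller bin separately. Pairs
# spanning a bin boundary are missed; overlap bins or re-collapse at the end.
bin_cols_by_length <- function(seqtab, width = 10) {
  lens <- nchar(colnames(seqtab))
  split(seq_len(ncol(seqtab)), lens %/% width)  # column indices per bin
}

# Usage idea (assumes a dada2 sequence table `seqtab.nochim`):
# bins <- bin_cols_by_length(seqtab.nochim, width = 10)
# parts <- lapply(bins, function(idx)
#   collapseNoMismatch(seqtab.nochim[, idx, drop = FALSE]))
# seqtab.collapsed <- do.call(mergeSequenceTables, unname(parts))
```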

You shouldn't need to run collapseNoMismatch on normal data, including PacBio HiFi data that has had its primers consistently trimmed.

benjjneb avatar Sep 21 '21 15:09 benjjneb

Ah okay fair point! I'll institute a rule to not run collapseNoMismatch if primer trimming has occurred! Can you confirm my thinking that if I hadn't trimmed primers, PacBio HiFi still would not need collapseNoMismatch? Thanks!

alexandrasimas avatar Sep 21 '21 16:09 alexandrasimas

> if I hadn't trimmed primers, PacBio HiFi still would not need collapseNoMismatch?

If you hadn't trimmed primers, you would have fundamental problems at the denoising step.

The trimming to the inter-primer sequence also alleviates any problems with library preparations that might put in variable length pads, which I don't think is common in PacBio HiFi, but is seen sometimes with Illumina.

benjjneb avatar Sep 21 '21 17:09 benjjneb

Sure that makes sense. Thank you!!

alexandrasimas avatar Sep 21 '21 20:09 alexandrasimas

Hi, any word on speeding up the collapseNoMismatch step with a multithread option? I have been collapsing samples for 3 weeks now and I can't really keep waiting on this step... but I need those 100% identical ASVs merged. Is there an alternative way of doing this outside of dada2 if this is a rate-limiting step? Cheers, Anna

ARBramucci avatar Mar 01 '22 20:03 ARBramucci

@ARBramucci No update other than this isn't a priority and probably won't be addressed in the near future. Note that if you have processed your data in such a way that all sequences are starting and ending at the same position (which is normally the case for amplicon sequencing data) this step is not needed.

benjjneb avatar Mar 09 '22 16:03 benjjneb
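One quick way to sanity-check whether sequences start and end at the same position is to look at the length distribution of the inferred ASVs. A small diagnostic sketch (assumes a dada2-style sequence table whose column names are the sequences):

```r
# Diagnostic sketch: tabulate sequence lengths in a dada2-style sequence
# table (columns named by sequence). A tight length distribution after
# consistent primer trimming suggests collapseNoMismatch is unnecessary.
seq_length_table <- function(seqtab) {
  table(nchar(colnames(seqtab)))
}
```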

Hi, I've also experienced long run times with the collapseNoMismatch function. But, reading the comments here, I'm not sure it's needed in my case. The data come from V4-V5 18S amplicons from 2 Illumina runs.

The code I'm using takes several seqtables (in this case just 2, one per plate, obtained after the dada(), mergePairs() and makeSequenceTable() functions) and merges them:

library(purrr) # for map(); base lapply() would also work
list.df <- map(seqtables, readRDS)
st.all <- mergeSequenceTables(tables = list.df)

Then, I apply the chimera removal step:

seqtab.raw <- removeBimeraDenovo(st.all, method="consensus",multithread=TRUE, verbose = TRUE)

The output is trimmed to keep only ASVs with a particular length:

seqtab <- seqtab.raw[, nchar(colnames(seqtab.raw)) %in% seq(360, 410)]

and the collapse step:

final.seqtab <- collapseNoMismatch(seqtab, minOverlap = 20)

Do you consider it necessary to apply the collapseNoMismatch() function here, or is it irrelevant since the seqtables were merged before?

Thanks!!!

soluna1 avatar Jul 07 '23 06:07 soluna1

@soluna1 If you used the same primer set and the same trimLeft parameters in each run, there should be no need for a collapseNoMismatch step.

That function is mostly intended for dealing with length variation that can crop up between similar but slightly different primer sets, or if different runs were trimmed differently.

benjjneb avatar Jul 07 '23 12:07 benjjneb

Hi, I wrote a script that wraps CD-HIT-EST for the 100% identity clustering step. I don't have rigorous benchmarking results yet, but for ~16,000 ASVs and 56 samples I obtained results identical to collapseNoMismatch in only 30 seconds, which is orders of magnitude faster. Note that it's not thoroughly tested yet, but maybe you could give it a try. The log file also tells you which ASVs were collapsed together, which could help compare the two methods.

Let me know if you have any comments or suggestions!

dsamoht avatar Apr 20 '24 16:04 dsamoht
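For anyone trying the external-clustering route, a rough sketch of the handoff (the file names, FASTA headers, and cd-hit-est flags below are illustrative guesses, not taken from the linked script):

```r
# Hypothetical sketch: write ASV sequences to FASTA so an external
# 100%-identity clustering tool can be run on them. The ASV1, ASV2, ...
# headers are made up for illustration.
write_asv_fasta <- function(seqs, path) {
  writeLines(paste0(">ASV", seq_along(seqs), "\n", seqs), path)
}

# Example invocation (flags illustrative; -c 1.0 requests 100% identity):
# write_asv_fasta(colnames(seqtab), "asvs.fasta")
# system("cd-hit-est -i asvs.fasta -o asvs_nr -c 1.0 -d 0")
```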

Awesome, thanks @dsamoht !

benjjneb avatar Apr 21 '24 19:04 benjjneb