
Concatenate improvement suggestion

Open • jgrzadziel opened this issue on Sep 20, 2018 • 8 comments

Recently I was comparing results from an ITS analysis, using three possible ways of handling read merging:

  1. using only forward reads (F)
  2. using paired-end reads (PE)
  3. using concatenate function (C)

The concatenated reads yielded a much larger number of identified taxa and prevented a huge loss of reads: only ~45-60% of reads were retained using the PE approach, versus over 90% using C (in both cases after quality control and chimera removal). This is due to a well-known problem with ITS, especially ITS-1: its length is highly variable and sometimes exceeds what paired-end reads can span.

However, when the forward and reverse reads overlap almost perfectly, the "concatenate" function is unnecessary or even undesirable.

So my suggestion is to make a hybrid function, for example:

  1. In the first stage, try to merge the paired-end reads (e.g. requiring a minimum 12 nt overlap)
  2. If there is no overlap of at least 12 nt, just concatenate (see the sketch below)

I suppose it would improve the resulting sequence table and the taxonomic classification.
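
For illustration, here is a minimal, untested sketch of what such a hybrid could look like for a single sample with dada2's existing mergePairs (returnRejects and justConcatenate are real arguments; the fallback logic, the overlap heuristic, and the object names dadaF/filtF/dadaR/filtR are my assumptions):

library(dada2)

# first pass: normal merging, keeping rejected pairs in the output
merged <- mergePairs(dadaF, filtF, dadaR, filtR,
                     minOverlap = 12, returnRejects = TRUE)
# second pass: concatenate every pair with a 10-N spacer
concat <- mergePairs(dadaF, filtF, dadaR, filtR,
                     justConcatenate = TRUE)

# heuristic (assumes nmatch/nmismatch are filled for rejected rows):
# pairs whose best overlap is shorter than minOverlap are treated as
# true non-overlaps and concatenated; pairs with a long but mismatched
# overlap stay rejected, since they likely failed for other reasons
no_overlap <- !merged$accept & (merged$nmatch + merged$nmismatch < 12)

# swap in the concatenated sequences, matching rows between the two
# mergePairs results by their forward/reverse sequence indices
key <- function(df) paste(df$forward, df$reverse)
idx <- match(key(merged)[no_overlap], key(concat))
hybrid <- merged
hybrid$sequence[no_overlap] <- concat$sequence[idx]
hybrid$accept[no_overlap] <- TRUE
hybrid <- hybrid[hybrid$accept, ]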

What do you think? What are your opinions about concatenating reads in general?

Best wishes

jgrzadziel avatar Sep 20 '18 13:09 jgrzadziel

Was just talking about this with @nagasridhar a couple days ago actually.

It's not totally trivial due to some technical issues with differentiating between pairs that truly don't overlap and pairs with mismatched overlaps, but I think it would be the best choice in some situations.

benjjneb avatar Sep 20 '18 16:09 benjjneb

@benjjneb I do have some code that does this, implemented as a workaround in our workflow code (specifically for ITS and other variable length sequences). If I could get it ready and tested, would you want something like this as a pull request?

cjfields avatar Mar 08 '22 05:03 cjfields

@cjfields Apologies for taking so long to respond here.

With the passage of time, I have become more convinced that using just the forward read instead of trying to merge across sometimes-overlapping loci like ITS is preferable, at least as a general recommendation. Long way of saying, we are no longer looking to add this functionality to the package.

If you have code for this that you'd like to share, though, please consider posting it here. Folks do find these threads, and even if it isn't added to the package we'd potentially be interested in linking to it as an alternative in the ITS workflow.

benjjneb avatar Apr 08 '22 03:04 benjjneb

@benjjneb I can add a link here; it's essentially a modified version of mergePairs: https://github.com/h3abionet/TADA/blob/dev/templates/MergePairs.R#L6 (note this is in a Nextflow template file for generating an R script, so there are some spots with escaped symbols).

The main concern is having something that is in sync with the original code, though it doesn't appear to have changed much at all.

cjfields avatar Apr 09 '22 03:04 cjfields

I do expect to get a request in the future (as we have in the past) for the kind of functionality you have implemented. Is there anything more descriptive than that code that I could link to? It's not necessary, and expert users will do fine with just the code, but even something like help text for the function, with an example function call, can make a big difference for less expert R users.
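
For instance (a hypothetical sketch, not existing code; the function name mergePairsOrConcat and its signature are invented here), even something like this would do:

#' Merge paired reads, concatenating pairs that fail to overlap
#'
#' Behaves like dada2::mergePairs, but read pairs rejected for lack of
#' overlap are joined with a 10-N spacer instead of being discarded.
#'
#' @examples
#' # hypothetical call; arguments mirror those of mergePairs
#' mergers <- mergePairsOrConcat(dadaFs, filtFs, dadaRs, filtRs,
#'                               minOverlap = 12, verbose = TRUE)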

> The main concern is having something that is in sync with the original code, though it doesn't appear to have changed much at all.

At this point our main devel priorities for DADA2 relate to long-read technologies, so I don't expect the paired-end stuff to change much.

benjjneb avatar Apr 11 '22 23:04 benjjneb

> I do expect to get a request in the future (as we have in the past) for the kind of functionality you have implemented. Is there anything more descriptive than that code that I could link to? It's not necessary, and expert users will do fine with just the code, but even something like help text for the function, with an example function call, can make a big difference for less expert R users.

Yes, I could add something simple.

> At this point our main devel priorities for DADA2 relate to long-read technologies, so I don't expect the paired-end stuff to change much.

That's good to know and was my presumption as well; this is the direction most new projects have been going here too: standard 16S, Shoreline, and some ITS and custom amplicons (not to mention Loop). I do think some large-scale analyses will still pick the NovaSeq route, though I hear Illumina has a long-read technology coming along very soon, so maybe that will change things...

cjfields avatar Apr 12 '22 00:04 cjfields

I am also looking for a way to use a dual merging/concatenation approach. I would be reluctant to use just the forward read due to its shorter length and therefore lower taxonomic resolution. I was wondering whether a mapping step (e.g. bbmap, bowtie2, bwa, or similar) on the reads that cannot be merged could help to identify those which should be concatenated (insert size, calculated from the mapping positions, too long for merging) and those that should not be (insert size short enough for merging given the minimum overlap settings, but with too many mismatches for the merge to have succeeded). I will play around with this idea a bit more and share what I find here.

chassenr avatar May 11 '22 07:05 chassenr

Just FYI: the approach of mapping unmerged reads and then concatenating only those whose insert size would be too long for merging seems valid. With the data sets I have tested so far, I was able to 'rescue' some (although not a lot of) ASVs that look like real sequences and not just concatenated junk. Below is the code I used, in case you are interested. I started with bwa as the mapper, but it does not work well for low sequence numbers; BBMap was reasonably fast.

In R:

library(dada2)
library(dplyr)     # provides %>%
library(tidyr)     # separate()
library(ShortRead) # writeFasta(); attaches Biostrings for DNAStringSet()

# settings for trimming and merging
truncLen_r1 <- 230
truncLen_r2 <- 220
minOverlap <- 10

# merge denoised reads, keeping info about unmerged sequences in output
mergers <- mergePairs(
  dadaFs,
  filtFs, 
  dadaRs, 
  filtRs, 
  minOverlap = minOverlap,
  verbose = TRUE,
  returnRejects = TRUE
)

# write unmerged reads to file
# sequence names combine the sample index (position in the mergers list)
# and the row index within that sample's mergers object, which makes it
# easy to locate the concatenation candidates in the mergers object later
unmergedFs <- lapply(
  1:length(mergers),
  function(x) {
    tmp <- dadaFs[[x]]$sequence[mergers[[x]]$forward[!mergers[[x]]$accept]]
    names(tmp) <- paste0(x, "_", which(!mergers[[x]]$accept), "/1")
    return(tmp)
  }
)
unmergedRs <- lapply(
  1:length(mergers),
  function(x) {
    tmp <- dadaRs[[x]]$sequence[mergers[[x]]$reverse[!mergers[[x]]$accept]]
    names(tmp) <- paste0(x, "_", which(!mergers[[x]]$accept), "/2")
    return(tmp)
  }
)
fastaFs <- unlist(unmergedFs)
fastaRs <- unlist(unmergedRs)
# writeFasta() expects an XStringSet, so wrap the named character vectors
writeFasta(DNAStringSet(fastaFs), file = "Fs.fasta")
writeFasta(DNAStringSet(fastaRs), file = "Rs.fasta")

In bash:

# map unmerged read pairs against the reference DB
bbmap.sh ref=SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta in=Fs.fasta in2=Rs.fasta out=unmerged_bbmap.bam threads=50

# keep primary alignments only (-F2304 drops secondary and supplementary),
# restricted to the first read of each properly paired alignment (-f66),
# with at least 50 aligned bases (-m50); extract read name and TLEN
samtools view -F2304 -f66 -m50 unmerged_bbmap.bam | cut -f1,9 > unmerged_is.txt

Back to R:

# get insert sizes from the mapping
# TLEN (column 9) is signed, so take the absolute value; otherwise pairs
# whose first read maps rightmost would be missed
unmerged_is <- read.table("unmerged_is.txt", header = FALSE, sep = "\t", col.names = c("seqID", "insert"))
unmerged_is$insert <- abs(unmerged_is$insert)

# select candidates for concatenation: insert size exceeds the maximum
# length of a mergeable amplicon (here 230 + 220 - 10 = 440 bp)
unmerged_is_long <- unmerged_is[unmerged_is$insert > (truncLen_r1 + truncLen_r2 - minOverlap), ] %>% 
  separate(seqID, into = c("sample_index", "row_index"), sep = "_", remove = FALSE, convert = TRUE)

# retrieve and concatenate sequence
mergers_rescued <- mergers
# for each sample
for(i in 1:length(mergers)) {
  # if there are concatenation candidates in that sample
  if(i %in% unique(unmerged_is_long$sample_index)) {
    # get row index for those candidates
    tmp_index <- unmerged_is_long$row_index[unmerged_is_long$sample_index == i]
    # if there are any
    if(length(tmp_index) > 0) {
      # replace sequence and values for other parameters in mergers object as suggested in:
      # https://github.com/h3abionet/TADA/blob/dev/templates/MergePairs.R#L6
      mergers_rescued[[i]]$sequence[tmp_index] <- paste0(
        fastaFs[paste0(unmerged_is_long$seqID[unmerged_is_long$sample_index == i], "/1")],
        "NNNNNNNNNN", 
        rc(fastaRs[paste0(unmerged_is_long$seqID[unmerged_is_long$sample_index == i], "/2")])
      )
      mergers_rescued[[i]]$nmatch[tmp_index] <- 0
      mergers_rescued[[i]]$nmismatch[tmp_index] <- 0
      mergers_rescued[[i]]$nindel[tmp_index] <- 0
      mergers_rescued[[i]]$prefer[tmp_index] <- NA
      mergers_rescued[[i]]$accept[tmp_index] <- TRUE
      # discard unmerged and non concatenated reads
      mergers_rescued[[i]] <- mergers_rescued[[i]][mergers_rescued[[i]]$accept, ]
    }
  } else {
    # otherwise, discard all unmerged reads
    mergers_rescued[[i]] <- mergers_rescued[[i]][mergers_rescued[[i]]$accept, ]
  }
}

# make sequence table
seqtab_rescued <- makeSequenceTable(mergers_rescued)
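
As a quick sanity check (my addition, not part of the original workflow), the rescued concatenated ASVs should stand out in the sequence length distribution, since they carry the 10-N spacer:

table(nchar(getSequences(seqtab_rescued)))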

chassenr avatar May 18 '22 07:05 chassenr