Error filterAndTrim applied on MiSeq reads

Open emankhalaf opened this issue 2 years ago • 10 comments

@benjjneb Hi,

I used to manipulate MiSeq reads on Qiime2 where I can use DADA2 plugins along with other q2-plugins easily. I need to export the sequence data as fasta file collapsed at specific taxonomic levels and assigned to new names (ASV1,2,3,...), so I can annotate the tree using other software. I did that with PacBio reads (generated from DADA2 in R) successfully, however, it did not work when applied on phyloseq object generated from biom files (from Qiime2) since I got an error from DNAStringSet dna <- Biostrings::DNAStringSet(taxa_names(ps.gen)) I got this error Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0), : key 51 (char '3') not in lookup table

However, I checked the sequences and did not find any numbers across sequences. The error is common with other users on bioconductor/github forum, but I did not find a solution. Probably, the feature IDs interfere here, I do not know!!!

So, I decided to use DADA2 in R for MiSeq and I am new to the workflow for Illumina reads My sequence files are fastq.gz, so I am following the workflow posted here: https://benjjneb.github.io/dada2/bigdata_paired.html

From the tutorial, I need to assign the path for the forward and reverse sequences, however, both exist in the same folder #filter

File parsing

pathF <- "/home/DADA2/sequences" # CHANGE ME to the directory containing your demultiplexed forward-read fastqs pathR <- "/home/DADA2/sequences" # CHANGE ME ... filtpathF <- file.path(pathF, "filteredf") # Filtered forward files go into the pathF/filtered/ subdirectory filtpathR <- file.path(pathR, "filteredr") # ... fastqFs <- sort(list.files(pathF, pattern="fastq.gz")) fastqRs <- sort(list.files(pathR, pattern="fastq.gz")) if(length(fastqFs) != length(fastqRs)) stop("Forward and reverse files do not match.")

Then, I checked the quality scores for forward sequences which was the same when use fastqRs since both are in the same folder

plotQualityProfile(fastqFs[1:2])

I have a good quality for F and R sequences, and when I applied filterAndTrim

out <- filterAndTrim(fwd=file.path(pathF, fastqFs), filt=file.path(filtpathF, fastqFs), rev=file.path(pathR, fastqRs), filt.rev=file.path(filtpathR, fastqRs), truncLen=c(240,240), maxEE=2, truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=TRUE)

and I got this error: Error in filterAndTrim(fwd = file.path(pathF, fastqFs), filt = file.path(filtpathF, : These are the errors (up to 5) encountered in individual cores... Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, : The output and input file names must be different. Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, : The output and input file names must be different. Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, : The output and input file names must be different. Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, : The output and input file names must be different. Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, : The output and input file names must be different.

Is it because both forward and reverse sequences have the same path? Your help is much appreciated!

Mar 15 '22 15:03 emankhalaf

@benjjneb I tried the regular workflow for fastq files here https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline and it worked but I want to know is it okay since the files are fastq.gz? Also, I have a few inquiries:

The figures below for the quality of sequences

Based on these graphs, I ran the workflow using different parameters of trimming truncLen=c(220,160), trimLeft= c(19, 20) <- resulted in 1384, after removing organelles 1245 taxa . av.sequence length is 253-257/258, sum(seqtab.nochim)/sum(seqtab) = 0.98 truncLen=c(245,245), trimLeft= c(19, 20) same parameters I used in Qiime2, <- resulted in 1552, after removing organelles 1404 taxa (a little bit different from counts that I got from biom files exported from qiime2). av.sequence length is 253-257/258, sum(seqtab.nochim)/sum(seqtab) = 0.98

Also, I used DETECT_SINGLETON = FALSE So, everytime I change the length, the output is shifted. I also adopted some steps from PacBio to generate the tree and generate the phyloseq object. I just want to make sure that what I did is correct and acceptable since my reads are fastq.gz including primers.

The good news, is that now I can export renamed sequences as a FASTA file, so, it sounds like DNAStringSet fun has a problem with phyloseq object generated from biom files since it has feature-IDs???

Thanks!

Mar 15 '22 20:03 emankhalaf

@benjjneb I ran the same workflow on another set of sequences (another project) since the output sequences that I obtained from qiime2 has a poor taxonomic resolution (majority assigned to the family level since the dominant taxon is Klebsiella). So, I have a couple of questions: 1- Regarding the quality score plot, it shows excellent quality, however, the forward sequence shows a sharp single drop to QS 15 at 50 bp position. However, I ignored this drop and used the same parameters that I used in Qiime: truncLen=c(220,160), trimLeft= c(19, 20). If I trimmed at 50 bp position, the sequences will never overlap, plus I assumed this will be corrected by the algorithm. Any recommendations here?

2- Why the generated taxonomy from DADA2 in R when using these 2 consecutive steps has higher resolution to the genus and species levels

taxa <- assignTaxonomy(seqtab.nochim, "/home/DADA2/silva_nr99_v138.1_wSpecies_train_set.fa.gz", multithread=TRUE)

taxa <- addSpecies(taxa, "/home/DADA2/silva_species_assignment_v138.fa.gz")

Does that mean the trained classifier in Qiime2 is not helpful to identify some microbial communities (e.g., here the community is dominated by Enterobacteriaceae, particularly, Klebsiella)?

Thank you!

Mar 17 '22 16:03 emankhalaf

Hi Eman, It looks like you have made some progress here. In general, it seems as if you are seeing solid results running your data through the DADA2 tutorial workflow, based on your comments:

Based on these graphs, I ran the workflow using different parameters of trimming truncLen=c(220,160), trimLeft= c(19, 20) <- resulted in 1384, after removing organelles 1245 taxa . av.sequence length is 253-257/258, sum(seqtab.nochim)/sum(seqtab) = 0.98 truncLen=c(245,245), trimLeft= c(19, 20) same parameters I used in Qiime2, <- resulted in 1552, after removing organelles 1404 taxa (a little bit different from counts that I got from biom files exported from qiime2). av.sequence length is 253-257/258, sum(seqtab.nochim)/sum(seqtab) = 0.98

Is there any other issue with running the R worfklow that you are having at this point?

it sounds like DNAStringSet fun has a problem with phyloseq object generated from biom files since it has feature-IDs???

QIIME2 renames the ASVs with hash values, instead of the DNA sequences, which DNAStringSet cannot interpret.

1- Regarding the quality score plot, it shows excellent quality, however, the forward sequence shows a sharp single drop to QS 15 at 50 bp position. However, I ignored this drop and used the same parameters that I used in Qiime: truncLen=c(220,160), trimLeft= c(19, 20). If I trimmed at 50 bp position, the sequences will never overlap, plus I assumed this will be corrected by the algorithm. Any recommendations here?

The quality plots you show above look excellent, I don't see a shar drop at 50bp?

2- Why the generated taxonomy from DADA2 in R when using these 2 consecutive steps has higher resolution to the genus and species levels...Does that mean the trained classifier in Qiime2 is not helpful to identify some microbial communities (e.g., here the community is dominated by Enterobacteriaceae, particularly, Klebsiella)?

I can't really say. There are differences in both the method and the reference database between your assignTaxonomy call and whatever you are using inside Q2.

Mar 18 '22 22:03 benjjneb

@benjjneb Many thanks for your recommendations and help!

Is there any other issue with running the R worfklow that you are having at this point?

The workflow is running well, I just want to make sure that using this workflow with my fastq.gz is okay since the tutorial uses fatsq files

QIIME2 renames the ASVs with hash values, instead of the DNA sequences, which DNAStringSet cannot interpret.

That makes sense!

The quality plots you show above look excellent, I don't see a sharp drop at 50bp?

Hereinafter the quality plots of the other project

You will notice a sharp drop at 50 bp position! However, I ignored it and used length 245,245 as above. is it okay?

I can't really say. There are differences in both the method and the reference database between your assignTaxonomy call and whatever you are using inside Q2.

In Qiime, I trained the calssifier of SILVA 138 release on V4 region as posted here https://forum.qiime2.org/t/processing-filtering-and-evaluating-reference-sequence-data-with-rescript/15494, then use this qiime feature-classifier classify-sklearn
--i-classifier silva-138-ssu-nr99-515f-806r-classifier.qza
--i-reads rep-seqs.qza
--o-classification taxonomy.qza

So, in both cases, I am using the same SILVA138 release database, but the method of assigning taxonomy is different. It really makes a big difference in the resolution of the taxonomic identification.

One more thing to add that might be irrelevant to DADA2 workflow. In the last project when I renamed sequences to match the sampleIDs in the metadata file using this:

sample.names1 <- sapply(strsplit(basename(fnFs), "_"), [, 1) sample.names1

Then, sample.names2 <- str_replace_all(sample.names1, "Oxxx-Hxxx-", "M") sample.names2

To produce this [1] "M1" "M10" "M11" "M12" "M13" "M14" "M15" "M16" "M17" "M18" "M19" "M2" "M20" "M21" "M22" "M23" "M24" [18] "M3" "M4" "M5" "M6" "M7" "M8" "M9"

When generating phyloseq object at the end of the workflow:

samdf <- read.csv("/home/Documents/DADA2/metadata.csv", header=TRUE) all(rownames(seqtab.nochim) %in% samdf$SampleID) # TRUE

FALSE

Any recommendations or expectations for the reasons behind this error?

Again, much thanks!

Mar 19 '22 15:03 emankhalaf

You will notice a sharp drop at 50 bp position! However, I ignored it and used length 245,245 as above. is it okay?

Yes, I wouldn't worry about that drop. There can be coherent dips in average quality scores during Illumina runs, it doesn't mean the data is bad.

Any recommendations or expectations for the reasons behind this error?

Many possibilities. There could be an extra sample in the sequencing reads that wasn't in the metadata (e.g. a control sample), or the reformatting of the names could have introduced discrepancies between them. Just have to inspect the sample names from each source and identify potential causes that way.

Mar 22 '22 20:03 benjjneb

@benjjneb

Many thanks for your recommendations. In case the reformatting of the names has introduced discrepancies between samples, how could I solve this problem. Is it acceptable to rename the sequence files manually, especially they are not too many, or is there other solutions by coding?

I also have a set of sequence files and I need to rename them. First, I did a string split to pick up the names of the sequences as submitted to the genomic facility

sample.names1 <- sapply(strsplit(basename(fnFs), "_"), [, 2) sample.names2 <- sapply(strsplit(sample.names1, "\\."), function(x1) paste(x1[4])) sample.names2

This returns: [1] "A-426-8-2cobs-Oct-15-19" "A-121-5-Sept-5-19" "A-112-22-3cobs-Oct-10-19" [4] "A-319-4-3cobs-Oct11-19" "A-225-10-Sept-20-19" "A-201-25-Sept-24-19" [7] "A-306-12-Sept-3-19" "A-123-8-1cob-Oct-15-19" "A-322-5-Sept-5-19" .....

Then to rename this list of names with SampleIDs in the metadata file which are W1, W2, W3,.... upto W83 (however, I have a missing value in this list so, I created a list for the new names and did not use a pattern). To do the replacement I used the following code:

sample.names3 <- str_replace_all(sample.names2, c("A-426-8-2cobs-Oct-15-19","A-123-8-1cob-Oct-15-19", "A-421-19-2cobs-Aug-23-19", "A-303-19-3cobs-Aug23-19", "A-205-19-2cobs-Aug-23-19",.....), c("W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8", "W9",.......)

But this returns renaming for the first sample name only not the entire list. Please see below:

[1] "W1" "A-121-5-Sept-5-19" "A-112-22-3cobs-Oct-10-19" [4] "A-319-4-3cobs-Oct11-19" "A-225-10-Sept-20-19" "A-201-25-Sept-24-19"

So, how can I rename all the samples using the same order in both lists (list to be renamed, and list of new names)?

Thanks again!

Mar 29 '22 16:03 emankhalaf

In this case, I would highly recommend adding the alternative names to your sample metadata file. For example, if there is a spreadsheet where you are keeping your sample metadata information, add a new column that has either the fastq files names, or the prefix of those filenames, in that document. Then, read that into R and use the data.frame to convert between the fastq names and the sample IDs.

Encoding that conversion by hand in the R code is more likely to lead to errors.

Mar 30 '22 14:03 benjjneb

@benjjneb Much thanks for your suggestion! I will try this and let you know. I have another question relevant to Phyloseq object manipulation. I already extracted sequences from the object at different taxonomic levels for further analyses successfully. However, when I extracted the genera sequences at a specific prevalence threshold (e.g., 50% of samples), an error pops up and I could not proceed. So I did as follows:

agglomerate phyloseq object at genus level

ps_genus <- tax_glom(ps_workingc, taxrank = "Genus")

calculate genera that exist in 50% of samples

genera.prev50 <- core_members(ps_genus, detection = 0, prevalence = 50/100) print(genera.prev50)

I got 17 sequences

#Use short names for ASVs

dna1 <- Biostrings::DNAStringSet(taxa_names(genera.prev50)) names(dna1) <- taxa_names(genera.prev50) genera.prev50 <- merge_phyloseq(genera.prev50, dna1) taxa_names(genera.prev50) <- paste0("ASV", seq(ntaxa(genera.prev50))) genera.prev50

The error below generated at taxa_names(genera.prev50) <- paste0("ASV", seq(ntaxa(genera.prev50))) line

"Error in S4Vectors:::normarg_names(value, class(x), length(x)) : attempt to set too many names (2) on GroupedIRanges object of length 0"

I do not know what is wrong here?

One more thing to ask about:
If I calculated the relative abundance at genus level for each sample/replicate, is it acceptable to calculate the avaerage of relative abundances of each genus across replicates of the same group. For example rep1 rep2 rep3 aver
Enterobacter 0.25 0.2 0.1 0.183
Lactococcus 0.75 0.8 0.9 0.816

Knowing that these replicates represent a specific group or genotype, then take these averages to generate a heatmap to compare the microbial community across genotypes. The idea is that the replicates are not consistent in their microbial composition and we do not know whether it is a technical or a biological problem. However, we have some genotypes that are represented by a single sample where their taxa's relative abundances will be used as is without taking the average. I am not sure if this is correct or does not make sense in a microbiome study since the data are not normally distributed.

Any help, please?

Mar 30 '22 15:03 emankhalaf

@emankhalaf I'm happy to provide support for the DADA2 package, and to some degree input/output from the package to others, but we don't provide support for analyses completely outside the DADA2 package.

Apr 06 '22 00:04 benjjneb

@benjjneb Much thanks for your support and help!

Apr 06 '22 15:04 emankhalaf

dada2 dada2 copied to clipboard

Error filterAndTrim applied on MiSeq reads

File parsing

dada2
dada2 copied to clipboard