
High loss of CCS reads in filtering step


Hello!

In my current project, we ran DADA2 on a set of full-length 16S PacBio CCS reads, including the Zymo mock sample. Read tracking showed that I lost ~50% of the reads at the filtering step in these samples. The final results were still fine and showed the expected species, but I was wondering why so many reads are filtered, given that they are HiFi reads from high-quality DNA and should already be of very good quality. Maybe it is because I lack an understanding of how the minQ parameter translates to the PacBio quality scores?

Here is the read tracking:

sampleID    ccs    primers  filtered  denoised  nonchim  retained
zymomock1   42390  34416    23522     21686     19071    0.45
zymomock2   51454  41611    27211     24996     21819    0.42
zymomock3   62698  50921    33364     30719     26396    0.42

I used filterAndTrim(minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2). The read length distribution looked good with the expected peak at ~1450bp.
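
For reference, the full call looked roughly like this; "ccs_noprimers.fastq.gz" and "ccs_filtered.fastq.gz" are placeholder paths for the primer-trimmed input and the filtered output:

```r
# Roughly the filtering call used; file paths are placeholders for the
# primer-trimmed input and the filtered output.
library(dada2)
track <- filterAndTrim("ccs_noprimers.fastq.gz", "ccs_filtered.fastq.gz",
                       minQ = 3, minLen = 1000, maxLen = 1600,
                       maxN = 0, rm.phix = FALSE, maxEE = 2)
track  # reads.in / reads.out per file
```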

Is there a smart way to inspect the discarded reads and find out what caused them to be dropped?

Thank you!

effelsbn avatar Feb 03 '22 15:02 effelsbn

@effelsbn how are you processing the data (e.g. trimming, denoising, running chimera detection, etc.)? If I'm reading this right, it looks like you had some loss at two stages: primer detection/removal and filtering? We've also only used CCS calls based on @benjjneb 's lima recommendations (99.9% consensus); default CCS calls are normally 99%.

cjfields avatar Feb 08 '22 20:02 cjfields

@cjfields Thank you! Indeed, the provided CCS reads were >QV20, so it now makes sense that so many reads were filtered out with my parameters. There is almost no filtering loss if I use QV30 reads instead. You're right, we also lost about 20% of the reads during primer removal; I am not sure whether that is a normal fraction, but it seems to be common.

effelsbn avatar Feb 11 '22 11:02 effelsbn

@effelsbn I'd have to go back and check, but IIRC that stage is pretty stringent (it triages reads if both primers aren't found); our Shoreline run does the same. I agree, it would be nice to have something in filterAndTrim that could optionally save failing reads so they can be evaluated further; I don't think this is in place at the moment. We have also toyed around with trying Cutadapt as a replacement for this step for more flexibility, but just haven't got around to it.
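
In the meantime, one rough workaround is to diff the read IDs between the primer-trimmed input and the filtered output and pull the dropped reads back out for inspection. A minimal sketch with ShortRead, assuming placeholder file names:

```r
# Sketch: recover the reads dropped by filterAndTrim by diffing read IDs.
# "ccs_noprimers.fastq.gz" / "ccs_filtered.fastq.gz" are placeholder paths.
library(ShortRead)

fq_in   <- readFastq("ccs_noprimers.fastq.gz")
fq_kept <- readFastq("ccs_filtered.fastq.gz")

dropped <- fq_in[!(as.character(id(fq_in)) %in% as.character(id(fq_kept)))]
writeFastq(dropped, "ccs_dropped.fastq.gz")

# Quick look at what might have tripped the filter: length and minimum quality
summary(width(dropped))
qmat <- as(quality(dropped), "matrix")       # per-base Phred scores (NA-padded)
summary(apply(qmat, 1, min, na.rm = TRUE))   # minimum quality per read
```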

cjfields avatar Feb 11 '22 17:02 cjfields

@cjfields Yes, I think such an option for filterAndTrim would be great in the future. Should we leave this issue open as a feature request? Anyways, for now, I consider my problem "solved", so thank you very much!

effelsbn avatar Feb 14 '22 09:02 effelsbn

hi @effelsbn

I see that you had your problem solved. I am having a similar problem. Can you please share the parameters you used to resolve the issue? Was it the minQ parameter that was causing the filtering problem?

Thanks!

nhg12 avatar Jan 18 '23 08:01 nhg12

Hi @nhg12

So in our case, we took CCS reads that were exported with a threshold of QV20 (~99% consensus) and then filtered with minQ=3, which explained why such a large fraction of reads was additionally filtered out. When I tried with QV30 reads, the filtered fraction was much lower, but in the end the total number of reads left for analysis was the same, so it doesn't really matter at which point you filter. To my understanding, 99.9% is also quite conservative; you could probably also try minQ=2 if you don't have enough reads, but I am not an expert and might be corrected here (@benjjneb, @cjfields?).
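
For reference, the QV is just the Phred-scaled predicted accuracy, so QV20 corresponds to ~99% and QV30 to ~99.9%; a quick conversion in R:

```r
# Phred-scaled relationship between predicted read accuracy and QV
qv_from_accuracy <- function(acc) -10 * log10(1 - acc)
accuracy_from_qv <- function(qv) 1 - 10^(-qv / 10)

qv_from_accuracy(c(0.99, 0.999))  # 20 30
accuracy_from_qv(c(20, 30))       # 0.990 0.999
```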

effelsbn avatar Jan 18 '23 08:01 effelsbn

We have found much the same and pretty much stick with 99.9% consensus CCS calls, though it would be good to hear @benjjneb 's thoughts; I'm also curious whether DeepConsensus calls will have an impact. I'm also not sure (without additional testing using appropriate mocks) whether there is much benefit, e.g. possibly capturing some strain-level differences (and if that is the goal, I'd argue other approaches are better, e.g. Shoreline's StrainID or whole-metagenome sequencing).

cjfields avatar Jan 18 '23 16:01 cjfields

pretty much stick with 99.9% consensus CCS call

That's what we start with before running the DADA2 workflow. Usually also include a loose maxEE filter and a length window at the filterAndTrim step too.

benjjneb avatar Jan 18 '23 17:01 benjjneb

Hi @benjjneb @effelsbn @cjfields, thanks for your quick replies. I was actually doing the analysis in QIIME2 and will now give it a try in R for more flexibility. I am attaching the results here (I know it is from a different program) just to give an idea of what I am getting. Any thoughts?

The command and parameters I set are as below:

qiime dada2 denoise-ccs \
  --i-demultiplexed-seqs ./samples.qza \
  --o-table dada2-ccs_table.qza \
  --o-representative-sequences dada2-ccs_rep.qza \
  --o-denoising-stats dada2-ccs_stats.qza \
  --p-min-len 1000 \
  --p-max-len 1600 \
  --p-max-ee 2 \
  --p-front AGRGTTYGATYMTGGCTCAG \
  --p-adapter RGYTACCTTGTTACGACTT \
  --p-n-threads 8

[screenshot: QIIME2 DADA2 denoising stats]

Thanks!

nhg12 avatar Jan 19 '23 02:01 nhg12

I tried setting max-ee to 10 and 20, and still 60% of my reads are getting filtered. I don't understand why my sequences are getting filtered when they are supposed to be high quality.

nhg12 avatar Jan 19 '23 12:01 nhg12

Just a quick note: I mainly don't use QIIME2, so that I have better control over the data processing steps.

Simple before/after numbers don't give enough detail to resolve the problem, apart from pointing to 'this is the step where you lose reads'; you need the 'why'. In other words, you need more specific metrics (overall length distribution, overall base content, etc.) prior to the filtering step to better understand why you see this. Does QIIME2 give you access to the sequence data just prior to the filtering stage (after primer removal)?
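
If you can export the primer-trimmed FASTQs from QIIME2, a rough sketch of those metrics in R (the file name is a placeholder):

```r
# Sketch: basic pre-filtering metrics; "ccs_noprimers.fastq.gz" is a placeholder
library(dada2)
library(ShortRead)

plotQualityProfile("ccs_noprimers.fastq.gz")   # per-cycle quality profile

fq <- readFastq("ccs_noprimers.fastq.gz")
hist(width(fq), breaks = 100,
     main = "Read length distribution", xlab = "Length (bp)")
alphabetFrequency(sread(fq), collapse = TRUE)  # overall base content
```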

cjfields avatar Jan 19 '23 15:01 cjfields

No, QIIME2 shows the initial quality of the data and then the final stats after the DADA2 pipeline. The length of my sequences is ~1550 bp after primer removal.


nhg12 avatar Jan 19 '23 23:01 nhg12

Now I ran one sample through the R DADA2 pipeline so I could change some more parameters. This is what I ran:

track <- fastqFilter(nop, filt, minQ=2, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=30, verbose=TRUE)

I got really good numbers when I filtered and am not losing many sequences; however, I am now losing a lot after denoising. Any thoughts there?

       ccs  primers  filtered  denoised
[1,] 25530    23361     23000     10633

dada-class: object describing DADA2 denoising results
82 sequence variants were inferred from 22350 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 32

[attached Rplot: learned error rates]

nhg12 avatar Jan 20 '23 02:01 nhg12

The learned error rates look really high for HiFi sequencing. What chemistry/instrument version are you using? What is the environment/sample-type you are sequencing?

benjjneb avatar Jan 24 '23 14:01 benjjneb

We used a PacBio Sequel II. These are skin swab samples.

nhg12 avatar Jan 25 '23 03:01 nhg12

Huh... I don't know, that should be handled pretty well by learnErrors from my experience.

A couple more things to try: plotComplexity on a couple of sample fastqs after primer removal/filtering, to check for any low-complexity contamination. And try pulling out 20 or so random sequences and BLAST-ing them against nt online. Are they all hitting expected bacterial taxa for the skin?
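
Something like this would do both checks; the fastq path is a placeholder:

```r
# Sketch of both checks; "ccs_filtered.fastq.gz" is a placeholder path
library(dada2)
library(ShortRead)

plotComplexity("ccs_filtered.fastq.gz")   # look for a low-complexity shoulder

fq <- readFastq("ccs_filtered.fastq.gz")
set.seed(1)
sub <- fq[sample(length(fq), 20)]         # 20 random reads
writeFasta(sub, "random20.fasta")         # BLAST this file against nt online
```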

benjjneb avatar Jan 25 '23 15:01 benjjneb

Yes, super odd. As a contrast, here is a recent Sequel IIe run (CCS calls at 99.9%) which worked wonderfully:

[attached image: learned error rates from a Sequel IIe run (99.9% CCS calls)]

I'd have to go back and dig through older runs to find an example, but IIRC this looks a bit like what we had when running DADA2 using CCS reads with a lower consensus cutoff, such as the default (which I believe is 99%). How are you running ccs (e.g. the actual command line call or SMRTLink settings)?

cjfields avatar Jan 25 '23 17:01 cjfields

I am guessing it was done with the SMRTLink settings. I didn't run ccs myself; I got the demultiplexed CCS reads from the sequencing company. I filtered the sequences with minQ=2 because if I run at 3 I lose >60% of my reads. The sequencing run was really good, so I don't know why I am having issues with the analysis. Do you think we need to revise the SMRTLink settings?


nhg12 avatar Jan 25 '23 17:01 nhg12

@nhg12 They likely gave you either:

  • The 99% consensus calls, or
  • An all-inclusive run with all reads that needs to be filtered (and in the latter case they should have supplied a BAM file with the relevant information). Something like this: https://ccs.how/faq/reads-bam.html

It depends on the center; I would definitely contact them and ask. Hopefully you have the BAM file in either case; if you only have the FASTQ, you need to have them provide you with either the BAM (which can be pretty big) or with higher-accuracy FASTQ consensus calls.

If they gave you a PacBio BAM file, there is a tag with the predicted read accuracy (the rq tag) which can be used to filter the data down to the accuracy you need.
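
For instance, something along these lines with Rsamtools would pull the rq tags and collect the names of reads at or above 0.999; "reads.bam" is just a placeholder:

```r
# Sketch: read the rq (predicted accuracy) tag from a PacBio CCS BAM and
# keep the names of reads at >= 0.999 (QV30). "reads.bam" is a placeholder.
library(Rsamtools)

param <- ScanBamParam(what = "qname", tag = "rq")
bam   <- scanBam("reads.bam", param = param)[[1]]

keep_ids <- bam$qname[!is.na(bam$tag$rq) & bam$tag$rq >= 0.999]
length(keep_ids)                       # number of reads passing QV30
writeLines(keep_ids, "qv30_ids.txt")   # read-name list for downstream extraction
```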

cjfields avatar Jan 25 '23 17:01 cjfields

ccs_report1.csv

Do you mean the read score in this file? This is one of the reports; the other data is too big and I can't share it here.

nhg12 avatar Jan 26 '23 05:01 nhg12

Do you mean the read score in this file? This is one of the reports; the other data is too big and I can't share it here.

It looks like the second column might be the read accuracy (which appears to be 99%). Generally for DADA2 you would need the reads that are 0.999 or better; once you have the read names you may need to use a tool like seqtk or seqkit to extract the reads by ID from the FASTQ.
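
If you'd rather stay in R, the same extraction can be done with ShortRead once you have a read-name list; the file names below are placeholders (e.g. the qv30_ids.txt from the BAM sketch above):

```r
# Sketch: subset a FASTQ by read name in R; file names are placeholders
library(ShortRead)

keep_ids <- readLines("qv30_ids.txt")
fq <- readFastq("reads.fastq.gz")

# FASTQ headers can carry extra text after the read name, so match on the
# first whitespace-delimited token
fq_names <- sub(" .*", "", as.character(id(fq)))
writeFastq(fq[fq_names %in% keep_ids], "reads_qv30.fastq.gz")
```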

cjfields avatar Jan 26 '23 21:01 cjfields