Feature: option to filter dataset based on quality, rather than length

Open gringer opened this issue 5 years ago • 1 comment

My most recent exploration of nanopore data has indicated that the mean base quality reported in the fastq files has a very good correlation with gap-compressed identities when mapping to an ideal reference genome assembly:

https://twitter.com/gringene_bio/status/1268639851638779909

This suggests that the mean base qualities can be treated as a reasonable proxy for the accuracy of a sequence. In a situation where there's excess sequence above what is necessary for assembly (e.g. 100X instead of 40X), assembly may be improved by excluding reads that are predicted to be low quality.
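To make the "excess coverage" idea concrete, here is a minimal Python sketch (not part of canu or of my scripts) that keeps the reads with the highest predicted accuracy until a target coverage is reached. The function name, the `(name, length, mean_accuracy)` tuple layout, and the example numbers are all hypothetical, chosen only for illustration.

```python
def select_reads_for_coverage(reads, genome_size, target_coverage):
    """Pick the most accurate reads until the coverage budget is filled.

    reads: iterable of (name, length, mean_accuracy) tuples (hypothetical layout).
    Returns the names of the kept reads, best accuracy first.
    """
    budget = genome_size * target_coverage  # total bases we want to keep
    kept, total_bases = [], 0
    # Sort by predicted accuracy, highest first.
    for name, length, _accuracy in sorted(reads, key=lambda r: r[2], reverse=True):
        if total_bases >= budget:
            break
        kept.append(name)
        total_bases += length
    return kept


example_reads = [
    ("a", 100, 0.90),
    ("b", 100, 0.99),
    ("c", 100, 0.95),
    ("d", 100, 0.80),
]
# With a 100 bp target and 2X coverage, only the two best reads are kept.
best = select_reads_for_coverage(example_reads, genome_size=100, target_coverage=2)
```

Ranking purely by accuracy ignores read length, so in practice a combined score might work better; this only shows the basic selection.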

I've incorporated mean quality calculations into one of my scripts, but it'd be quicker to do this internally with canu. Here's the rough process I ran through to assemble a 500 bp amplicon from nanopore reads prepared using the rapid sequencing / barcoding kit:

## create quality statistics
~/scripts/fastx-compstats.pl reads_all.fastq.gz > compstats_pass.csv
## identify reads of at least 200 bp with mean quality >= 0.97 (97%)
awk -F ',' '{if($16 >= 0.97 && $18 >= 200){print $1}}' compstats_pass.csv > highqual_readNames.txt
## subset reads to extract high-quality reads
~/scripts/fastx-fetch.pl -i highqual_readNames.txt reads_all.fastq.gz > highQual_gr0.97.fastq.gz
## run canu (using settings to force assembly of a 500bp sequence using 200bp+ reads with 150bp overlap)
canu -nanopore highQual_gr0.97.fastq.gz -p canu_1 -d canu_1 genomeSize=65000 minReadLength=200 minOverlapLength=150 stopOnLowCoverage=0 minInputCoverage=0
## Linux version, just in case it helps
uname -a
[output] Linux elegans 5.6.0-1-amd64 #1 SMP Debian 5.6.7-1 (2020-04-29) x86_64 GNU/Linux
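
For anyone without those helper scripts, the filtering steps above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not a copy of what fastx-compstats.pl or fastx-fetch.pl actually do: it assumes Phred+33 quality encoding and four-line FASTQ records, and it defines mean accuracy as one minus the average per-base error probability (averaging the Phred scores directly would give a different, more optimistic number). All function names are made up for this example.

```python
import gzip


def mean_accuracy(quality, offset=33):
    """Mean per-base accuracy from a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    # Convert each Phred score Q to an error probability 10^(-Q/10), then average.
    total_error = sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)
    return 1.0 - total_error / len(quality)


def read_fastq(handle):
    """Yield (header, sequence, quality) from strict four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends the iteration
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def filter_fastq(in_handle, out_handle, min_accuracy=0.97, min_length=200):
    """Write reads passing both thresholds; return how many were kept."""
    kept = 0
    for header, sequence, quality in read_fastq(in_handle):
        if len(sequence) >= min_length and mean_accuracy(quality) >= min_accuracy:
            out_handle.write(f"{header}\n{sequence}\n+\n{quality}\n")
            kept += 1
    return kept


def filter_fastq_gz(in_path, out_path, min_accuracy=0.97, min_length=200):
    """Same filter, reading and writing gzip-compressed FASTQ files."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        return filter_fastq(fin, fout, min_accuracy, min_length)
```

Calling `filter_fastq_gz("reads_all.fastq.gz", "highQual_gr0.97.fastq.gz")` would then produce a file that can be passed to canu in the same way as the command above.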

The result of this assembly had a 100% match to the reference sequences it was composed from, which I was pretty pleased about - both that it was possible to assemble short amplicons with canu, and that I got excellent assembled accuracy.

It'd be great if this quality-based filtering were implemented in canu as an alternative to the length-based filtering (yes, I am aware that this would require paying attention to quality values of sequences, which I don't think is currently done by canu).

Jun 20 '20 11:06 gringer

I'd rather not incorporate QV filtering specific to Nanopore data directly into the assembler. There are a lot of specialized trimming tools for this.

The current filtering isn't actually length-based but overlap-based. It keeps the longest well-covered reads, so long but erroneous reads will be discarded. You could try restricting corMaxEvidenceErate to 0.15 or 0.20 and see if that filters out the noisier overlaps anyway.

Jun 23 '20 21:06 skoren