ampliseq phred score mis-detection by dada2

Description of the bug

I’ve been testing ampliseq on some Element Biosciences data using standard Illumina settings and discovered an interesting issue where DADA2_ERR fails with error that looks like this

  Error in dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction,  : 
    Invalid derep$quals matrix. Quality values must be positive integers.
  Calls: learnErrors -> dada
	  Execution halted

Turns out, the readFastq function that is being called under the hood mis-interprets ElemBio Phred scores as Phred+64, not Phred+33 - probably because maximium score in ElemBio is 55, while for Illumina it is 41.

I was able to work around this by changing qualityType = "Auto" to qualityType = "FastqQuality" in DADA2_ERR module config.

In addition, I needed to adjust DADA2_DENOISING code to use appropriate qualityType in reading fastq files:

        filtFs <- sort(list.files("./filtered/", pattern = "_1.filt.fastq.gz", full.names = TRUE), method = "radix")
        filtRs <- sort(list.files("./filtered/", pattern = "_2.filt.fastq.gz", full.names = TRUE), method = "radix")

        #denoising
        sink(file = "${prefix}.dada.log")
        derepFs <- derepFastq(filtFs, qualityType="FastqQuality")
        dadaFs <- dada(derepFs, err = errF, $args, multithread = $task.cpus)
        saveRDS(dadaFs, "${prefix}_1.dada.rds")
        derepRs <- derepFastq(filtRs, qualityType="FastqQuality")
        dadaRs <- dada(derepRs, err = errR, $args, multithread = $task.cpus)
        saveRDS(dadaRs, "${prefix}_2.dada.rds")
        sink(file = NULL)

I would appreciate if it would be possible to explicitly set qualityType in DADA2_ERR and DADA2_DENOISING modules - though I’m not sure if this should be:

a sequencing platform parameter - this might be useful if any other tweaks would be need. Problem is I’m only testing steps up to DADA2 ASV generation, so won’t be able to advice on any downstream implications
or a separate pipeline parameter setting phred to default (“Auto” ), 33 (“FastqQuality”) or 64 (“SFastqQuality”) - this might be useful for other platforms with non-standard Phred encoding - @erikrikarddaniel suggested this would be a better solution

Command used and terminal output

nextflow run nf-core/ampliseq -r 2.11.0 --input $manifest \
        -c $config -profile singularity --outdir ampliseq --skip_fastqc \
        --FW_primer AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --RV_primer AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA \
        --retain_untrimmed --illumina_pe_its --skip_qiime --skip_taxonomy \
        --ignore_empty_input_files --ignore_failed_trimming --ignore_failed_filtering --skip_barrnap -resume

Relevant files

No response

System information

linux x86_64 nextflow =24.04.02 ampliseq =2.11.0 singularity =3.10.0 local executor

Aug 22 '24 08:08 amakunin

UPD - another place where qualities should be specified is DADA2_FILTNTRIM - qualityType = "Auto" is specified in ext.args section of module config as of ampliseq 2.11.0.

I am also unsure if qualityType should be specified in DADA2_QUALITY - the main functionplotQualityProfile does not seem to accept qualityType as an argument thogh

Aug 22 '24 15:08 amakunin

Thanks for reporting, and thanks for the update. The main developer is on vacation, so nothing will happen for a few weeks. We might come back to you with further questions when he's back.

Aug 23 '24 07:08 erikrikarddaniel

And another small update - unfortunately I won't be able to finish the test run. The reverse reads in my dataset had very low qualities, so did not pass DADA2_DENOISING (1_2.dada.rds file was not generated). As a result DADA2_STATS module failed with dadaRs = readRDS("${denoised[1]}") being transformed into dadaRs = readRDS("null") - maybe it would be possible to add a check if two files are being generated by denoising for paired-end data here?

Aug 28 '24 09:08 amakunin

Hi there,

a separate pipeline parameter setting phred to default (“Auto” ), 33 (“FastqQuality”) or 64 (“SFastqQuality”)

that sounds like a good addition to me. However, test data (that could also be uploaded to our test data repo https://github.com/nf-core/test-datasets/tree/ampliseq) would be extremely important.

DADA2_QUALITY is primarily for visualization and also aids to find length cutoffs for DADA2_FILTNTRIM, so it seems not surprising that DADA2_QUALITY has no quality type setting.

Sep 02 '24 09:09 d4straub

Hi @d4straub - it seems that DADA2_QUALITY also auto-infers Phred score, but in a different way - plotErrors function calls qa function from ShortRead library - I could not figure out if it supports different Phred scores, or just auto-infers Phred+33. In my case it correctly interpreted the data as Phred+33.

I'd be happy to provide some test data, but for now I only have forward reads of decent quality, while all reverse reads have low quality - I'll need to either generate fake Phred scores or wait for better data to appear. Please also keep in mind that my data is not compatible with downstream applications like QIIME, so it might be easier to replace a few Phred chars corresponding to base qualities between 41-55 in your existing test datasets

Sep 03 '24 15:09 amakunin

If only forward reads are ok, isnt that also sufficient for implementing phred score flexibility? I would think so.

Sep 13 '24 08:09 d4straub

Hi @d4straub, apologies for the delay. It seems that we might get proper paired end data soon. I'd prefer to provide that rather than just single end data - our usage of ampliseq relies heavily on the paired reads. Would this extra delay work for you? If not, I could introduce some higher base quality values, e.g. into https://github.com/nf-core/test-datasets/blob/ampliseq/testdata/1_S103_L001_R1_001.fastq.gz and its reverse counterpart - this should be accurate enough representation of ElemBio data...

Sep 23 '24 10:09 amakunin

Hi @amakunin , real data would be much better, than we could use a small subset for continuous integration testing. Please update in case you are able to share data.

Oct 24 '24 08:10 d4straub

I had a more detailed look at the code. Previously derepFastq was removed, see https://github.com/nf-core/ampliseq/issues/348. I do not think to re-introduce derepFastq is a good idea for the general usage, because I understand it has memory disadvantages, see https://github.com/benjjneb/dada2/issues/2043. So I restricted the use of derepFastq to non-Auto settings. Could you check out https://github.com/nf-core/ampliseq/pull/801, i.e. run

nextflow pull d4straub/ampliseq
nextflow pull d4straub/ampliseq -r add-param-for-phred-encoding
nextflow run d4straub/ampliseq -r add-param-for-phred-encoding --quality_type "FastqQuality" <your params as usual>

if that works as expected, only some test data is still missing to make it perfect.

Nov 19 '24 13:11 d4straub

Dear @amakunin , could you test the solution as described above? I would like to have that solved or abandoned.

Jan 14 '25 11:01 d4straub

Dear @d4straub - thanks for providing the update and pinging me, I'll aim to test this asap

Jan 16 '25 14:01 amakunin

Thanks so much @d4straub , the updated workflow works as expected and produces reasonable DADA2 outputs and QC plots with base qualities around 45-46.

I've prepared test data of 1523 read pairs that do contain a lot of Q45 bases and do end up reconstructing around 35 consensus sequences - these are from multiplex amplicon sequencing. Should I submit these data as a pull request in the test data repo https://github.com/nf-core/test-datasets/tree/ampliseq or maybe just post here?

Jan 24 '25 14:01 amakunin

Thanks!

Should I submit these data as a pull request in the test data repo https://github.com/nf-core/test-datasets/tree/ampliseq or maybe just post here?

A PR would be great, but if thats too much then just post them here.

Jan 24 '25 14:01 d4straub

Thanks @d4straub, PR created https://github.com/nf-core/test-datasets/pull/1477

Jan 28 '25 13:01 amakunin

Hi @amakunin , sorry for the late reply. Allocating some time to ampliseq during the hackathon. I tested the elembio est data with nextflow run d4straub/ampliseq -r add-param-for-phred-encoding -profile singularity --input Samplesheet_elembio.tsv --outdir test_elembio --FW_primer AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --RV_primer AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA --retain_untrimmed --illumina_pe_its --skip_qiime --skip_taxonomy --skip_barrnap -resume where Samplesheet_elembio.tsv contains:

sampleID	forwardReads	reverseReads
test2	https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/test2_elembio_R1.fastq.gz	https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/test2_elembio_R2.fastq.gz

and it runs fine. I would have expected an error? There is no need to adjust --quality_type. Therefore the test data seems unsuitable? Or did I miss something?

Mar 24 '25 09:03 d4straub

I have added the new parameter but I have not added a test due to the apparently insufficient test data.

Mar 26 '25 09:03 d4straub

Hi @d4straub, thanks again for looking into this. I'm utterly confused: both d4straub/ampliseq -r add-param-for-phred-encoding and nf-core/ampliseq -r 2.11.0 correctly infer base qualities with the test data I've provided. I'm now re-running the original dataset with both versions to see if I can reproduce the error. My working hypothesis - the offending base qualities were not present in the test file I've provided, but in some other file in the dataset. Will keep you posted!

Mar 28 '25 10:03 amakunin

Very sorry, @d4straub - I cannot reproduce the bug with phred score mis-detection with the original ampliseq version for which I was reporting this behaviour (2.11.0) and the complete dataset I was using back in the day. I'm not sure if there's anything else that can be done here, but if you have any ideas - do let me know

Apr 01 '25 14:04 amakunin

Alright, then we keep it as it is. 2.13.0 was just released and includes the --quality_type parameter. In case the quality type is still an issue, we can revisit the topic. For now I close the issue.

Apr 04 '25 11:04 d4straub