TrimGalore
Confusing reporting of number of removed short sequences
I noticed that the trimming reports do not include the number of reads removed for being too short in the main summary section; instead, this count only appears at the very end of the report. I stumbled upon this while debugging the MultiQC output of the nf-core/rnaseq pipeline, which stated that all of my reads passed the filters even though the trimmed FASTQ files contained fewer reads than the original input files, presumably because MultiQC parses the main summary section and skips this final line. See: https://github.com/MultiQC/MultiQC/issues/1557#issuecomment-2618365525
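To illustrate the failure mode, here is a minimal sketch of a report parser that only consumes the `=== Summary ===` block (a hypothetical example, not MultiQC's actual parsing code); a parser like this never reaches the pair-removal count at the end of the report:

```python
# Hypothetical sketch of the failure mode (not MultiQC's actual code):
# a parser that only reads the "=== Summary ===" block never sees the
# "Number of sequence pairs removed ..." line that Trim Galore appends
# at the very end of the report.
def parse_summary(report_text: str) -> dict:
    stats = {}
    in_summary = False
    for line in report_text.splitlines():
        if line.strip() == "=== Summary ===":
            in_summary = True
            continue
        if in_summary:
            if ":" not in line:        # end of the key/value block
                break
            key, _, value = line.partition(":")
            stats[key.strip()] = value.strip()
    # stats now contains "Reads written (passing filters): 2,000,000 (100.0%)"
    # and no trace of the removed pairs reported further down.
    return stats
```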
Below are some example output files for paired input; from my tests, the behaviour for single-end reads appears to be the same.
CLI output: note that the summaries say 100% of reads pass the filters, but the very last lines show that many pairs are actually discarded.
```
$ apptainer exec /scratch/antwerpen/203/vsc20380/apptainer/nextflow_cache/depot.galaxyproject.org-singularity-trim-galore-0.6.7--hdfd78af_0.img trim_galore -j 2 -e 0.1 -q 20 -O paired --paired -a AGATCGGAAGAGC ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R1_001.fastq.gz ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R2_001.fastq.gz
/// start of output omitted
=== Summary ===
Total reads processed: 2,000,000
Reads with adapters: 1,363,071 (68.2%)
Reads written (passing filters): 2,000,000 (100.0%)
Total basepairs processed: 302,000,000 bp
Quality-trimmed: 1,692,198 bp (0.6%)
Total written (filtered): 157,577,465 bp (52.2%)
/// overview of removed sequences per adapter omitted
RUN STATISTICS FOR INPUT FILE: ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R1_001.fastq.gz
=============================================
2000000 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)
Writing report to '/scratch/antwerpen/203/vsc20380/wbc_depletion/_manual_test_results/paired/subset-2M-104974-001-001_S1_L001_R2_001.fastq.gz_trimming_report.txt'
/// start of output for R2
=== Summary ===
Total reads processed: 2,000,000
Reads with adapters: 429,014 (21.5%)
Reads written (passing filters): 2,000,000 (100.0%)
Total basepairs processed: 302,000,000 bp
Quality-trimmed: 2,997,672 bp (1.0%)
Total written (filtered): 294,579,396 bp (97.5%)
/// overview of removed sequences per adapter omitted
RUN STATISTICS FOR INPUT FILE: ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R2_001.fastq.gz
=============================================
2000000 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)
Validate paired-end files subset-2M-104974-001-001_S1_L001_R1_001_trimmed.fq.gz and subset-2M-104974-001-001_S1_L001_R2_001_trimmed.fq.gz
file_1: subset-2M-104974-001-001_S1_L001_R1_001_trimmed.fq.gz, file_2: subset-2M-104974-001-001_S1_L001_R2_001_trimmed.fq.gz
>>>>> Now validing the length of the 2 paired-end infiles: subset-2M-104974-001-001_S1_L001_R1_001_trimmed.fq.gz and subset-2M-104974-001-001_S1_L001_R2_001_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to subset-2M-104974-001-001_S1_L001_R1_001_val_1.fq.gz
Writing validated paired-end Read 2 reads to subset-2M-104974-001-001_S1_L001_R2_001_val_2.fq.gz
Total number of sequences analysed: 2000000
Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 922448 (46.12%)
Deleting both intermediate output files subset-2M-104974-001-001_S1_L001_R1_001_trimmed.fq.gz and subset-2M-104974-001-001_S1_L001_R2_001_trimmed.fq.gz
```
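For reference: 2,000,000 - 922,448 = 1,077,552 pairs actually survive, even though both per-read summaries report 2,000,000 (100.0%) reads written as passing filters.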
Contents of the R1 and R2 trimming reports. Again, the summaries say 100% of reads pass the filters, but the lines at the very end of the R2 report show that many pairs are actually discarded.
```
//
This is cutadapt 3.4 with Python 3.9.6
Command line parameters: -j 2 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R1_001.fastq.gz
Processing reads on 2 cores in single-end mode ...
Finished in 20.55 s (10 µs/read; 5.84 M reads/minute).
=== Summary ===
Total reads processed: 2,000,000
Reads with adapters: 1,363,071 (68.2%)
Reads written (passing filters): 2,000,000 (100.0%)
Total basepairs processed: 302,000,000 bp
Quality-trimmed: 1,692,198 bp (0.6%)
Total written (filtered): 157,577,465 bp (52.2%)
//
RUN STATISTICS FOR INPUT FILE: ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R1_001.fastq.gz
=============================================
2000000 sequences processed in total
//
This is cutadapt 3.4 with Python 3.9.6
Command line parameters: -j 2 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R2_001.fastq.gz
Processing reads on 2 cores in single-end mode ...
Finished in 33.76 s (17 µs/read; 3.55 M reads/minute).
=== Summary ===
Total reads processed: 2,000,000
Reads with adapters: 429,014 (21.5%)
Reads written (passing filters): 2,000,000 (100.0%)
Total basepairs processed: 302,000,000 bp
Quality-trimmed: 2,997,672 bp (1.0%)
Total written (filtered): 294,579,396 bp (97.5%)
//
RUN STATISTICS FOR INPUT FILE: ../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R2_001.fastq.gz
=============================================
2000000 sequences processed in total
Total number of sequences analysed for the sequence pair length validation: 2000000
Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 922448 (46.12%)
```
What I'm still confused about is that the individual reports already contain the information about too-short sequences (at the end of the report, not in the summary statistics), even though the log above claims this validation only happens afterwards, i.e. after the reports have already been written out. Looking at the command line parameters used for read 1 and read 2, it also seems that cutadapt is run independently on each file rather than in a combined paired-end run, which the cutadapt documentation says is required to ensure that reads discarded from one file are also removed from the other, keeping the two files in sync.
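For clarity, here is a minimal sketch of what the validation step apparently does, based on the log messages above; this is an assumed reconstruction in Python, not Trim Galore's actual Perl code:

```python
# Assumed behaviour of the validation step (a sketch, not Trim Galore's
# actual implementation): walk both independently trimmed FASTQ files
# in lockstep and drop a pair if EITHER mate is below the length cutoff.
import gzip

LENGTH_CUTOFF = 20  # bp, matching the log above

def fastq_records(path):
    """Yield (header, sequence, plus, quality) tuples from a gzipped FASTQ."""
    with gzip.open(path, "rt") as fh:
        while True:
            record = [fh.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:
                return
            yield tuple(record)

def validate_pairs(r1_path, r2_path):
    kept = removed = 0
    for rec1, rec2 in zip(fastq_records(r1_path), fastq_records(r2_path)):
        if len(rec1[1]) >= LENGTH_CUTOFF and len(rec2[1]) >= LENGTH_CUTOFF:
            kept += 1      # both mates long enough: pair survives
        else:
            removed += 1   # at least one mate too short: drop both
    return kept, removed
```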
Lastly, when you compare this to a roughly equivalent direct cutadapt run (I'm not sure all parameters match exactly), you can see that cutadapt reports the number of removed short reads together with the other statistics in the main summary / read fate breakdown:
```
$ apptainer exec ../cutadapt\:5.0--py39hbcbf7aa_0 cutadapt -o out.1.fastq -p out.2.fastq -e 0.1 -q 20 -m 20 --json=cutadapt.json -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT ../../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R1_001.fastq.gz ../../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R2_001.fastq.gz
This is cutadapt 5.0 with Python 3.9.21
Command line parameters: -o out.1.fastq -p out.2.fastq -e 0.1 -q 20 -m 20 --json=cutadapt.json -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT ../../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R1_001.fastq.gz ../../data/fastq-subset/subset-2M-104974-001-001_S1_L001_R2_001.fastq.gz
Processing paired-end reads on 1 core ...
=== Summary ===
Total read pairs processed: 2,000,000
Read 1 with adapter: 1,063,555 (53.2%)
Read 2 with adapter: 131,161 (6.6%)
== Read fate breakdown ==
Pairs that were too short: 931,882 (46.6%)
Pairs written (passing filters): 1,068,118 (53.4%)
Total basepairs processed: 604,000,000 bp
  Read 1: 302,000,000 bp
  Read 2: 302,000,000 bp
Quality-trimmed: 4,689,870 bp (0.8%)
  Read 1: 1,692,198 bp
  Read 2: 2,997,672 bp
Total written (filtered): 311,374,085 bp (51.6%)
  Read 1: 155,748,969 bp
  Read 2: 155,625,116 bp
```
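Here the arithmetic is consistent within the summary itself: 2,000,000 - 931,882 = 1,068,118 pairs written, matching the read fate breakdown.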
The .json files also report the correct number of output reads passing all filters:
"read_counts": {
"input": 2000000,
"filtered": {
"too_short": 931882,
"too_long": null,
"too_many_n": null,
"too_many_expected_errors": null,
"casava_filtered": null,
"discard_trimmed": null,
"discard_untrimmed": null
},
"output": 1068118,
"reverse_complemented": null,
"read1_with_adapter": 1063555,
"read2_with_adapter": 131161
},
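Those counts can be cross-checked directly; a small sketch, assuming the `cutadapt.json` file produced by the command above:

```python
# Sanity check on cutadapt's JSON report (cutadapt.json from the command
# above): input reads = too-short reads + output reads.
import json

with open("cutadapt.json") as fh:
    counts = json.load(fh)["read_counts"]

assert counts["input"] == counts["filtered"]["too_short"] + counts["output"]
print(counts["input"], counts["filtered"]["too_short"], counts["output"])
# -> 2000000 931882 1068118
```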
Hope the above makes sense and that I didn't miss something already known.