Description of feature

Authors: Juliana Assis, Sebastian Schulz and Albert Palleja

We would like to emphasize the importance to facilitate the processing of prokaryotic bulk RNAseq data using the v3.x nf-core/rnaseq pipelines, as proposed by Matthias Zepper previously (#1085).
Given the growing importance of (large-scale) transcriptomic analyses in microbial biology and biotechnology, there is a need for standardized and thus reproducible processing of prokaryotic RNAseq data (bacterial and archaeal). Although bacterial RNAseq data have been processed with nf-core/rnaseq pipelines previously, this was – to the best of our knowledge – either performed using the older v1.4.2 pipeline (Muehler et al., 2022 and Muehler et al., 2024) or using a workaround that required prior manipulation of input files in combination with setting specific pipeline parameters (summarized by Marine Cambon in a Slack thread).

Our recent attempts to run the nf-core/rnaseq pipeline v3.16.1, launched from the Seqera platform, using said workaround, however, failed (for details see section "Pipeline testing").

Conclusion

Neither the use of old pipeline versions nor the manipulation of input files is desirable from the perspective of using most updated software pipelines and standardized data processing workflows, respectively.

Pipeline testing

We followed the suggested workaround procedure for processing prokaryotic RNAseq data sets by using a genome fasta, a modified gtf and a self-generated transcript fasta file as input to the pipeline (v3.16.1) and by setting the following pipeline parameters:

--extra_star_align_args "--sjdbGTFfeatureExon CDS" (suggested in slack post)
--featurecounts_feature_type "CDS" (modified by us to use CDS from the gtf file; pipeline default setting is "exon").

The pipeline failed at Salmon quantification stage. The error message reports a mismatch of sequence lengths between entries of the user-provided transcript fasta file and the pipeline-generated SAM file.

Error message

Workflow execution completed unsuccessfully
 
The exit status of the task that caused the workflow execution to fail was: 1
Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (SRR9681149)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (SRR9681149)` terminated with an error exit status (1)


Command executed:

  salmon quant \
      --geneMap GCF_000005845.2_ASM584v2_genomic.filtered.gtf \
      --threads 6 \
      --libType=ISR \
      -t GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna \
      -a SRR9681149.Aligned.toTranscriptome.out.bam \
       \
      -o SRR9681149
  
  if [ -f SRR9681149/aux_info/meta_info.json ]; then
      cp SRR9681149/aux_info/meta_info.json "SRR9681149_meta_info.json"
  fi
  if [ -f SRR9681149/lib_format_counts.json ]; then
      cp SRR9681149/lib_format_counts.json "SRR9681149_lib_format_counts.json"
  fi
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT":
      salmon: $(echo $(salmon --version) | sed -e "s/salmon //g")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Version Info: This is the most recent version of salmon.
  # salmon (alignment-based) v1.10.1
  # [ program ] => salmon 
  # [ command ] => quant 
  # [ geneMap ] => { GCF_000005845.2_ASM584v2_genomic.filtered.gtf }
  # [ threads ] => { 6 }
  # [ libType ] => { ISR }
  # [ targets ] => { GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna }
  # [ alignments ] => { SRR9681149.Aligned.toTranscriptome.out.bam }
  # [ output ] => { SRR9681149 }
  Logs will be written to SRR9681149/logs
  Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }
  [2025-01-27 14:07:36.152] [jointLog] [info] setting maxHashResizeThreads to 6
  [2025-01-27 14:07:36.152] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
  [2025-01-27 14:07:36.152] [jointLog] [info] numQuantThreads = 3
  parseThreads = 3
  Checking that provided alignment files have consistent headers . . . done
  Populating targets from aln = "SRR9681149.Aligned.toTranscriptome.out.bam", fasta = "GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna" . . .[0;31m
  
  SAM file says target gnl|b0001|mrna.NP_414542 has length 63, but the FASTA file contains a sequence of length [66 or 65]

Work dir:
  az://seqera/scratch/18YRRZhBVsZxN6/0e/7701dec8788b9f8519918c4a03f516

Container:
  quay.io/biocontainers/salmon:1.10.1--h7e5ed60_0

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Steps to reproduce the error

Download the Escherichia coli RNAseq data set PRJNA554579 published by Rychel et al., 2023.
Download the reference genome fasta and the annotation gtf file for Escherichia coli str. K-12 substr. MG1655 (accession: GCF_000005845.2).
Extract and modify the gtf file by deleting all lines with an empty "transcript_id" field.

gunzip -k annotation.gtf.gz
sed '/transcript_id "" /d' annotation.gtf  > annotation_modified.gtf

Generate a custom transcript file from the extracted genome fasta and in 3. modified gtf file using gffread (v0.12.7).

gunzip -k genome.fna
gffread -w transcript.fna -g genome.fna annotation_modified.gtf

Run the nf-core/rnaseq pipeline v3.16.1 using (an appropriately adapted) command and configuration (only the essential part of our configuration is shown below).

Command

nextflow run 'https://github.com/nf-core/rnaseq'
         -name Ecoli_gtf_transcript_files_modified
         -params-file 'https://api.cloud.seqera.io/ephemeral/mDXJ21XZtCJ8MH6N6Jumgg.json'
         -with-tower
         -r 3.16.1

Parameters

params {
   input = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/samplesheet.csv'
   genome = null
   splicesites = null
   gtf_extra_attributes = 'gene_name'
   gtf_group_features = 'gene_id'
   skip_gtf_filter = false
   skip_gtf_transcript_filter = false
   featurecounts_feature_type = 'CDS'
   featurecounts_group_type = 'gene_biotype'
   gencode = false
   save_reference = false
   igenomes_base = 's3://ngi-igenomes/igenomes/'
   igenomes_ignore = false
   with_umi = false
   skip_umi_extract = false
   umitools_extract_method = 'string'
   umitools_grouping_method = 'directional'
   umitools_dedup_stats = false
   umitools_bc_pattern = null
   umitools_bc_pattern2 = null
   umitools_umi_separator = null
   umi_discard_read = null
   save_umi_intermeds = false
   trimmer = 'trimgalore'
   min_trimmed_reads = 10000
   extra_trimgalore_args = null
   extra_fastp_args = null
   save_trimmed = false
   skip_trimming = false
   bbsplit_fasta_list = null
   save_bbsplit_reads = false
   skip_bbsplit = true
   remove_ribo_rna = false
   save_non_ribo_reads = false
   ribo_database_manifest = '/.nextflow/assets/nf-core/rnaseq/workflows/rnaseq/assets/rrna-db-defaults.txt'
   aligner = 'star_salmon'
   pseudo_aligner = null
   pseudo_aligner_kmer_size = 31
   seq_center = null
   bam_csi_index = false
   star_ignore_sjdbgtf = false
   salmon_quant_libtype = null
   hisat2_build_memory = '200.GB'
   stringtie_ignore_gtf = false
   min_mapped_reads = 5
   extra_star_align_args = '--sjdbGTFfeatureExon CDS'
   extra_salmon_quant_args = null
   extra_kallisto_quant_args = null
   kallisto_quant_fraglen = 200
   kallisto_quant_fraglen_sd = 200
   save_merged_fastq = false
   save_unaligned = false
   save_align_intermeds = false
   skip_markduplicates = false
   skip_alignment = false
   skip_pseudo_alignment = false
   stranded_threshold = 0.8
   unstranded_threshold = 0.1
   skip_qc = false
   skip_bigwig = false
   skip_stringtie = false
   skip_fastqc = false
   skip_preseq = true
   skip_dupradar = false
   skip_qualimap = false
   contaminant_screening = null
   kraken_db = null
   save_kraken_assignments = false
   save_kraken_unassigned = false
   bracken_precision = 'S'
   skip_rseqc = false
   skip_biotype_qc = false
   skip_deseq2_qc = false
   skip_multiqc = false
   deseq2_vst = true
   rseqc_modules = 'bam_stat,inner_distance,infer_experiment,junction_annotation,junction_saturation,read_distribution,read_duplication'
   multiqc_config = null
   multiqc_title = null
   multiqc_logo = null
   max_multiqc_email_size = '25.MB'
   multiqc_methods_description = null
   outdir = 'az://seqera/sebastian_tests/project2_ecoli_partial/results_gtf_transcipt_files_modified/'
   publish_dir_mode = 'copy'
   email = null
   email_on_fail = null
   plaintext_email = false
   monochrome_logs = false
   hook_url = null
   help = false
   help_full = false
   show_hidden = false
   version = false
   pipelines_testdata_base_path = 'https://raw.githubusercontent.com/nf-core/test-datasets/7f1614baeb0ddf66e60be78c3d9fa55440465ac8/'
   config_profile_name = null
   config_profile_description = null
   custom_config_version = 'master'
   umi_dedup_tool = 'umitools'
   extra_fqlint_args = '--disable-validator P001'
   custom_config_base = 'https://raw.githubusercontent.com/nf-core/configs/master'
   skip_linting = false
   fasta = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/GCF_000005845.2_ASM584v2_genomic.fna.gz'
   gtf = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved.gtf.gz'
   transcript_fasta = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna.gz'
   config_profile_contact = null
   config_profile_url = null
   validate_params = true
   genomes {
…

Recommendations for v3.x pipeline development

The following points are subject to discussion with the nf-core community.

Prokaryotic vs. eukaryotic mode
It might be beneficial to have a parameter to run the pipeline in either --prokaryotic or --eukaryotic mode, as suggested previously (#1085).

Integration of featureCounts
Re-introducing featureCounts into v3.x pipelines might be beneficial for processing prokaryotic datasets. This has been suggested recently (Perelo et al., 2024):

“Regarding pipeline development, it would be worth discussing to include the tool featureCounts as a quantification tool option also in the v3.x pipelines. The advantage of featureCounts on datasets with a high number of one-isoform genes has been shown here and in a previous study (11) and should also be kept in mind for the analysis of prokaryotic transcriptomes.”
(text snippet from Perelo et al., 2024)

Alignment and Quantification Tools
Using STAR might not be optimal for prokaryotic datasets. A splice-unaware aligner might be the better choice since bacterial transcripts lack introns and are therefore not (alternatively) spliced (Mahmud et al., 2021):

“Prokaryotic RNA-Seq analysis is challenging because most available RNA-Seq packages assume the input data reflect eukaryotic gene structures, which in many aspects differ from those of prokaryotes (Johnson et al., 2016). Bacterial transcripts do not have introns and are not alternatively spliced; therefore, using an aligner developed to consider splice junctions often increases falsely assigned reads in the genome (Magoc et al., 2013). Moreover, unlike in eukaryotes, under specific stresses, the expression of almost all prokaryotic genes can change (Creecy and Conway, 2015).”
(text snippet from Mahmud et al., 2021)

We suggest integrating Bowtie/Bowtie2 as a splice-unaware aligner and featureCounts for quantification. This combination is used in other prokaryotic RNAseq processing pipelines (see the iModulon pipeline and the ProkSeq pipeline).

Mar 12 '25 22:03 sebschulz1

Hi @sebschulz1! Thank you for the beautifully thought-out summary 🤩 I completely agree with you that the pipeline hasn't handled prokaryotic genomes optimally - partly because of poor annotation consistency and partly to keep the maintenance overhead down.

From an implementation perspective, has anyone tried to run STAR directly against the genome and not the transcriptome and then using featureCounts downstream? Trying to figure out if we can avoid adding and using another aligner to the pipeline if we don't see any explicit gains?

Mar 17 '25 18:03 drpatelh

Indeed, thanks a lot for this thorough assessment. Most welcome if the three of you would be eager working in this direction!

I wonder, what the difference between using featureCounts and just STARs (--quantMode GeneCounts) native counting mechanism would be? Except when UMI deduplication is performed after the alignment, there is probably a high concordance between featureCounts and counts obtained during alignment?

Also, by setting an extremely high penalty on the opening of gaps, STAR can presumably be used as an splice-unaware aligner: --alignIntronMax 1 --alignIntronMin 2 --scoreDelOpen -10000 --scoreInsOpen -10000 --alignEndsType EndToEnd . This has been suggested by Alex Dobin, the author of STAR himself.

Mar 17 '25 19:03 MatthiasZepper

Thanks for all the above everyone :-).

It's been on my list for a while to improve the prokaryotic stuff to incorporate suggestions from previous discussion, but I maybe hadn't been planning changes quite as extensive as some of what's here, so good have the proposals clear.

Adding a new aligner would be quite a bit of additional complexity, we'd have new params, new genome indexing steps etc. It may be worth it, but when we start heading that way I start to wonder if maybe we should be thinking of a specialist workflow. Preprocessing is already a subworkflow, for example, so could be re-used in a new workflow easily.

Before we start thinking that way, let's see if existing workarounds (with improvements as suggested by @MatthiasZepper ) do the trick.

Stage 1

The first step is to see how effective adapting the current components for prokaryotic data can be, using existing workarounds. The Salmon error doesn't invalidate the approach, it just means we need to investigate/ debug as necessary.

I think we probably need a prokaryotic profile. That could override the existing workflow's parameterisation (e.g. extra_star_align_args) to work better with prokaryotic data.

Stage 2

If parameterisation is insufficient and we need e.g. preprocessing steps specifically for prokaryotic data, we could add some new pipeline logic, maybe moving eukaryotic and prokaryotic alignment / quantification to their own local subworkflows,

Stage 3

The above have proved insufficient, and we move to use a completely different toolset for prokaryotic data- e.g. Bowtie2 -> featureCounts. Again, at that point I'd be asking if it was actually appropriate to do both types of analysis in the same pipeline. Might be, but the question would need to be asked.

Mar 18 '25 10:03 pinin4fjords

Hello guys! Cool with all the input and the different suggestions stated here. Thank you for all of the answers!

A team from DTU including @sebschulz1 , @Juassis , @pcolaianni , @txellext and I are testing the different options and hopefully we will share here some results!

Mar 24 '25 09:03 apalleja

Hello, is there any ETA for support of analysis of prokaryotic genomes? Best Marek

Jul 15 '25 07:07 SchwarzMarek

I just landed here after running into an error trying to run this on a prokaryote - I'm hitting an error in dupradar for some samples, but I also noticed the fact that I only have ~100 genes in my salmon output.

I wonder if it would be worth adding a note about this in the README? There's no mention at all in the Introduction, and according to the linked slack comments, the instructions for prokaryotes is out of date. Not an awesome experience for a new user (🙋)

Sep 05 '25 19:09 kescobo

🧬 Running STAR + Salmon for E. coli Genomes

This test was performed with two versions of the nf-core/rnaseq pipeline:

🔹 3.16.1
🔹 3.19.0

👉 Our goal here is to find the source of the problem and share a temporary manual fix. This workaround doesn’t require any changes to the nf-core/rnaseq pipeline. Here, we (@sebschulz1, @apalleja, @pcolaianni) show how to patch the GTF file so it works with the pipeline.

Meanwhile, we are still evaluating multiple software combinations (see our GitHub repo).

❌ What is the error?

The nf-core/rnaseq community working with prokaryotic genomes has repeatedly reported this issue in different ways.
All reports trace back to the genome annotation limitations for bacteria:

Most bacterial GTF/GFF files ❌ do not include exon information for coding regions.
They usually contain exon features only for tRNAs and rRNAs.
Instead, the CDS feature is used to represent protein-coding genes.

🔬 Biologically, this is fine. But many RNA-seq tools were originally developed for eukaryotic genomes, leading to several limitations for bacteria.

🚧 Where is the problem in nf-core/rnaseq?

Problematic task:

NFCORE_RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA

Container used:

community.wave.seqera.io/library/rsem_star:5acb4e8c03239c32

Command:

rsem-prepare-reference \ 
--gtf clean.filtered.fixed.gtf \ 
--num-threads 12 \ 
\ 
rsem/GCF_000005845.2_ASM584v2_genomic.fna \ 
rsem/genome 

cp rsem/genome.transcripts.fa .

Originally RSEM (RNA-Seq by Expectation-Maximization) is a tool to estimate gene and isoform expression levels from RNA-seq data. Although we're not selecting RSEM as the expression quantification tool, the nf-core/rnaseq pipeline still uses it to prepare the genome reference rsem-prepare-reference.

ℹ️ By default, RSEM uses the exon feature to extract the transcript file.
There is currently no option in the nf-core/rnaseq pipeline to force RSEM to use CDS instead.
We also could not find a built-in RSEM flag to switch from exon → CDS.

👉 The pipeline suggests using gffread instead, but it does not actually implemented it.

🛠️ How did we fix the RSEM limitation for bacteria?

There are many possible strategies, but the solution that worked best for us was:

➡️ Fix the GTF file before running the nf-core/rnaseq pipeline.

We recommend using AGAT.

🔹 What is AGAT?

AGAT (Another Gtf/Gff Analysis Toolkit) is a collection of Perl tools for working with genome annotation files (GFF and GTF). It allows you to convert, check, edit, merge, and extract information from annotation files. Very useful because genome annotations from NCBI, Ensembl, or other sources often differ in format and consistency.

What happens with the GTF/GFF file when we use agat_convert_sp_gff2gtf.pl:

Reads the GFF3 or GTF file (hierarchical structure with features like gene → mRNA → exon → CDS).
Flattens and reformats it into GTF (tab-delimited, feature-centric format).
Ensures that required fields (like gene_id, transcript_id) are present and consistent.

📋 Steps to Reproduce

📥 Download the E. coli RNA-seq dataset PRJNA554579 (Rychel et al., 2023).
📥 Download the reference genome FASTA and annotation GTF file for E. coli K-12 MG1655 (GCF_000005845.2).
🧹 Fix the GTF file with AGAT (see below).
▶️ Run the nf-core/rnaseq pipeline with bacterial parameters.

🔧 Fixing the GTF file with AGAT

agat_convert_sp_gff2gtf.pl --gff 1_test.gff -o clean.filtered.gtf

📝 It is possible to pass a GTF file as input file.

⚠️ Unfortunately, the E. coli GTF is still incompatible with RSEM. You need to further clean the transcript_id field.

❌ Do NOT make transcript_id empty as suggested in previous version, STAR + Salmon will not accept it.
✅ Instead, fix the error messages by replacing invalid characters:

sed 's/transcript_id "mrna\.CDS=/transcript_id "mrna_CDS_/g' clean.filtered.gtf > clean.filtered.fixed.gtf

🚀 Running STAR + Salmon

Once the GTF is fixed, you are almost ready to run the pipeline.
For prokaryotes, STAR must be run in splice-unaware mode.

Recommended parameters:

--alignIntronMax 1 
--alignIntronMin 2 
--scoreDelOpen -10000 
--scoreInsOpen -10000 
--alignEndsType EndToEnd 
--quantMode TranscriptomeSAM 
--noLengthCorrection

Explanation:

--alignIntronMax 1 → disables introns (no splicing).
--alignIntronMin 2 → required since min < max, but irrelevant with introns disabled.
--scoreDelOpen -10000 / --scoreInsOpen -10000 → prohibit indels.
--alignEndsType EndToEnd → no soft clipping, full-length alignments only.
--quantMode TranscriptomeSAM → output transcriptome-mapped BAM for quantification.
--noLengthCorrection → disables Salmon effective length correction (arguably better for CDS-based quantification).

📊 Summary of the Analysis

🐌 Bottleneck → rsem-prepare-reference step for bacterial genomes.
🧹 Fixing the GTF is mandatory before running nf-core/rnaseq for bacteria.
🛠 AGAT works best for conversion.
⚠️ Even after AGAT, small manual adjustments may be needed depending on annotation quality.
🔧 STAR requires custom parameters to run correctly in splice-unaware mode for bacteria.

⚠️ Important Considerations

We tested many parameter combinations and correction strategies.
This is the most reliable approach so far.

Example error message (before applying --noLengthCorrection):

Checking that provided alignment files have consistent headers . . . done
Populating targets from aln = "SRR9681149.Aligned.toTranscriptome.out.bam", fasta = "GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna"
SAM file says target gnl|b0001|mrna.NP_414542 has length 63, but the FASTA file contains a sequence of length [66 or 65]

✅ Fixed with:

--noLengthCorrection

Another possible error/warning:

[warning] Transcript unassigned_transcript_209 appears in the reference but did not appear in the BAM
[jointLog] [critical] Transcript mrna.CDS appeared in the BAM header, but was not in the provided FASTA file

👉 This is a mismatch between BAM headers (from STAR) and the transcriptome FASTA.
The pipeline suggests using gffread, but in practice, you may need to manually reconcile BAM header IDs with FASTA IDs.

🦠 We only tested it with E. coli, our goal is to test with more genomes.

🔭 What are we doing now?

Bacterial annotations differ significantly from eukaryotic ones, and forcing them into the same software ecosystem is challenging.

We are currently:

🧪 Evaluating different combinations of software.
🐳 Packaging them into a Docker container for reproducibility.
📦 Preparing results to share soon in our GitHub.

💡 Our strategy we consider replacing the alignment and quantification toolsets entirely (e.g., using Bowtie2 and featureCounts instead of STAR and Salmon). At this stage, we also re-evaluate whether both organism types should be handled in the same pipeline

🧬 Updates will come as soon as we can fit them into our schedule, thanks for your patience 🚀

Sep 15 '25 15:09 Juassis

Improvements for processing prokaryotic datasets using the nf-core/rnaseq pipeline

Description of feature

Conclusion

Pipeline testing

Error message

Steps to reproduce the error

Recommendations for v3.x pipeline development

The following points are subject to discussion with the nf-core community.

🧬 Running STAR + Salmon for E. coli Genomes

❌ What is the error?

🚧 Where is the problem in nf-core/rnaseq?

🛠️ How did we fix the RSEM limitation for bacteria?

📋 Steps to Reproduce

🔧 Fixing the GTF file with AGAT

🚀 Running STAR + Salmon

📊 Summary of the Analysis

⚠️ Important Considerations

🦠 We only tested it with E. coli, our goal is to test with more genomes.

🔭 What are we doing now?