rnaseq Documentation update: Running the rnaseq pipeline on prokaryotic samples

Description of feature

The documentation of the rnaseq pipeline that refers to running the pipeline on prokaryotic samples is unfortunately completely outdated. It specifies the required settings for featureCounts, even though that tool has been superseded by salmon for transcript quantification about 10 pipeline releases ago!

On Slack, Marine Cambon has already kindly written up what is needed to run the more recent versions of the pipeline successfully with prokaryotic RNA-seq. However, somebody needs to update the pipeline documentation accordingly. Since this requires only Markdown edits, I think it is a suitable task for the Hackathon?

For some general recommendations on how to write good technical documentation, see the website of the Diátaxis framework.

Oct 05 '23 11:10 MatthiasZepper

As of now (3.18.0) the documentation didnt change much. Also, I couldn't run the pipeline following these suggestions recently. I was using data for E. coli strain BW25113 (4522 genes, 4419 CDS, 121 exon) of the NCBI RefSeq annotation from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000750555.1/.

The following seems important:

only "exon" are measured, therefore all features of interest need to have an "exon"
there may not be any special signs or maybe even not space in column number 2 (and often is)
gene_id are neccessary for each "exon" entry

The simplest solution I found for now was using the annotation in gff format and convert & simplify with gffread. Additionally gene_id have to be added then. This can be achieved as below.

gffread genomic.gff --keep-exon-attrs -F -T --force-exons | sed '/gene_id/!s/transcript_id\([^;]*\)/&; \0/;/transcript_id.*transcript_id/s/transcript_id/gene_id/' > genomic_gffread_force-exons_geneid.gtf

A similar discussion of the problem can be found in slack https://nfcore.slack.com/archives/CE8SSJV3N/p1679577061847429?thread_ts=1677835193.447669&cid=CE8SSJV3N

This solution should be applicable broadly, but I am not sure, but if that is indeed helpful then it might make sense to add that info to the docs.

Jan 24 '25 10:01 d4straub

I just stumbled upon this issue while searching through Slack (see my own thread here).

I'd like to add the following info here:

My current understanding is that in cases where the automatic transcript file generation is lacking, this can be done manually, but during the STAR index creation the feature type needs to be explicitly set to the relevant one (--sjdbGTFfeatureExon =exon by default), in order for the trascriptome-aligned bam files to match any manually created transcript fasta (e.g. when your GTF lacks exons and you rely on CDS entries instead).
While this slack write-up by Marine Cambon indicates that rsem-prepare-reference uses the exon feature entries of the GTF file, but the authors of rsem mention that both (?) transcript and exon lines are used here: https://groups.google.com/forum/#!topic/rsem-users/GoWPrFzMHMI. I can't figure out which ones matter,
This appears to not just be a problem for prokaryotes, but also for eukaryotes where the GTF file consists of mainly gene and CDS features, the latter being the only mandatory feature (besides start_codon and stop_codon) according to https://agat.readthedocs.io/en/latest/gxf.html#gtf2-2. E.g., I encountered this for an (eukaryotic) GenBank annotation of P. vivax. In this case, it appears that gene is used as the top-level feature, with transcript only being used for a few t/rRNAs. ~So that would mean I'd need to change the sjdbGTFtagExonParentTranscript flag to gene instead of the default transcript, instead of changing sjdbGTFfeatureExon.~ As was pointed out by @tdanhorn on Slack: it's not the transcript lines (feature_type) in the GTF that are used, but the transcript_id attribute in the exon lines.

Feb 11 '25 09:02 pmoris