Documentation update: Running the rnaseq pipeline on prokaryotic samples
Description of feature
The documentation of the rnaseq pipeline that refers to running the pipeline on prokaryotic samples is unfortunately completely outdated. It specifies the required settings for featureCounts, even though that tool has been superseded by salmon for transcript quantification about 10 pipeline releases ago!
On Slack, Marine Cambon has already kindly written up what is needed to run the more recent versions of the pipeline successfully with prokaryotic RNA-seq. However, somebody needs to update the pipeline documentation accordingly. Since this requires only Markdown edits, I think it is a suitable task for the Hackathon?
For some general recommendations on how to write good technical documentation, see the website of the Diátaxis framework.
As of now (3.18.0) the documentation didnt change much. Also, I couldn't run the pipeline following these suggestions recently. I was using data for E. coli strain BW25113 (4522 genes, 4419 CDS, 121 exon) of the NCBI RefSeq annotation from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000750555.1/.
The following seems important:
- only "exon" are measured, therefore all features of interest need to have an "exon"
- there may not be any special signs or maybe even not space in column number 2 (and often is)
- gene_id are neccessary for each "exon" entry
The simplest solution I found for now was using the annotation in gff format and convert & simplify with gffread. Additionally gene_id have to be added then. This can be achieved as below.
gffread genomic.gff --keep-exon-attrs -F -T --force-exons | sed '/gene_id/!s/transcript_id\([^;]*\)/&; \0/;/transcript_id.*transcript_id/s/transcript_id/gene_id/' > genomic_gffread_force-exons_geneid.gtf
A similar discussion of the problem can be found in slack https://nfcore.slack.com/archives/CE8SSJV3N/p1679577061847429?thread_ts=1677835193.447669&cid=CE8SSJV3N
This solution should be applicable broadly, but I am not sure, but if that is indeed helpful then it might make sense to add that info to the docs.
I just stumbled upon this issue while searching through Slack (see my own thread here).
I'd like to add the following info here:
- My current understanding is that in cases where the automatic transcript file generation is lacking, this can be done manually, but during the STAR index creation the feature type needs to be explicitly set to the relevant one (
--sjdbGTFfeatureExon=exon by default), in order for the trascriptome-aligned bam files to match any manually created transcript fasta (e.g. when your GTF lacks exons and you rely on CDS entries instead). - While this slack write-up by Marine Cambon indicates that rsem-prepare-reference uses the
exonfeature entries of the GTF file, but the authors ofrsemmention that both (?)transcriptandexonlines are used here: https://groups.google.com/forum/#!topic/rsem-users/GoWPrFzMHMI. I can't figure out which ones matter, - This appears to not just be a problem for prokaryotes, but also for eukaryotes where the GTF file consists of mainly
geneandCDSfeatures, the latter being the only mandatory feature (besidesstart_codonandstop_codon) according to https://agat.readthedocs.io/en/latest/gxf.html#gtf2-2. E.g., I encountered this for an (eukaryotic) GenBank annotation of P. vivax. In this case, it appears thatgeneis used as the top-level feature, withtranscriptonly being used for a few t/rRNAs. ~So that would mean I'd need to change thesjdbGTFtagExonParentTranscriptflag togeneinstead of the defaulttranscript, instead of changingsjdbGTFfeatureExon.~ As was pointed out by @tdanhorn on Slack: it's not the transcript lines (feature_type) in the GTF that are used, but the transcript_id attribute in the exon lines.