rnaseq
rnaseq copied to clipboard
Salmon quantification output is missing genes when using gff3 annotation
Description of the bug
Trying to use a gff3 annotation for the pipeline, the output table is missing most of the genes.
I suspect it's something with the gff to gtf conversion with gffread, possibly using only entries with 'exon' lines
Command used and terminal output
nextflow run \
nf-core/rnaseq -r 3.18.0 \
-profile singularity \
-resume \
--input samplesheet.csv \
--outdir results_rnaseq \
--gff E_coli_BW25113_annotation.gff \
--fasta E_coli_BW25113_genome.fasta \
--igenomes_ignore \
--genome null \
--skip_biotype_qc \
--skip_rseqc
Relevant files
E_coli_BW25113_annotation.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region CP009273.1 1 4631469
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=679895
CP009273.1 Genbank region 1 4631469 . + . ID=CP009273.1:1..4631469;Dbxref=taxon:679895;Is_circular=true;Name=ANONYMOUS;collection-date=2000;country=USA: Indiana;culture-collection=CGSC:7636;gbkey=Src;genome=chromosome;genotype=rrnB3 lacZ4787 hsdR514 (araBAD)567 (rhaBAD)568 rph-1;mol_type=genomic DNA;note=from B. L. Wanner laboratory;strain=K-12;substrain=BW25113
CP009273.1 Genbank gene 190 255 . + . ID=gene-BW25113_0001;Name=thrL;gbkey=Gene;gene=thrL;gene_biotype=protein_coding;gene_synonym=ECK0001,JW4367;locus_tag=BW25113_0001
CP009273.1 Genbank CDS 190 255 . + 0 ID=cds-AIN30539.1;Parent=gene-BW25113_0001;Dbxref=NCBI_GP:AIN30539.1;Name=AIN30539.1;gbkey=CDS;gene=thrL;locus_tag=BW25113_0001;product=thr operon leader peptide;protein_id=AIN30539.1;transl_table=11
CP009273.1 Genbank gene 337 2799 . + . ID=gene-BW25113_0002;Name=thrA;gbkey=Gene;gene=thrA;gene_biotype=protein_coding;gene_synonym=ECK0002,Hs,JW0001,thrA1,thrA2,thrD;locus_tag=BW25113_0002
CP009273.1 Genbank CDS 337 2799 . + 0 ID=cds-AIN30540.1;Parent=gene-BW25113_0002;Dbxref=NCBI_GP:AIN30540.1;Name=AIN30540.1;Note=bifunctional: aspartokinase I (N-terminal)%3B homoserine dehydrogenase I (C-terminal);experiment=N-terminus verified by Edman degradation:PMID 354697%2C4562989;gbkey=CDS;gene=thrA;locus_tag=BW25113_0002;product=Bifunctional aspartokinase/homoserine dehydrogenase 1;protein_id=AIN30540.1;transl_table=11
...
contains 4k+ genes
tx2gene.tsv has 4491 lines
salmon.merged.gene_counts.tsv
gene_id gene_name dwaaC_minusCaL_i dwaaC_minusCaL_ii dwaaC_minusCaL_iii dwaaC_plusCaL_i dwaaC_plusCaL_ii dwaaC_plusCaL_iii WT_minusCaL_i WT_minusCaL_iiWT_minusCaL_iii WT_plusCaL_i WT_plusCaL_ii WT_plusCaL_iii
gene-BW25113_0201 rrsH 32694.197 31574.414 38547.193 30267.249 52068.988 31854.076 26227.147 20798.888 192851.168 16902.049 19981.584 19476.323
gene-BW25113_0202 ileV 0 0 0 0 0 0 0 9 60 0 12
gene-BW25113_0203 alaV 38.729 18.365 31.365 63.391 58.096 24.644 24.578 0 10.59 38.481 34.433 0
gene-BW25113_0204 rrlH 92145.698 84139.543 113628.01 74832.386 110317.385 69206.269 111249.675 84373.659 397041.287 53441.406 53113.014 52625.037
gene-BW25113_0205 rrfH 193.193 96.878 0 128.627 274.961 160.568 89.404 184.567 623.901 87.854 0 88.185
gene-BW25113_0206 aspU 123.398 0 0 116.261 121.072 107.542 56.663 57.32225.359 71.551 0 96.843
gene-BW25113_0216 aspV 123.832 178 249 33.574 0 0.001 56.632 64.08825.314 0 0 0
gene-BW25113_0244 thrW 33 17 22 25 21 17 6 10 918 12 11
gene-BW25113_0455 ffs 3844 2055 2320 3252 3457 2746 1896 2116 912 1953 2123 2206
...
has only 178 lines
- it is missing all entries from the gff with only 'gene' and 'CDS' lines (most genes)
- includes only the entries with 'gene', 'ncrna' (or other) AND 'exon' lines, e.g.
CP009273.1 Genbank gene 186199 186334 . + . ID=gene-BW25113_4414;Name=tff;gbkey=Gene;gene=tff;gene_biotype=ncRNA;gene_synonym=ECK0167,JWR0225,t44;locus_tag=BW25113_4414
CP009273.1 Genbank ncRNA 186199 186334 . + . ID=rna-BW25113_4414;Parent=gene-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown
CP009273.1 Genbank exon 186199 186334 . + . ID=exon-BW25113_4414-1;Parent=rna-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown
E_coli_BW25113_genome.filtered.gtf The gtf produced by the pipeline looks like this: Note all of these entries are missing in the Salmon table (possibly because of the missing 'exon' line?)
CP009273.1 Genbank transcript 190 255 . + . transcript_id "gene-BW25113_0001"; gene_id "gene-BW25113_0001"; gene_name "thrL"; Name "thrL"; gbkey "Gene"; gene "thrL"; gene_biotype "protein_coding"; gene_synonym "ECK0001,JW4367"; locus_tag "BW25113_0001"
CP009273.1 Genbank CDS 190 255 . + 0 transcript_id "gene-BW25113_0001"; gene_name "thrL"; Dbxref "NCBI_GP:AIN30539.1"; Name "AIN30539.1"; gbkey "CDS"; gene "thrL"; locus_tag "BW25113_0001"; product "thr operon leader peptide"; protein_id "AIN30539.1"; transl_table "11";
CP009273.1 Genbank transcript 337 2799 . + . transcript_id "gene-BW25113_0002"; gene_id "gene-BW25113_0002"; gene_name "thrA"; Name "thrA"; gbkey "Gene"; gene "thrA"; gene_biotype "protein_coding"; gene_synonym "ECK0002,Hs,JW0001,thrA1,thrA2,thrD"; locus_tag "BW25113_0002"
CP009273.1 Genbank CDS 337 2799 . + 0 transcript_id "gene-BW25113_0002"; gene_name "thrA"; Dbxref "NCBI_GP:AIN30540.1"; Name "AIN30540.1"; Note "bifunctional: aspartokinase I (N-terminal)%3B homoserine dehydrogenase I (C-terminal)"; experiment "N-terminus verified by Edman degradation:PMID 354697%2C4562989"; gbkey "CDS"; gene "thrA"; locus_tag "BW25113_0002"; product "Bifunctional aspartokinase/homoserine dehydrogenase 1"; protein_id "AIN30540.1"; transl_table "11";
CP009273.1 Genbank transcript 2801 3733 . + . transcript_id "gene-BW25113_0003"; gene_id "gene-BW25113_0003"; gene_name "thrB"; Name "thrB"; gbkey "Gene"; gene "thrB"; gene_biotype "protein_coding"; gene_synonym "ECK0003,JW0002"; locus_tag "BW25113_0003"
CP009273.1 Genbank CDS 2801 3733 . + 0 transcript_id "gene-BW25113_0003"; gene_name "thrB"; Dbxref "NCBI_GP:AIN30541.1"; Name "AIN30541.1"; gbkey "CDS"; gene "thrB"; locus_tag "BW25113_0003"; product "homoserine kinase"; protein_id "AIN30541.1"; transl_table "11";
CP009273.1 Genbank transcript 3734 5020 . + . transcript_id "gene-BW25113_0004"; gene_id "gene-BW25113_0004"; gene_name "thrC"; Name "thrC"; gbkey "Gene"; gene "thrC"; gene_biotype "protein_coding"; gene_synonym "ECK0004,JW0003"; locus_tag "BW25113_0004"
CP009273.1 Genbank CDS 3734 5020 . + 0 transcript_id "gene-BW25113_0004"; gene_name "thrC"; Dbxref "NCBI_GP:AIN30542.1"; Name "AIN30542.1"; experiment "N-terminus verified by Edman degradation:PMID 9298646%2C9600841%2C9740056"; gbkey "CDS"; gene "thrC"; locus_tag "BW25113_0004"; product "L-threonine synthase"; protein_id "AIN30542.1"; transl_table "11";
CP009273.1 Genbank transcript 5234 5530 . + . transcript_id "gene-BW25113_0005"; gene_id "gene-BW25113_0005"; gene_name "yaaX"; Name "yaaX"; gbkey "Gene"; gene "yaaX"; gene_biotype "protein_coding"; gene_synonym "ECK0005,JW0004"; locus_tag "BW25113_0005"
CP009273.1 Genbank CDS 5234 5530 . + 0 transcript_id "gene-BW25113_0005"; gene_name "yaaX"; Dbxref "NCBI_GP:AIN30543.1"; Name "AIN30543.1"; gbkey "CDS"; gene "yaaX"; locus_tag "BW25113_0005"; product "DUF2502 family putative periplasmic protein"; protein_id "AIN30543.1"; transl_table "11";
...
I also tried adding 'exon' lines to the gff manually, which resulted in an RSEM error for not finding 'gene_id', possibly due to a broken gffread conversion:
Command error:
INFO: Converting SIF file to temporary sandbox...
rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta
The GTF file might be corrupted!
Stop at line : CP009273.1 Genbank exon 190 255 . + . transcript_id "cds-AIN30539.1"; gene_name "thrL"; Dbxref "NCBI_GP:AIN30539.1"; Name "AIN30539.1"; gbkey "CDS"; gene "thrL"; locus_tag "BW25113_0001"; product "thr operon leader peptide"; protein_id "AIN30539.1"; transl_table "11";
Error Message: Cannot find gene_id!
"rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta" failed! Plase check if you provide correct parameters/options for the pipeline!
INFO: Cleaning up image...
Using gff is highly unstable in the pipeline, maybe another gtf conversion tool is needed
I also converted the gff to gtf with AGAT:
agat_convert_sp_gff2gtf.pl --gff E_coli_BW25113_annotation.gff3 -o E_coli_BW25113_annotation_AGAT.gtf
and the pipeline ran successfully, generating counts for all genes.
It does unfortunately replace the gene names with 'agat-gene-1' etc (though maybe that can be fixed with --gtf_extra_attributes)
EDIT: adding --gtf_extra_attributes gene did indeed rescue correct gene names in the output
Any ideas?
System information
N E X T F L O W ~ version 24.10.4
Launching `https://github.com/nf-core/rnaseq` [suspicious_nobel] DSL2 - revision: b96a75361a [3.18.0]
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/rnaseq 3.18.0
------------------------------------------------------
Input/output options
input : samplesheet.csv
outdir : results_rnaseq
Reference genome options
genome : null
fasta : E_coli_BW25113_genome.fasta
gff : E_coli_BW25113_annotation.gff
igenomes_ignore: true
Process skipping options
skip_rseqc : true
skip_biotype_qc: true
Core Nextflow options
revision : 3.18.0
runName : suspicious_nobel
containerEngine: singularity
...
profile : singularity
configFiles :