rnaseq Salmon quantification output is missing genes when using gff3 annotation

Salmon quantification output is missing genes when using gff3 annotation

Open RaverJay opened this issue 7 months ago • 1 comments

Description of the bug

Trying to use a gff3 annotation for the pipeline, the output table is missing most of the genes.

I suspect it's something with the gff to gtf conversion with gffread, possibly using only entries with 'exon' lines

Command used and terminal output

nextflow run \
    nf-core/rnaseq -r 3.18.0 \
    -profile singularity \
    -resume \
    --input samplesheet.csv \
    --outdir results_rnaseq \
    --gff E_coli_BW25113_annotation.gff \
    --fasta E_coli_BW25113_genome.fasta \
    --igenomes_ignore \
    --genome null \
    --skip_biotype_qc \
    --skip_rseqc

Relevant files

E_coli_BW25113_annotation.gff

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region CP009273.1 1 4631469
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=679895
CP009273.1	Genbank	region	1	4631469	.	+	.	ID=CP009273.1:1..4631469;Dbxref=taxon:679895;Is_circular=true;Name=ANONYMOUS;collection-date=2000;country=USA: Indiana;culture-collection=CGSC:7636;gbkey=Src;genome=chromosome;genotype=rrnB3 lacZ4787 hsdR514 (araBAD)567 (rhaBAD)568 rph-1;mol_type=genomic DNA;note=from B. L. Wanner laboratory;strain=K-12;substrain=BW25113
CP009273.1	Genbank	gene	190	255	.	+	.	ID=gene-BW25113_0001;Name=thrL;gbkey=Gene;gene=thrL;gene_biotype=protein_coding;gene_synonym=ECK0001,JW4367;locus_tag=BW25113_0001
CP009273.1	Genbank	CDS	190	255	.	+	0	ID=cds-AIN30539.1;Parent=gene-BW25113_0001;Dbxref=NCBI_GP:AIN30539.1;Name=AIN30539.1;gbkey=CDS;gene=thrL;locus_tag=BW25113_0001;product=thr operon leader peptide;protein_id=AIN30539.1;transl_table=11
CP009273.1	Genbank	gene	337	2799	.	+	.	ID=gene-BW25113_0002;Name=thrA;gbkey=Gene;gene=thrA;gene_biotype=protein_coding;gene_synonym=ECK0002,Hs,JW0001,thrA1,thrA2,thrD;locus_tag=BW25113_0002
CP009273.1	Genbank	CDS	337	2799	.	+	0	ID=cds-AIN30540.1;Parent=gene-BW25113_0002;Dbxref=NCBI_GP:AIN30540.1;Name=AIN30540.1;Note=bifunctional: aspartokinase I (N-terminal)%3B homoserine dehydrogenase I (C-terminal);experiment=N-terminus verified by Edman degradation:PMID 354697%2C4562989;gbkey=CDS;gene=thrA;locus_tag=BW25113_0002;product=Bifunctional aspartokinase/homoserine dehydrogenase 1;protein_id=AIN30540.1;transl_table=11
...

contains 4k+ genes

tx2gene.tsv has 4491 lines

salmon.merged.gene_counts.tsv

gene_id	gene_name	dwaaC_minusCaL_i	dwaaC_minusCaL_ii	dwaaC_minusCaL_iii	dwaaC_plusCaL_i	dwaaC_plusCaL_ii	dwaaC_plusCaL_iii	WT_minusCaL_i	WT_minusCaL_iiWT_minusCaL_iii	WT_plusCaL_i	WT_plusCaL_ii	WT_plusCaL_iii
gene-BW25113_0201	rrsH	32694.197	31574.414	38547.193	30267.249	52068.988	31854.076	26227.147	20798.888	192851.168	16902.049	19981.584	19476.323
gene-BW25113_0202	ileV	0	0	0	0	0	0	0	9	60	0	12
gene-BW25113_0203	alaV	38.729	18.365	31.365	63.391	58.096	24.644	24.578	0	10.59	38.481	34.433	0
gene-BW25113_0204	rrlH	92145.698	84139.543	113628.01	74832.386	110317.385	69206.269	111249.675	84373.659	397041.287	53441.406	53113.014	52625.037
gene-BW25113_0205	rrfH	193.193	96.878	0	128.627	274.961	160.568	89.404	184.567	623.901	87.854	0	88.185
gene-BW25113_0206	aspU	123.398	0	0	116.261	121.072	107.542	56.663	57.32225.359	71.551	0	96.843
gene-BW25113_0216	aspV	123.832	178	249	33.574	0	0.001	56.632	64.08825.314	0	0	0
gene-BW25113_0244	thrW	33	17	22	25	21	17	6	10	918	12	11
gene-BW25113_0455	ffs	3844	2055	2320	3252	3457	2746	1896	2116	912	1953	2123	2206
...

has only 178 lines

it is missing all entries from the gff with only 'gene' and 'CDS' lines (most genes)
includes only the entries with 'gene', 'ncrna' (or other) AND 'exon' lines, e.g.

CP009273.1	Genbank	gene	186199	186334	.	+	.	ID=gene-BW25113_4414;Name=tff;gbkey=Gene;gene=tff;gene_biotype=ncRNA;gene_synonym=ECK0167,JWR0225,t44;locus_tag=BW25113_4414
CP009273.1	Genbank	ncRNA	186199	186334	.	+	.	ID=rna-BW25113_4414;Parent=gene-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown
CP009273.1	Genbank	exon	186199	186334	.	+	.	ID=exon-BW25113_4414-1;Parent=rna-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown

E_coli_BW25113_genome.filtered.gtf The gtf produced by the pipeline looks like this: Note all of these entries are missing in the Salmon table (possibly because of the missing 'exon' line?)

CP009273.1	Genbank	transcript	190	255	.	+	.	transcript_id "gene-BW25113_0001"; gene_id "gene-BW25113_0001"; gene_name "thrL"; Name "thrL"; gbkey "Gene"; gene "thrL"; gene_biotype "protein_coding"; gene_synonym "ECK0001,JW4367"; locus_tag "BW25113_0001"
CP009273.1	Genbank	CDS	190	255	.	+	0	transcript_id "gene-BW25113_0001"; gene_name "thrL"; Dbxref "NCBI_GP:AIN30539.1"; Name "AIN30539.1"; gbkey "CDS"; gene "thrL"; locus_tag "BW25113_0001"; product "thr operon leader peptide"; protein_id "AIN30539.1"; transl_table "11";
CP009273.1	Genbank	transcript	337	2799	.	+	.	transcript_id "gene-BW25113_0002"; gene_id "gene-BW25113_0002"; gene_name "thrA"; Name "thrA"; gbkey "Gene"; gene "thrA"; gene_biotype "protein_coding"; gene_synonym "ECK0002,Hs,JW0001,thrA1,thrA2,thrD"; locus_tag "BW25113_0002"
CP009273.1	Genbank	CDS	337	2799	.	+	0	transcript_id "gene-BW25113_0002"; gene_name "thrA"; Dbxref "NCBI_GP:AIN30540.1"; Name "AIN30540.1"; Note "bifunctional: aspartokinase I (N-terminal)%3B homoserine dehydrogenase I (C-terminal)"; experiment "N-terminus verified by Edman degradation:PMID 354697%2C4562989"; gbkey "CDS"; gene "thrA"; locus_tag "BW25113_0002"; product "Bifunctional aspartokinase/homoserine dehydrogenase 1"; protein_id "AIN30540.1"; transl_table "11";
CP009273.1	Genbank	transcript	2801	3733	.	+	.	transcript_id "gene-BW25113_0003"; gene_id "gene-BW25113_0003"; gene_name "thrB"; Name "thrB"; gbkey "Gene"; gene "thrB"; gene_biotype "protein_coding"; gene_synonym "ECK0003,JW0002"; locus_tag "BW25113_0003"
CP009273.1	Genbank	CDS	2801	3733	.	+	0	transcript_id "gene-BW25113_0003"; gene_name "thrB"; Dbxref "NCBI_GP:AIN30541.1"; Name "AIN30541.1"; gbkey "CDS"; gene "thrB"; locus_tag "BW25113_0003"; product "homoserine kinase"; protein_id "AIN30541.1"; transl_table "11";
CP009273.1	Genbank	transcript	3734	5020	.	+	.	transcript_id "gene-BW25113_0004"; gene_id "gene-BW25113_0004"; gene_name "thrC"; Name "thrC"; gbkey "Gene"; gene "thrC"; gene_biotype "protein_coding"; gene_synonym "ECK0004,JW0003"; locus_tag "BW25113_0004"
CP009273.1	Genbank	CDS	3734	5020	.	+	0	transcript_id "gene-BW25113_0004"; gene_name "thrC"; Dbxref "NCBI_GP:AIN30542.1"; Name "AIN30542.1"; experiment "N-terminus verified by Edman degradation:PMID 9298646%2C9600841%2C9740056"; gbkey "CDS"; gene "thrC"; locus_tag "BW25113_0004"; product "L-threonine synthase"; protein_id "AIN30542.1"; transl_table "11";
CP009273.1	Genbank	transcript	5234	5530	.	+	.	transcript_id "gene-BW25113_0005"; gene_id "gene-BW25113_0005"; gene_name "yaaX"; Name "yaaX"; gbkey "Gene"; gene "yaaX"; gene_biotype "protein_coding"; gene_synonym "ECK0005,JW0004"; locus_tag "BW25113_0005"
CP009273.1	Genbank	CDS	5234	5530	.	+	0	transcript_id "gene-BW25113_0005"; gene_name "yaaX"; Dbxref "NCBI_GP:AIN30543.1"; Name "AIN30543.1"; gbkey "CDS"; gene "yaaX"; locus_tag "BW25113_0005"; product "DUF2502 family putative periplasmic protein"; protein_id "AIN30543.1"; transl_table "11";
...

I also tried adding 'exon' lines to the gff manually, which resulted in an RSEM error for not finding 'gene_id', possibly due to a broken gffread conversion:

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta
  The GTF file might be corrupted!
  Stop at line : CP009273.1	Genbank	exon	190	255	.	+	.	transcript_id "cds-AIN30539.1"; gene_name "thrL"; Dbxref "NCBI_GP:AIN30539.1"; Name "AIN30539.1"; gbkey "CDS"; gene "thrL"; locus_tag "BW25113_0001"; product "thr operon leader peptide"; protein_id "AIN30539.1"; transl_table "11";
  Error Message: Cannot find gene_id!
  "rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta" failed! Plase check if you provide correct parameters/options for the pipeline!
  INFO:    Cleaning up image...

Using gff is highly unstable in the pipeline, maybe another gtf conversion tool is needed

I also converted the gff to gtf with AGAT: agat_convert_sp_gff2gtf.pl --gff E_coli_BW25113_annotation.gff3 -o E_coli_BW25113_annotation_AGAT.gtf and the pipeline ran successfully, generating counts for all genes. It does unfortunately replace the gene names with 'agat-gene-1' etc (though maybe that can be fixed with --gtf_extra_attributes) EDIT: adding --gtf_extra_attributes gene did indeed rescue correct gene names in the output

Any ideas?

System information

N E X T F L O W   ~  version 24.10.4

Launching `https://github.com/nf-core/rnaseq` [suspicious_nobel] DSL2 - revision: b96a75361a [3.18.0]


------------------------------------------------------
                                       ,--./,-.
       ___     __   __   __   ___     /,-._.--~'
 |\ | |__  __ /  ` /  \ |__) |__         }  {
 | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                       `._,._,'
 nf-core/rnaseq 3.18.0
------------------------------------------------------
Input/output options
 input          : samplesheet.csv
 outdir         : results_rnaseq

Reference genome options
 genome         : null
 fasta          : E_coli_BW25113_genome.fasta
 gff            : E_coli_BW25113_annotation.gff
 igenomes_ignore: true

Process skipping options
 skip_rseqc     : true
 skip_biotype_qc: true

Core Nextflow options
 revision       : 3.18.0
 runName        : suspicious_nobel
 containerEngine: singularity
...
 profile        : singularity
 configFiles    :

Mar 24 '25 14:03 RaverJay

rnaseq rnaseq copied to clipboard

Salmon quantification output is missing genes when using gff3 annotation

Description of the bug

Command used and terminal output

Relevant files

System information

rnaseq
rnaseq copied to clipboard