rnaseq icon indicating copy to clipboard operation
rnaseq copied to clipboard

Salmon quantification output is missing genes when using gff3 annotation

Open RaverJay opened this issue 7 months ago • 1 comments

Description of the bug

Trying to use a gff3 annotation for the pipeline, the output table is missing most of the genes.

I suspect it's something with the gff to gtf conversion with gffread, possibly using only entries with 'exon' lines

Command used and terminal output

nextflow run \
    nf-core/rnaseq -r 3.18.0 \
    -profile singularity \
    -resume \
    --input samplesheet.csv \
    --outdir results_rnaseq \
    --gff E_coli_BW25113_annotation.gff \
    --fasta E_coli_BW25113_genome.fasta \
    --igenomes_ignore \
    --genome null \
    --skip_biotype_qc \
    --skip_rseqc

Relevant files

E_coli_BW25113_annotation.gff

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region CP009273.1 1 4631469
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=679895
CP009273.1	Genbank	region	1	4631469	.	+	.	ID=CP009273.1:1..4631469;Dbxref=taxon:679895;Is_circular=true;Name=ANONYMOUS;collection-date=2000;country=USA: Indiana;culture-collection=CGSC:7636;gbkey=Src;genome=chromosome;genotype=rrnB3 lacZ4787 hsdR514 (araBAD)567 (rhaBAD)568 rph-1;mol_type=genomic DNA;note=from B. L. Wanner laboratory;strain=K-12;substrain=BW25113
CP009273.1	Genbank	gene	190	255	.	+	.	ID=gene-BW25113_0001;Name=thrL;gbkey=Gene;gene=thrL;gene_biotype=protein_coding;gene_synonym=ECK0001,JW4367;locus_tag=BW25113_0001
CP009273.1	Genbank	CDS	190	255	.	+	0	ID=cds-AIN30539.1;Parent=gene-BW25113_0001;Dbxref=NCBI_GP:AIN30539.1;Name=AIN30539.1;gbkey=CDS;gene=thrL;locus_tag=BW25113_0001;product=thr operon leader peptide;protein_id=AIN30539.1;transl_table=11
CP009273.1	Genbank	gene	337	2799	.	+	.	ID=gene-BW25113_0002;Name=thrA;gbkey=Gene;gene=thrA;gene_biotype=protein_coding;gene_synonym=ECK0002,Hs,JW0001,thrA1,thrA2,thrD;locus_tag=BW25113_0002
CP009273.1	Genbank	CDS	337	2799	.	+	0	ID=cds-AIN30540.1;Parent=gene-BW25113_0002;Dbxref=NCBI_GP:AIN30540.1;Name=AIN30540.1;Note=bifunctional: aspartokinase I (N-terminal)%3B homoserine dehydrogenase I (C-terminal);experiment=N-terminus verified by Edman degradation:PMID 354697%2C4562989;gbkey=CDS;gene=thrA;locus_tag=BW25113_0002;product=Bifunctional aspartokinase/homoserine dehydrogenase 1;protein_id=AIN30540.1;transl_table=11
...

contains 4k+ genes

tx2gene.tsv has 4491 lines

salmon.merged.gene_counts.tsv

gene_id	gene_name	dwaaC_minusCaL_i	dwaaC_minusCaL_ii	dwaaC_minusCaL_iii	dwaaC_plusCaL_i	dwaaC_plusCaL_ii	dwaaC_plusCaL_iii	WT_minusCaL_i	WT_minusCaL_iiWT_minusCaL_iii	WT_plusCaL_i	WT_plusCaL_ii	WT_plusCaL_iii
gene-BW25113_0201	rrsH	32694.197	31574.414	38547.193	30267.249	52068.988	31854.076	26227.147	20798.888	192851.168	16902.049	19981.584	19476.323
gene-BW25113_0202	ileV	0	0	0	0	0	0	0	9	60	0	12
gene-BW25113_0203	alaV	38.729	18.365	31.365	63.391	58.096	24.644	24.578	0	10.59	38.481	34.433	0
gene-BW25113_0204	rrlH	92145.698	84139.543	113628.01	74832.386	110317.385	69206.269	111249.675	84373.659	397041.287	53441.406	53113.014	52625.037
gene-BW25113_0205	rrfH	193.193	96.878	0	128.627	274.961	160.568	89.404	184.567	623.901	87.854	0	88.185
gene-BW25113_0206	aspU	123.398	0	0	116.261	121.072	107.542	56.663	57.32225.359	71.551	0	96.843
gene-BW25113_0216	aspV	123.832	178	249	33.574	0	0.001	56.632	64.08825.314	0	0	0
gene-BW25113_0244	thrW	33	17	22	25	21	17	6	10	918	12	11
gene-BW25113_0455	ffs	3844	2055	2320	3252	3457	2746	1896	2116	912	1953	2123	2206
...

has only 178 lines

  • it is missing all entries from the gff with only 'gene' and 'CDS' lines (most genes)
  • includes only the entries with 'gene', 'ncrna' (or other) AND 'exon' lines, e.g.
CP009273.1	Genbank	gene	186199	186334	.	+	.	ID=gene-BW25113_4414;Name=tff;gbkey=Gene;gene=tff;gene_biotype=ncRNA;gene_synonym=ECK0167,JWR0225,t44;locus_tag=BW25113_4414
CP009273.1	Genbank	ncRNA	186199	186334	.	+	.	ID=rna-BW25113_4414;Parent=gene-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown
CP009273.1	Genbank	exon	186199	186334	.	+	.	ID=exon-BW25113_4414-1;Parent=rna-BW25113_4414;Note=identified in a large scale screen;gbkey=ncRNA;gene=tff;locus_tag=BW25113_4414;product=novel sRNA%2C function unknown

E_coli_BW25113_genome.filtered.gtf The gtf produced by the pipeline looks like this: Note all of these entries are missing in the Salmon table (possibly because of the missing 'exon' line?)

CP009273.1	Genbank	transcript	190	255	.	+	.	transcript_id "gene-BW25113_0001"; gene_id "gene-BW25113_0001"; gene_name "thrL"; Name "thrL"; gbkey "Gene"; gene "thrL"; gene_biotype "protein_coding"; gene_synonym "ECK0001,JW4367"; locus_tag "BW25113_0001"
CP009273.1	Genbank	CDS	190	255	.	+	0	transcript_id "gene-BW25113_0001"; gene_name "thrL"; Dbxref "NCBI_GP:AIN30539.1"; Name "AIN30539.1"; gbkey "CDS"; gene "thrL"; locus_tag "BW25113_0001"; product "thr operon leader peptide"; protein_id "AIN30539.1"; transl_table "11";
CP009273.1	Genbank	transcript	337	2799	.	+	.	transcript_id "gene-BW25113_0002"; gene_id "gene-BW25113_0002"; gene_name "thrA"; Name "thrA"; gbkey "Gene"; gene "thrA"; gene_biotype "protein_coding"; gene_synonym "ECK0002,Hs,JW0001,thrA1,thrA2,thrD"; locus_tag "BW25113_0002"
CP009273.1	Genbank	CDS	337	2799	.	+	0	transcript_id "gene-BW25113_0002"; gene_name "thrA"; Dbxref "NCBI_GP:AIN30540.1"; Name "AIN30540.1"; Note "bifunctional: aspartokinase I (N-terminal)%3B homoserine dehydrogenase I (C-terminal)"; experiment "N-terminus verified by Edman degradation:PMID 354697%2C4562989"; gbkey "CDS"; gene "thrA"; locus_tag "BW25113_0002"; product "Bifunctional aspartokinase/homoserine dehydrogenase 1"; protein_id "AIN30540.1"; transl_table "11";
CP009273.1	Genbank	transcript	2801	3733	.	+	.	transcript_id "gene-BW25113_0003"; gene_id "gene-BW25113_0003"; gene_name "thrB"; Name "thrB"; gbkey "Gene"; gene "thrB"; gene_biotype "protein_coding"; gene_synonym "ECK0003,JW0002"; locus_tag "BW25113_0003"
CP009273.1	Genbank	CDS	2801	3733	.	+	0	transcript_id "gene-BW25113_0003"; gene_name "thrB"; Dbxref "NCBI_GP:AIN30541.1"; Name "AIN30541.1"; gbkey "CDS"; gene "thrB"; locus_tag "BW25113_0003"; product "homoserine kinase"; protein_id "AIN30541.1"; transl_table "11";
CP009273.1	Genbank	transcript	3734	5020	.	+	.	transcript_id "gene-BW25113_0004"; gene_id "gene-BW25113_0004"; gene_name "thrC"; Name "thrC"; gbkey "Gene"; gene "thrC"; gene_biotype "protein_coding"; gene_synonym "ECK0004,JW0003"; locus_tag "BW25113_0004"
CP009273.1	Genbank	CDS	3734	5020	.	+	0	transcript_id "gene-BW25113_0004"; gene_name "thrC"; Dbxref "NCBI_GP:AIN30542.1"; Name "AIN30542.1"; experiment "N-terminus verified by Edman degradation:PMID 9298646%2C9600841%2C9740056"; gbkey "CDS"; gene "thrC"; locus_tag "BW25113_0004"; product "L-threonine synthase"; protein_id "AIN30542.1"; transl_table "11";
CP009273.1	Genbank	transcript	5234	5530	.	+	.	transcript_id "gene-BW25113_0005"; gene_id "gene-BW25113_0005"; gene_name "yaaX"; Name "yaaX"; gbkey "Gene"; gene "yaaX"; gene_biotype "protein_coding"; gene_synonym "ECK0005,JW0004"; locus_tag "BW25113_0005"
CP009273.1	Genbank	CDS	5234	5530	.	+	0	transcript_id "gene-BW25113_0005"; gene_name "yaaX"; Dbxref "NCBI_GP:AIN30543.1"; Name "AIN30543.1"; gbkey "CDS"; gene "yaaX"; locus_tag "BW25113_0005"; product "DUF2502 family putative periplasmic protein"; protein_id "AIN30543.1"; transl_table "11";
...

I also tried adding 'exon' lines to the gff manually, which resulted in an RSEM error for not finding 'gene_id', possibly due to a broken gffread conversion:

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta
  The GTF file might be corrupted!
  Stop at line : CP009273.1	Genbank	exon	190	255	.	+	.	transcript_id "cds-AIN30539.1"; gene_name "thrL"; Dbxref "NCBI_GP:AIN30539.1"; Name "AIN30539.1"; gbkey "CDS"; gene "thrL"; locus_tag "BW25113_0001"; product "thr operon leader peptide"; protein_id "AIN30539.1"; transl_table "11";
  Error Message: Cannot find gene_id!
  "rsem-extract-reference-transcripts rsem/genome 0 E_coli_BW25113_genome.filtered.gtf None 0 rsem/E_coli_BW25113_genome.fasta" failed! Plase check if you provide correct parameters/options for the pipeline!
  INFO:    Cleaning up image...

Using gff is highly unstable in the pipeline, maybe another gtf conversion tool is needed

I also converted the gff to gtf with AGAT: agat_convert_sp_gff2gtf.pl --gff E_coli_BW25113_annotation.gff3 -o E_coli_BW25113_annotation_AGAT.gtf and the pipeline ran successfully, generating counts for all genes. It does unfortunately replace the gene names with 'agat-gene-1' etc (though maybe that can be fixed with --gtf_extra_attributes) EDIT: adding --gtf_extra_attributes gene did indeed rescue correct gene names in the output

Any ideas?

System information

N E X T F L O W   ~  version 24.10.4

Launching `https://github.com/nf-core/rnaseq` [suspicious_nobel] DSL2 - revision: b96a75361a [3.18.0]


------------------------------------------------------
                                       ,--./,-.
       ___     __   __   __   ___     /,-._.--~'
 |\ | |__  __ /  ` /  \ |__) |__         }  {
 | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                       `._,._,'
 nf-core/rnaseq 3.18.0
------------------------------------------------------
Input/output options
 input          : samplesheet.csv
 outdir         : results_rnaseq

Reference genome options
 genome         : null
 fasta          : E_coli_BW25113_genome.fasta
 gff            : E_coli_BW25113_annotation.gff
 igenomes_ignore: true

Process skipping options
 skip_rseqc     : true
 skip_biotype_qc: true

Core Nextflow options
 revision       : 3.18.0
 runName        : suspicious_nobel
 containerEngine: singularity
...
 profile        : singularity
 configFiles    : 

RaverJay avatar Mar 24 '25 14:03 RaverJay