AWS-iGenomes

Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf 'gene_id' field refers to gene symbol, not Ensembl id

Open tylergross97 opened this issue 6 months ago • 2 comments

I tried running nf-core/tfactivity -r dev with --genomes GRCh38. It successfully stages the genes.gtf and genome.fa files, but I get an error that gene_id is not a valid field. On inspecting the staged .gtf file, both the gene_id and gene_name fields appear to contain the gene symbol:

tgross2@login1:/projects/academic/rpili/HDACi_anti-PD1/scripts/integrated/tfactivity/work/d9/68edbb8a97b70f95a3253e42abe9d3$ head -n 3 genes.gtf 
chr1	BestRefSeq	exon	11874	12227	.	+	.	gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
chr1	BestRefSeq	exon	12613	12721	.	+	.	gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
chr1	BestRefSeq	exon	13221	14409	.	+	.	gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";

Running nf-core/tfactivity with -profile test,singularity succeeded, and the .gtf file used there has a gene_id field that corresponds to the Ensembl ID:

tgross2@login1:/projects/academic/rpili/HDACi_anti-PD1/scripts/integrated/tfactivity/work/9f/05391a7625e85c9d2a9f9a8f7b0454$ head -n 3 chr1.gtf 
chr1	HAVANA	gene	3073253	3074322	.	+	.	gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1	HAVANA	transcript	3073253	3074322	.	+	.	gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1	HAVANA	exon	3073253	3074322	.	+	.	gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

Any recommendations on how to resolve this?

tylergross97 avatar Jun 05 '25 16:06 tylergross97

Here's the referenced issue: https://github.com/nf-core/tfactivity/issues/17

tylergross97 avatar Jun 05 '25 17:06 tylergross97

Hey, the problem was that tfactivity expected gene feature rows to be present in the GTF file. That is not the case for this file, nor for some (possibly many) other iGenomes annotations. I know iGenomes is outdated and should not be used, and this issue will not be fixed either.
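
For anyone who wants to check their own annotation before running a pipeline, a rough diagnostic might look like the sketch below. It counts the feature types in column 3 and reports whether any `gene` rows exist, plus how many gene_id values look Ensembl-like. The function name `summarize_gtf` and the Ensembl-ID regex are illustrative assumptions, not part of tfactivity or iGenomes.

```python
import re
from collections import Counter

# Assumption: Ensembl gene IDs look like ENSG..., ENSMUSG..., optionally versioned.
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*G\d+(\.\d+)?$")
# Matches quoted GTF attributes such as: gene_id "DDX11L1";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf(lines):
    """Summarize feature types and gene_id style for an iterable of GTF lines."""
    feature_counts = Counter()
    gene_ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        feature_counts[fields[2]] += 1
        attrs = dict(ATTR.findall(fields[8]))
        if "gene_id" in attrs:
            gene_ids.add(attrs["gene_id"])
    ensembl_like = sum(1 for g in gene_ids if ENSEMBL_ID.match(g))
    return {
        "feature_counts": dict(feature_counts),
        "has_gene_rows": feature_counts["gene"] > 0,
        "gene_ids": len(gene_ids),
        "ensembl_like_gene_ids": ensembl_like,
    }
```

Calling `summarize_gtf(open("genes.gtf"))` on the staged file should show whether `gene` rows are missing, which is the condition described above.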

However, for users of tfactivity I fixed it by using agat_convertspgxf2gxf to infer the missing rows.
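
To illustrate what "inferring the missing rows" means conceptually, here is a minimal sketch that derives one `gene` row per gene_id from the exon rows by taking the smallest start and largest end. This is not how AGAT is implemented and is no substitute for it. The helper name `infer_gene_rows` is hypothetical, and grouping by (chromosome, strand, gene_id) is a simplifying assumption.

```python
import re

# Matches quoted GTF attributes such as: gene_id "DDX11L1";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def infer_gene_rows(lines):
    """Build 'gene' rows spanning all exons that share a gene_id (illustrative only)."""
    spans = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        attrs = dict(ATTR.findall(f[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Assumption: a gene is identified by chromosome, strand and gene_id.
        key = (f[0], f[6], gene_id)
        start, end = int(f[3]), int(f[4])
        if key in spans:
            span = spans[key]
            span["start"] = min(span["start"], start)
            span["end"] = max(span["end"], end)
        else:
            spans[key] = {
                "start": start,
                "end": end,
                "source": f[1],
                "gene_name": attrs.get("gene_name"),
            }

    rows = []
    for (chrom, strand, gene_id), span in spans.items():
        attr = f'gene_id "{gene_id}";'
        gene_name = span["gene_name"]
        if gene_name:
            attr += f' gene_name "{gene_name}";'
        rows.append("\t".join([
            chrom, span["source"], "gene",
            str(span["start"]), str(span["end"]),
            ".", strand, ".", attr,
        ]))
    return rows
```

For the three DDX11L1 exon lines shown in the original post, this would yield a single `gene` row spanning 11874 to 14409. For real data, the AGAT-based fix mentioned above is the approach actually used for tfactivity.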

nictru avatar Jul 11 '25 18:07 nictru