Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf 'gene_id' field refers to gene symbol, not Ensembl id
I tried running nf-core/tfactivity -r dev with --genomes GRCh38. The genes.gtf and genome.fa files are staged successfully, but the pipeline fails with an error saying that gene_id is not a valid field. On inspecting the staged .gtf file, it appears that the gene_id and gene_name fields both contain the gene symbol:
tgross2@login1:/projects/academic/rpili/HDACi_anti-PD1/scripts/integrated/tfactivity/work/d9/68edbb8a97b70f95a3253e42abe9d3$ head -n 3 genes.gtf
chr1 BestRefSeq exon 11874 12227 . + . gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
chr1 BestRefSeq exon 12613 12721 . + . gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
chr1 BestRefSeq exon 13221 14409 . + . gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
Running nf-core/tfactivity with -profile test,singularity succeeded, and the .gtf file used in that run has a gene_id field that corresponds to the Ensembl ID:
tgross2@login1:/projects/academic/rpili/HDACi_anti-PD1/scripts/integrated/tfactivity/work/9f/05391a7625e85c9d2a9f9a8f7b0454$ head -n 3 chr1.gtf
chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1 HAVANA exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
Any recommendations on how to resolve this?
Here's the referenced issue: https://github.com/nf-core/tfactivity/issues/17
Hey, the problem was that tfactivity expected gene feature rows (rows whose feature column is "gene") to be present in the GTF file. That is not the case for this file, nor for many other iGenomes annotations. I know iGenomes is outdated and should not be used, and this issue will not be fixed on the iGenomes side.
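For anyone who wants to check their own annotation for this, here is a minimal Python sketch that counts the feature types in the third GTF column. The function names and the toy records are made up for illustration (the records are modelled on the two files quoted above, assuming tab-separated columns as in a real GTF); this is not code from tfactivity.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count the values of the third GTF column (feature type)."""
    counts: Counter = Counter()
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        counts[fields[2]] += 1
    return counts


def has_gene_rows(lines: Iterable[str]) -> bool:
    """Return True if at least one record has feature type 'gene'."""
    return count_feature_types(lines)["gene"] > 0


# Hypothetical toy records modelled on the files quoted in this thread.
IGENOMES_LIKE = [
    'chr1\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tgene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0";',
    'chr1\tBestRefSeq\texon\t12613\t12721\t.\t+\t.\tgene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0";',
]
GENCODE_LIKE = [
    'chr1\tHAVANA\tgene\t3073253\t3074322\t.\t+\t.\tgene_id "ENSMUSG00000102693.1"; gene_name "4933401J01Rik";',
    'chr1\tHAVANA\texon\t3073253\t3074322\t.\t+\t.\tgene_id "ENSMUSG00000102693.1"; gene_name "4933401J01Rik";',
]

if __name__ == "__main__":
    print(has_gene_rows(IGENOMES_LIKE))
    print(has_gene_rows(GENCODE_LIKE))
```

To run it on a real file, pass the open file handle instead of a list, e.g. `with open("genes.gtf") as fh: print(count_feature_types(fh))`.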
However, for users of tfactivity, I fixed it by using agat_convertspgxf2gxf to infer the missing rows.
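To illustrate what "inferring the missing rows" means, here is a simplified Python sketch that builds one gene row per gene_id by spanning the smallest start and largest end of its existing rows. This is only an illustration under my own assumptions, not how AGAT implements it (AGAT does considerably more, such as handling missing transcript-level rows), and the function name and toy data are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def infer_gene_rows(lines: Iterable[str]) -> List[str]:
    """Build one synthetic 'gene' row per (chrom, strand, gene_id)."""
    spans: Dict[Tuple[str, str, str], dict] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Ignore malformed records and gene rows that already exist.
        if len(fields) < 9 or fields[2] == "gene":
            continue
        id_match = GENE_ID_RE.search(fields[8])
        if id_match is None:
            continue
        name_match = GENE_NAME_RE.search(fields[8])
        key = (fields[0], fields[6], id_match.group(1))
        start, end = int(fields[3]), int(fields[4])
        span = spans.setdefault(key, {
            "start": start,
            "end": end,
            "source": fields[1],
            "name": name_match.group(1) if name_match else None,
        })
        # Widen the span to cover every child feature seen so far.
        span["start"] = min(span["start"], start)
        span["end"] = max(span["end"], end)

    rows = []
    for (chrom, strand, gene_id), span in spans.items():
        attrs = 'gene_id "%s";' % gene_id
        if span["name"]:
            attrs += ' gene_name "%s";' % span["name"]
        rows.append("\t".join([
            chrom, span["source"], "gene",
            str(span["start"]), str(span["end"]),
            ".", strand, ".", attrs,
        ]))
    return rows


# Hypothetical toy input modelled on the exon rows quoted above.
EXONS = [
    'chr1\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tgene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0";',
    'chr1\tBestRefSeq\texon\t12613\t12721\t.\t+\t.\tgene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0";',
    'chr1\tBestRefSeq\texon\t13221\t14409\t.\t+\t.\tgene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0";',
]

if __name__ == "__main__":
    for row in infer_gene_rows(EXONS):
        print(row)
```

One caveat with this simplification: because the iGenomes gene_id is a gene symbol, two distinct loci that share a symbol on the same chromosome and strand would be merged into one overly long gene row, so a dedicated tool is the safer choice for real data.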