
Ensembl mapping pipeline - incorrect Y-RNA mapping

Open earlEBI opened this issue 2 years ago • 10 comments

From gwas-info:

"I downloaded the catalog data from your website (gwas-association-downloaded_2022-05-19-EFO_0000729.tsv) and unfortunately recognised, that all occurrences of the gene name Y_RNA are mapped to ENSG00000199357, which is at least in most cases most unlikely (there are several Y_RNA genes across all chromosomes)."

Searching Ensembl I can see ENSG00000199357 is on chr18: https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000199357;r=18:23456371-23456467;t=ENST00000362487

but it appears in the upstream/downstream mapping column for many associations on different chromosomes:

Screenshot 2022-05-19 at 10.08.11.png

earlEBI avatar May 19 '22 10:05 earlEBI

follow-up email from user: "it is also true for the gene SNORD116 which is currently mapped to ENSG00000252985 on chr 9 instead of the correct gene ENSG00000212553 on chr 13."

earlEBI avatar May 19 '22 12:05 earlEBI

@sajo-ebi To know whether this is an issue or not, I need to understand how the UPSTREAM_GENE_ID, DOWNSTREAM_GENE_ID and SNP_GENE_ID columns are generated. I suspect the mapping is done SNP -> MAPPED_GENE -> GENE_ID, but it should be SNP -> GENE_ID directly.

ljwh2 avatar Jun 20 '23 16:06 ljwh2

@sprintell I need the process flow documentation to define this issue properly

ljwh2 avatar Oct 05 '23 16:10 ljwh2

  • Get the location and region mapping for the variant.
  • Get the overlapping genes for each chromosome location.
  • If no overlapping genes are found, the chromosome position is used to decide whether to pull upstream or downstream genes.
  • There is an algorithm to determine the nearest gene when no closest gene is returned from the API call.
  • The genomic context information is created from the upstream, downstream and closest gene information.
  • The IDs are the Ensembl IDs retrieved for the genes.
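
The steps above can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the `Gene` record, the `genomic_context` function and the toy coordinates and IDs are all hypothetical, and it assumes "upstream" means the nearest gene ending before the position and "downstream" the nearest gene starting after it. The point it shows is that the Ensembl ID travels with the gene record found by position, so two genes sharing a name never need to be told apart by name.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    ensembl_id: str  # stable ID carried with the record (toy values below)
    name: str
    chrom: str
    start: int
    end: int


def genomic_context(chrom: str, pos: int, genes: List[Gene]) -> dict:
    """Return overlapping genes, or the nearest flanking genes if none overlap."""
    on_chrom = [g for g in genes if g.chrom == chrom]
    overlapping = [g for g in on_chrom if g.start <= pos <= g.end]
    if overlapping:
        return {"overlapping": overlapping, "upstream": None, "downstream": None}
    # Assumption: upstream = closest gene ending before pos,
    # downstream = closest gene starting after pos.
    before = [g for g in on_chrom if g.end < pos]
    after = [g for g in on_chrom if g.start > pos]
    return {
        "overlapping": [],
        "upstream": max(before, key=lambda g: g.end, default=None),
        "downstream": min(after, key=lambda g: g.start, default=None),
    }


# Made-up genes: two share the name Y_RNA but sit on different chromosomes.
GENES = [
    Gene("ID_A", "Y_RNA", "1", 100, 200),
    Gene("ID_B", "Y_RNA", "2", 100, 200),
    Gene("ID_C", "GENE_C", "1", 500, 600),
]

ctx = genomic_context("1", 300, GENES)
print(ctx["upstream"].ensembl_id, ctx["downstream"].ensembl_id)
```

With this shape, a variant on toy chromosome 2 gets `ID_B` as its upstream gene even though another gene with the same name exists on chromosome 1.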

sajo-ebi avatar Dec 11 '23 00:12 sajo-ebi

Thanks @sajo-ebi. To clarify in step 2, "get overlapping genes", is this the gene name? And then the final step is to retrieve the ensembl id for the retrieved genes?

ljwh2 avatar Dec 11 '23 17:12 ljwh2

@ljwh2 the overlapping genes are the ones that match the chromosome position. The Ensembl ID and gene information are retrieved in the 2nd step itself; the gene information for the upstream and downstream genes is determined after that.

sajo-ebi avatar Dec 13 '23 00:12 sajo-ebi

@ljwh2 to investigate further and provide Sajo with rsIDs to investigate

ljwh2 avatar Dec 13 '23 10:12 ljwh2

Some examples:

rs5758209: SNP maps to genomic location chr22:41065861, upstream gene Y_RNA. Upstream gene ID is ENSG00000201314, but this ID maps to a genomic location on chr4.

rs1705773: SNP maps to genomic location chr12:34016940, upstream gene Y_RNA. Upstream gene ID is ENSG00000201314, which also has its genomic location on chr4.

The problem is that there are several genes all called Y_RNA on different genomic locations and with different gene IDs. But we give the same gene ID for all.
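
A check along these lines could flag affected rows automatically. It is a minimal sketch: the row values are the two examples given above, the chr4 location of ENSG00000201314 is as stated in this comment, and the function name and dictionary layout are hypothetical (a real check would fill `GENE_CHROM` from Ensembl rather than by hand).

```python
# Rows as reported in this comment (SNP, its chromosome, its upstream gene ID).
ROWS = [
    {"snp": "rs5758209", "chrom": "22", "pos": 41065861,
     "upstream_gene_id": "ENSG00000201314"},
    {"snp": "rs1705773", "chrom": "12", "pos": 34016940,
     "upstream_gene_id": "ENSG00000201314"},
]

# Gene ID -> chromosome, as stated in this comment (not re-verified here).
GENE_CHROM = {"ENSG00000201314": "4"}


def chromosome_mismatches(rows, gene_chrom):
    """Return (snp, snp_chrom, gene_id, gene_chrom) where the chromosomes differ."""
    flagged = []
    for row in rows:
        gene_id = row["upstream_gene_id"]
        chrom = gene_chrom.get(gene_id)
        # Skip IDs with unknown location rather than guessing.
        if chrom is not None and chrom != row["chrom"]:
            flagged.append((row["snp"], row["chrom"], gene_id, chrom))
    return flagged


mismatches = chromosome_mismatches(ROWS, GENE_CHROM)
for item in mismatches:
    print(item)
```

An upstream or downstream gene on a different chromosome from the SNP can never be correct, so any row this returns is a mapping error regardless of which gene ID should have been assigned.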

ljwh2 avatar Jan 08 '24 15:01 ljwh2

CHR_ID: 22
CHR_POS: 41065861
REPORTED GENE(S): ST13
MAPPED_GENE: Y_RNA - ACTBP15
UPSTREAM_GENE_ID: ENSG00000201749
DOWNSTREAM_GENE_ID: ENSG00000213857
SNP_GENE_IDS: (empty)
UPSTREAM_GENE_DISTANCE: 215
DOWNSTREAM_GENE_DISTANCE: 8319
STRONGEST SNP-RISK ALLELE: rs5758209-T
SNPS: rs5758209

@ljwh2 the above is the catalog download file entry for the SNP 'rs5758209'. It shows an upstream gene ID of 'ENSG00000201749'. When you say the upstream ID is 'ENSG00000201314', where did you get this value from?

sajo-ebi avatar Jan 24 '24 16:01 sajo-ebi

The problem is that there are several genes all called Y_RNA on different genomic locations, each having a different gene ID. But we give the same gene ID for all.

Based on the information so far, my guess is that each time the mapping runs, it picks one gene ID (presumably the first one it encounters) and applies it to every instance of gene name = Y_RNA. The original bug report said that all were mapped to ENSG00000199357; in December I found they were all mapped to ENSG00000201314; Sajo found they were mapped to ENSG00000201749; and in the latest release they are mapped to ENSG00000201343. On every occasion they are all mapped to a single gene ID, when there should be several different ones.
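
To illustrate the kind of failure I am guessing at (this is not taken from the pipeline code), here is a small sketch of how a lookup keyed on gene name alone collapses every Y_RNA to one ID. The function names are hypothetical, and the IDs and chromosomes in `RECORDS` are the ones I believe are correct for three example SNPs, not independently verified values.

```python
# (gene name, chromosome, Ensembl gene ID); values are this thread's suggestions.
RECORDS = [
    ("Y_RNA", "22", "ENSG00000199515"),
    ("Y_RNA", "12", "ENSG00000201624"),
    ("Y_RNA", "1", "ENSG00000199595"),
]


def ids_by_name(records):
    """Suspected failure mode: one ID per gene name, first one seen wins."""
    lookup = {}
    for name, _chrom, gene_id in records:
        lookup.setdefault(name, gene_id)
    return lookup


def ids_by_name_and_chrom(records):
    """Keying on more than the name keeps the distinct IDs apart."""
    return {(name, chrom): gene_id for name, chrom, gene_id in records}


by_name = ids_by_name(RECORDS)
by_locus = ids_by_name_and_chrom(RECORDS)

print(by_name)        # a single entry for Y_RNA
print(len(by_locus))  # one entry per record
```

If the input order changes between runs, a different ID "wins" in the name-keyed version, which would fit the ID changing from release to release. Note that name plus chromosome is still not a safe key when several Y_RNA genes lie on the same chromosome; carrying the gene ID straight from the positional lookup avoids the name step altogether.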

rs5758209 - SNP maps to genomic location chr22:41065861, upstream gene Y_RNA. I think the gene ID should be ENSG00000199515.

rs1705773 - SNP maps to genomic location chr12:34016940, upstream gene Y_RNA. I think the gene ID should be ENSG00000201624.

rs6671332 - SNP maps to genomic location chr1:161702661, upstream gene Y_RNA. I think the gene ID should be ENSG00000199595.

ljwh2 avatar Feb 15 '24 17:02 ljwh2