
mapping inconsistencies for snp x snp interactions

Open jdhayhurst opened this issue 2 years ago • 4 comments

From gwas-info:

I have a question on how the variants are mapped/recorded in the associations dataset. I feel there might be some inconsistencies in how the rs identifiers of the variants are mapped to the genome, or how this information is exported. I noticed this when I was looking at studies reporting SNP x SNP interactions. When looking at the table of variants and the available mappings for study GCST010340 it looks like this:

+---------------+---------------------------+------+--------------------+
|study_accession|strongest_snp_risk_allele  |chr_id|chr_pos             |
+---------------+---------------------------+------+--------------------+
|GCST010340     |rs11883068-? x rs11004362-?|10    |54542614            |
|GCST010340     |rs717954-? x rs131101-?    |22    |47593511            |
|GCST010340     |rs10515465-? x rs483643-?  |5     |2597047             |
|GCST010340     |rs9783424-? x rs12648659-? |4     |136183454           |
|GCST010340     |rs11207233-? x rs11617400-?|13    |51003439            |
|GCST010340     |rs724098-? x rs6864266-?   |15 x 5|39480558 x 169071672|
|GCST010340     |rs2044480-? x rs10433900-? |4     |113040655           |
|GCST010340     |rs2820082-? x rs1481963-?  |11    |103523199           |
|GCST010340     |rs9783424-? x rs12130751-? |1     |51699553            |
|GCST010340     |rs6443359-? x rs7995214-?  |13    |35790619            |
|GCST010340     |rs2112963-? x rs9967884-?  |null  |null                |
|GCST010340     |rs753620-? x rs1367319-?   |8     |4371989             |
|GCST010340     |rs11564100-? x rs7398412-? |12    |5937683             |
|GCST010340     |rs35608719-? x rs11976985-?|7     |31794304            |
|GCST010340     |rs16982730-? x rs17097403-?|11    |102064136           |
|GCST010340     |rs2892152-? x rs2167460-?  |null  |null                |
|GCST010340     |rs11646672-? x rs362064-?  |22    |20378519            |
|GCST010340     |rs7897805-? x rs285040-?   |13    |98103808            |
|GCST010340     |rs3915139-? x rs10758721-? |9     |5928434             |
|GCST010340     |rs12620507-? x rs17193610-?|3     |1522260             |
+---------------+---------------------------+------+--------------------+
only showing top 20 rows

We can identify three distinct cases:

  • Both variants have mappings, e.g. rs724098-? x rs6864266-?. This makes sense and is consistent with cases where multiple variants are reported for a single association: by position in the list, we can link each rsID to its corresponding mapping.
  • No mapping is reported, e.g. rs2112963-? x rs9967884-? [1][2]. I understand that this can happen; however, some of the examples I looked at do have actual mappings. Moreover, in both cases Ensembl correctly links the phenotype to the GWAS Catalog.
  • Only one of the variants is mapped to the genome, e.g. rs717954-? x rs131101-?. This is the case for most of the variants, even though both variants have mappings [3][4]. In this case, we have no means of linking the available mapping to either of the rsIDs.
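The three cases above can be told apart mechanically by comparing how many rsIDs and how many coordinates an association row carries. The following is only a minimal sketch: the function names are hypothetical, and it assumes that interacting variants and their coordinates are separated by " x " and that a missing mapping appears as null (None) or the string "null", as in the table above.

```python
from typing import List, Optional


def split_interaction(value: Optional[str]) -> List[str]:
    """Split an ' x '-separated field; None, '' or 'null' yields an empty list."""
    if value is None or value.strip().lower() in ("", "null"):
        return []
    return [part.strip() for part in value.split(" x ")]


def classify(strongest_snp_risk_allele: str,
             chr_id: Optional[str],
             chr_pos: Optional[str]) -> str:
    """Label a row as 'fully_mapped', 'partially_mapped' or 'unmapped'."""
    snps = split_interaction(strongest_snp_risk_allele)
    chroms = split_interaction(chr_id)
    positions = split_interaction(chr_pos)
    if not positions:
        # No coordinate at all was exported for this association.
        return "unmapped"
    if len(positions) == len(snps) and len(chroms) == len(snps):
        # One coordinate per rsID, so each can be linked by list position.
        return "fully_mapped"
    # Fewer coordinates than rsIDs: the mapping cannot be attributed to an rsID.
    return "partially_mapped"
```

Counting rows per label for a study would give a quick measure of how many interactions fall into the ambiguous "partially_mapped" case.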

For me, this suggests there might be some fragility in the mapping pipeline, which sometimes fails silently, but this is just my gut feeling. However, the discrepancies demonstrated above are certainly real and go beyond SNP x SNP interactions. The accuracy of these mappings is becoming increasingly important for us, as we won't be able to rescue associations based on their rs identifiers anymore (GnomAD 3 is just prohibitively large).

Thank you so much for looking into this issue.

[1] rs2112963: 5:40214036-40214036 -> https://rest.ensembl.org/variation/human/rs2112963.json?phenotypes=1
[2] rs9967884: 2:30265317-30265317 -> https://rest.ensembl.org/variation/human/rs9967884.json?phenotypes=1
[3] rs717954: 5:149601466-149601466 -> https://rest.ensembl.org/variation/human/rs717954.json?phenotypes=1
[4] rs131101: 22:47593511-47593511 -> https://rest.ensembl.org/variation/human/rs131101.json?phenotypes=1

I can confirm that it doesn't look correct. Based on a quick look at the associations table (https://www.ebi.ac.uk/gwas/studies/GCST010340), I notice a pattern: if the first SNP is not mapped to a gene, the second SNP will not be mapped to a position. Could it be that there's a simple exception handling bug in the code?

jdhayhurst avatar Aug 23 '22 14:08 jdhayhurst

Same issue as goci#577?

ljwh2 avatar Aug 23 '22 16:08 ljwh2

They could be related, but the issue here is not the gene mappings, it's the coordinate mappings

jdhayhurst avatar Aug 24 '22 08:08 jdhayhurst

Hi @ljwh2 , @jdhayhurst !

Sorry for the unsolicited comment, but I think it would be better to make things a bit more clear: it seems there's some fragility in the mapping pipelines when the rsIDs are being resolved to chromosome basepair location via the REST API of Ensembl. I noticed this behaviour when I took a look at the above example. However I know it is very difficult to troubleshoot such a bug, so I collected more examples.

I fetched all associations from the following releases (I was implicitly assuming that there was at least one remapping event between these releases, although I cannot be sure for obvious reasons):

  • 22_08_30
  • 22_01_26
  • 21_08_17
  • 21_01_15
  • 20_03_09

I then extracted all the rsIDs and mappings from all files and joined the files together by rsID. The result shows in which releases a given rsID was successfully mapped. Apparently there is a large number (2k+) of variants for which the mapping was not consistent across these releases.

An example:

+-----------+--------+--------+--------+--------+--------+
|       rsId|22_08_30|22_01_26|21_08_17|21_01_15|20_03_09|
+-----------+--------+--------+--------+--------+--------+
|  rs7542186|    True|    True|    True|    True|   False|
| rs10980231|    True|    True|    True|    True|   False|
|    rs38059|    True|    True|    True|    True|   False|
|   rs273966|    True|    True|    True|    True|   False|
|  rs7141987|    True|    True|    True|    True|   False|
| rs28525613|    True|    True|    True|    True|   False|
| rs13017207|    True|    True|    True|    True|   False|
| rs13182707|   False|   False|   False|    True|    True|
|   rs531071|    True|    True|    True|    True|   False|
|rs117259798|    True|    True|    True|    True|   False|
| rs13221259|    True|    True|    True|    True|   False|
|  rs7084828|    True|    True|    True|    True|   False|
|  rs9591325|    True|    True|   False|    True|    True|
| rs10773302|    True|    True|    True|    True|   False|
| rs12472381|    True|    True|    True|    True|   False|
|  rs7166435|    True|    True|    True|    True|   False|
|rs117243822|    True|    True|    True|    True|   False|
|  rs4434676|   False|   False|   False|    True|    True|
|  rs6563812|    True|    True|    True|    True|   False|
|  rs6882842|    True|    True|    True|    True|   False|
+-----------+--------+--------+--------+--------+--------+
only showing top 20 rows
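The cross-release comparison described above can be sketched roughly as follows. This is not the code that produced the table, only an illustration in plain Python. The function names are hypothetical, and it assumes each release has already been reduced to (rsID, is_mapped) pairs and that an rsID counts as mapped in a release if any of its rows has a coordinate.

```python
from typing import Dict, Iterable, List, Tuple


def mapping_status_by_release(
    releases: Dict[str, Iterable[Tuple[str, bool]]]
) -> Dict[str, Dict[str, bool]]:
    """Build {rsid: {release_label: is_mapped}} from per-release rows."""
    status: Dict[str, Dict[str, bool]] = {}
    for label, rows in releases.items():
        for rsid, is_mapped in rows:
            per_rsid = status.setdefault(rsid, {})
            # Assumption: mapped in a release if any row for the rsID is mapped.
            per_rsid[label] = per_rsid.get(label, False) or is_mapped
    return status


def inconsistent_rsids(status: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return rsIDs whose mapped/unmapped status differs between releases."""
    return sorted(
        rsid for rsid, by_release in status.items()
        if len(set(by_release.values())) > 1
    )


if __name__ == "__main__":
    # Two rows taken from the table above, restricted to two releases.
    example = {
        "22_08_30": [("rs7542186", True), ("rs13182707", False)],
        "20_03_09": [("rs7542186", False), ("rs13182707", True)],
    }
    print(inconsistent_rsids(mapping_status_by_release(example)))
```

An rsID that is absent from a release is simply not compared for that release, so variants added to the Catalog later are not flagged as inconsistent.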

I suspect that sometimes, if the REST API endpoint doesn't respond quickly enough, the application assumes there is no mapping. From experience, the Ensembl API is not particularly robust, especially the simple GET endpoints. But this is just an assumption. I'm wondering how complicated it would be to migrate to static sources based on database dumps or flat files (we are mapping our variant set to the downloadable GnomAD3 dataset via a join).
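If the suspicion above is right, one way to avoid the silent failure is to keep "the service answered and has no mapping" separate from "the service did not answer". The sketch below is not how the GOCI pipeline works; it only illustrates the idea. `fetch` is a hypothetical callable that returns the parsed JSON for an rsID (or None if the variant is unknown), and the `mappings` / `location` field names reflect my understanding of the Ensembl variation endpoint response and should be checked against the API documentation.

```python
import time
from typing import Callable, List, Optional


class LookupFailed(Exception):
    """The lookup could not be completed; this is not the same as 'no mapping'."""


def resolve_locations(rsid: str,
                      fetch: Callable[[str], Optional[dict]],
                      retries: int = 3,
                      backoff_seconds: float = 1.0) -> List[str]:
    """Return locations such as '22:47593511-47593511', retrying on failures."""
    last_error = None
    for attempt in range(retries):
        try:
            record = fetch(rsid)
        except (TimeoutError, ConnectionError) as error:
            # Transient failure: wait with exponential backoff, then retry.
            last_error = error
            time.sleep(backoff_seconds * (2 ** attempt))
            continue
        if record is None:
            # The service answered and does not know this rsID.
            return []
        return [m["location"] for m in record.get("mappings", [])]
    # Never report an empty mapping just because the service was slow.
    raise LookupFailed(f"{rsid}: no response after {retries} attempts") from last_error
```

With this split, a caller can store an empty list as a genuine "unmapped" result and queue any `LookupFailed` rsIDs for a later retry instead of exporting them as null.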

The full list of such ambiguous cases can be downloaded from here.

Thank you so much again for looking into the issue for us!

DSuveges avatar Aug 31 '22 16:08 DSuveges

Thanks @DSuveges - I think using a database dump or one of the Ensembl db mirrors is a much better idea than the REST API, which could indeed be the reason why we're seeing these inconsistencies. Tagging @sprintell as he's doing the investigation into the mapping pipeline.

jdhayhurst avatar Sep 01 '22 09:09 jdhayhurst

@sajo-ebi many of these are fixed but some still give only one location, where two should be reported (e.g. rs784411 x rs12418451). However, a quick check suggests this could be because of the Ensembl error for variants with multiple locations (goci#1157). @ljwh2 to review again when that issue is resolved.

ljwh2 avatar Oct 05 '23 16:10 ljwh2

@ljwh2 is this resolved now?

sprintell avatar Nov 01 '23 11:11 sprintell

Verified: all SNP x SNP interactions that can be mapped to a gene have been successfully mapped in the last release. In a very few cases (e.g. rs11008099, rs10830963), no location has been returned even though the variants exist in Ensembl. I will add these to a separate ticket; this one can be closed.

ljwh2 avatar Nov 07 '23 18:11 ljwh2