goci
Investigate & fix duplication of SNPs
Occasionally SNPs are duplicated during the curation process. It looks like this happens on import to Oracle.
If the studies are not yet published, the data release (DR) breaks.
If the studies are published, this causes issues in prod because the SNP is listed twice in the UI and the download, for example in GCST90085780.


Recent examples are rs71543110 and rs199679345; see also goci#719.
A quick analysis of the associations download shows 83 SNPs that are duplicated in prod. A quick check suggests that all of them are merged SNPs in our UI, with the variant ID appearing as rs1 (rs2) in the search snippet, but I can't verify this in Ensembl.

All but 3 of them have the "merged" flag set to 0.
All are included in studies published in the Catalog after March 2022, although there are also merged SNPs added recently that are correctly represented in the UI, e.g. https://www.ebi.ac.uk/gwas/search?query=rs138055607. Note that this is around the time we switched to using depo-curation for the routine curation workflow.
The full list of associations with duplicate SNPs (411 associations, 83 SNPs) is attached: assocs with SNP duplication.xlsx
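The duplicate check described above could be sketched roughly as follows. This is only an illustration, not the analysis that produced the attached list: it assumes the associations download is a TSV with `STUDY ACCESSION` and `SNPS` columns, that a duplicated SNP shows up as the same rsID repeated within one row's `SNPS` field, and that multiple SNPs in a field are separated by `;` or ` x `. The sample rows use placeholder IDs, not real Catalog data.

```python
import csv
import io
import re
from collections import Counter

# Assumed column names; adjust to the actual header of the download file.
STUDY_COL = "STUDY ACCESSION"
SNPS_COL = "SNPS"


def find_duplicated_snps(tsv_text):
    """Return {rsID: [study accessions]} for rsIDs repeated within a single row."""
    duplicated = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Assumption: SNPs within one association are separated by ';' or ' x '.
        snps = [s.strip() for s in re.split(r";| x ", row[SNPS_COL]) if s.strip()]
        for rsid, count in Counter(snps).items():
            if count > 1:
                duplicated.setdefault(rsid, []).append(row[STUDY_COL])
    return duplicated


# Placeholder rows for illustration only.
sample = (
    "STUDY ACCESSION\tSNPS\n"
    "GCST_A\trs111; rs111\n"
    "GCST_A\trs222\n"
    "GCST_B\trs111; rs111\n"
    "GCST_B\trs333; rs444\n"
)

result = find_duplicated_snps(sample)
n_snps = len(result)
n_assocs = sum(len(studies) for studies in result.values())
print(f"{n_snps} duplicated SNP(s) across {n_assocs} association(s)")
```

If the duplication appears in the download in some other form (for example as repeated rows), the grouping key would need to change accordingly.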
This needs investigating and fixing such that:
- curators can extract SNPs as described in papers, which may include old or new rsIDs
- unpublished SNPs do not break the DR
- published SNPs appear only once in the UI & download
rs71543110, found in pmid:37500982 (GCST90321118) with status "level 2 curation done", was causing issues during the DR. For the current ongoing DR, I deleted all associations for this particular GCST. Also attached is the list of SNPs from this publication that were not found in Ensembl: SNP_not found in ensembl.xlsx
This PMID, which contains rs71543110, was published and didn't cause any issues during the DR, but the SNP was duplicated in the UI.
Another example that caused the DR to fail: Association 131528254, Accession Id 'GCST90428059', Study Id '131528155', RsId 'rs575623373'
There's a join table called SNP_MERGED_SNP that contains duplicate rows. This was the case for every study I checked, both the published ones with duplicate SNPs and the unpublished ones causing issues in the DR. I think deleting the duplicates would solve both issues; we can test this in the next DR.
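As a rough illustration of the proposed cleanup, the sketch below finds and deletes duplicate rows in a join table, keeping one copy of each pair. It runs against an in-memory SQLite database as a stand-in for Oracle, and the column names `snp_id_merged` and `snp_id_current` are hypothetical; the real SNP_MERGED_SNP schema may differ.

```python
import sqlite3

# In-memory SQLite stand-in for the Oracle join table.
# Column names are hypothetical, chosen only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp_merged_snp (snp_id_merged INTEGER, snp_id_current INTEGER)"
)
conn.executemany(
    "INSERT INTO snp_merged_snp VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (3, 30), (3, 30), (3, 30)],
)


def find_duplicates(conn):
    """List (merged, current, count) for every pair stored more than once."""
    return conn.execute(
        "SELECT snp_id_merged, snp_id_current, COUNT(*) "
        "FROM snp_merged_snp "
        "GROUP BY snp_id_merged, snp_id_current "
        "HAVING COUNT(*) > 1 "
        "ORDER BY snp_id_merged"
    ).fetchall()


def delete_duplicates(conn):
    """Keep the first row of each pair and delete the rest; return rows deleted."""
    cur = conn.execute(
        "DELETE FROM snp_merged_snp WHERE rowid NOT IN ("
        "SELECT MIN(rowid) FROM snp_merged_snp "
        "GROUP BY snp_id_merged, snp_id_current)"
    )
    return cur.rowcount


before = find_duplicates(conn)
removed = delete_duplicates(conn)
after = find_duplicates(conn)
remaining = conn.execute("SELECT COUNT(*) FROM snp_merged_snp").fetchone()[0]
print(f"duplicate pairs before: {len(before)}, rows removed: {removed}, rows left: {remaining}")
```

Running the duplicate query before and after the delete gives a simple check that the cleanup worked. A unique constraint over the pair of columns might additionally stop new duplicates from being inserted, though whether that fits the existing schema is untested here.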
Check after the DR; if successful, make changes to the mapping pipeline.
This DR should fix all the existing duplicates and the DR purge issues for unpublished studies. @sajo-ebi to add code to prevent duplication from happening in the mapping pipeline.