
Investigate & fix duplication of SNPs

Open ljwh2 opened this issue 11 months ago • 6 comments

Occasionally SNPs are duplicated during the curation process. It looks like this happens on import to Oracle.

If the studies are not yet published, the data release (DR) breaks.

If the studies are published, this causes issues in prod, as the SNP is listed twice in the UI and the download, for example in GCST90085780:

Screenshot 2024-03-27 at 13.44.07.png Screenshot 2024-03-27 at 13.36.33.png

Some recent examples are rs71543110, rs199679345, see also goci#719

I did some quick analysis of the associations download; it looks like 83 SNPs are duplicated in prod. A quick check suggests all of these are merged SNPs in our UI, with the variant ID appearing as rs1 (rs2) in the search snippet, but I can't verify this in Ensembl.

Screenshot 2024-03-27 at 14.03.16.png

All but 3 of them have the "merged" flag set to 0.

All are included in studies published in the Catalog after March 2022, although there are also examples of merged SNPs added recently that are correctly represented in the UI, e.g. https://www.ebi.ac.uk/gwas/search?query=rs138055607. Note that March 2022 is around the time we switched to using depo-curation for the routine curation workflow.

The full list of associations with duplicate SNPs (411 associations, 83 SNPs) is attached: assocs with SNP duplication.xlsx
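For anyone wanting to repeat this kind of check, here is a minimal sketch in Python. It assumes the associations download is a tab-separated file and that duplication shows up as rows repeated verbatim; the column names `SNPS` and `STUDY ACCESSION` and the sample identifiers are assumptions for illustration, not taken from the analysis above.

```python
import csv
import io
from collections import Counter


def duplicated_snps(tsv_text, snp_col="SNPS", study_col="STUDY ACCESSION"):
    """Return {(study, snp): n_rows} for rows that occur more than once verbatim."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    row_counts = Counter()
    for row in reader:
        # Key on the whole row so that distinct associations for the same SNP
        # within a study (e.g. different p-values) are not flagged.
        row_counts[tuple(row.items())] += 1

    dupes = {}
    for key, n in row_counts.items():
        if n > 1:
            row = dict(key)
            pair = (row[study_col], row[snp_col])
            dupes[pair] = dupes.get(pair, 0) + n
    return dupes


# Hypothetical sample data: one association row repeated twice.
sample = (
    "STUDY ACCESSION\tSNPS\tP-VALUE\n"
    "GCST_A\trs1\t1e-9\n"
    "GCST_A\trs1\t1e-9\n"
    "GCST_A\trs2\t3e-8\n"
)
print(duplicated_snps(sample))
```

If duplicates differ in some column (for example an internal ID), the key would need to be restricted to the columns that define an association.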

This needs investigating and fixing such that:
- curators can extract SNPs as described in papers, which may include old or new rsIDs
- unpublished SNPs do not break the DR
- published SNPs appear only once in the UI & download

ljwh2 avatar Mar 27 '24 13:03 ljwh2

rs71543110, found in pmid:37500982 (GCST90321118) with status level 2 curation done, was causing issues during DR. For the current ongoing DR, I deleted all associations for this particular GCST. Also attached is the list of SNPs from this publication that were not found in Ensembl. SNP_not found in ensembl.xlsx

Santhi1901 avatar Mar 27 '24 15:03 Santhi1901

The publication containing rs71543110 was published and didn't cause any issues during the DR. The SNP was duplicated in the UI, though.

Santhi1901 avatar Apr 16 '24 14:04 Santhi1901

Another example that caused the DR to fail: Association 131528254, Accession Id 'GCST90428059', Study Id '131528155', RsId 'rs575623373'

ljwh2 avatar Apr 19 '24 12:04 ljwh2

There's a join table called SNP_MERGED_SNP which contains duplicate values. This was the case for every study I have checked, both the published ones with duplicate SNPs and the unpublished ones with DR issues. I think deleting the duplicates would solve both issues; we can test this in the next DR.
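As a rough illustration of what deleting the duplicates could look like, here is a sketch using an in-memory SQLite database as a stand-in for Oracle. The column names `SNP_ID_MERGED` and `SNP_ID_CURRENT` are hypothetical (the real table's columns are not given here), and the statement has not been run against the production schema.

```python
import sqlite3


def dedupe_join_table(conn):
    """Delete all but one copy of each (merged, current) pair; return rows removed."""
    cur = conn.execute(
        """
        DELETE FROM SNP_MERGED_SNP
        WHERE rowid NOT IN (
            SELECT MIN(rowid)
            FROM SNP_MERGED_SNP
            GROUP BY SNP_ID_MERGED, SNP_ID_CURRENT
        )
        """
    )
    conn.commit()
    return cur.rowcount


# Hypothetical data: the pair (1, 2) is stored three times.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SNP_MERGED_SNP (SNP_ID_MERGED INTEGER, SNP_ID_CURRENT INTEGER)"
)
conn.executemany(
    "INSERT INTO SNP_MERGED_SNP VALUES (?, ?)",
    [(1, 2), (1, 2), (3, 4), (1, 2)],
)
removed = dedupe_join_table(conn)
remaining = conn.execute(
    "SELECT SNP_ID_MERGED, SNP_ID_CURRENT FROM SNP_MERGED_SNP ORDER BY 1, 2"
).fetchall()
print(removed, remaining)
```

Running a `SELECT ... GROUP BY ... HAVING COUNT(*) > 1` first would show how many pairs are affected before anything is deleted.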

ala-ebi avatar Jul 03 '24 12:07 ala-ebi

Check after the DR; if it is successful, make changes to the mapping pipeline.

ljwh2 avatar Jul 10 '24 10:07 ljwh2

This DR should fix all the existing duplicates and the DR purge issues for unpublished studies. @sajo-ebi to add code to prevent the duplication from happening in the mapping pipeline.
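One possible shape for such a guard, sketched in Python with SQLite purely for illustration: a uniqueness constraint on the join table plus an insert that skips pairs that already exist. The column names and the helper `link_merged_snp` are hypothetical, and `INSERT OR IGNORE` is SQLite syntax; in Oracle the same idea would need a unique constraint combined with something like a `MERGE` statement. This is not the mapping pipeline's actual code.

```python
import sqlite3


def create_link_table(conn):
    # The UNIQUE constraint makes the database itself reject repeated pairs.
    conn.execute(
        """
        CREATE TABLE SNP_MERGED_SNP (
            SNP_ID_MERGED  INTEGER NOT NULL,
            SNP_ID_CURRENT INTEGER NOT NULL,
            UNIQUE (SNP_ID_MERGED, SNP_ID_CURRENT)
        )
        """
    )


def link_merged_snp(conn, merged_id, current_id):
    """Insert the link if it is new; return True when a row was actually added."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO SNP_MERGED_SNP VALUES (?, ?)",
        (merged_id, current_id),
    )
    return cur.rowcount == 1


guard_conn = sqlite3.connect(":memory:")
create_link_table(guard_conn)
first = link_merged_snp(guard_conn, 1, 2)   # new pair
second = link_merged_snp(guard_conn, 1, 2)  # same pair again, e.g. on a re-run
total = guard_conn.execute("SELECT COUNT(*) FROM SNP_MERGED_SNP").fetchone()[0]
print(first, second, total)
```

Putting the constraint in the schema means a re-run of the mapping step cannot reintroduce duplicates even if the application-level check is missed.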

ala-ebi avatar Jul 17 '24 09:07 ala-ebi