MungeSumstats icon indicating copy to clipboard operation
MungeSumstats copied to clipboard

Large number of non-biallelic SNPs

Open bschilder opened this issue 1 year ago • 14 comments

I'm munging a bunch of OpenGWAS data with MSS, and i noticed that the number of variants dropped due to being non-biallelic was quite high (hundreds of thousands-millions). In some cases, up to half the original SNPs are being dropped because they’re non-biallelic.

Reprex

gwas_paths <- MungeSumstats::import_sumstats(
  ids = "ieu-a-8", 
  save_dir = here::here("data/GWAS_sumstats"), 
  nThread = 30, # >30 causes issues with read_vcf_parallel
  parallel_across_ids = FALSE, 
  force_new_vcf = FALSE,
  force_new = FALSE,
  vcf_download = TRUE,
  vcf_dir = here::here("data/VCFs"), 
  #### Record logs
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE,
  ) 

Example 1: ieu-a-8

In this case, a GWAS with 2.4M variants had 1.2M+ non-bilallelic variants removed: mss.txt

Example 2: Vuckovic2020

Here’s another where 5M/44M variant were considered non-biallelic: mss_vuckovic.txt

Literature

  • Campbell2016: Notes that even at the higher estimates we’d only expect multi-allelic sites in 3% of SNPs.
  • Phillips2020: Found only 0.3% of SNPs in 1KG Phase 3 were tri-allelic.
  • Phillips2016: Found only 961 SNPs in 1KG Phase 3 were tetra-allelic.

Possible explanations

  1. MSS is over-attributing multi-allelic status to variants somehow,
  2. The updates to Bioc's dbSNP ref contains lots more reported multi-allelic sites that in previous versions.
  3. Something went awry when making the latest Bioc dbSNP dataset.
  4. This is the true number of multi-allelic variants and it is far above the estimates from single-cohort studies (e.g. 1KGp3) bc it incorporates info from across many different datasets.

To do

  • [x] Double check relevant MSS code. @Al-Murphy
  • [x] Compare number of multi-allelic SNPs in old vs new Bioc dbSNP datasets. @Al-Murphy
  • [ ] Check the %/# non-ballelic variants that were filtered in previous versions of MSS. @bschilder
  • [x] Check the %/# non-ballelic variants that were filtered using the latest version of MSS. @bschilder

bschilder avatar Aug 04 '22 17:08 bschilder

@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?

Al-Murphy avatar Aug 04 '22 19:08 Al-Murphy

@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?

Sure!

bschilder avatar Aug 04 '22 19:08 bschilder

I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:

#MSS showing a lot is non-biallelic dbSNP 155 vs 144
sumstats_dt <- MungeSumstats::formatted_example()
rsids <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
                                              ref_genome = "GRCH38",
                                              dbSNP=144)
rsids155 <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
                                                 ref_genome = "GRCH38",
                                                 dbSNP=155)

# get chars for SNPs not bi/tri allelic
# or strand ambig from IUPAC_CODE_MAP
nonambig_IUPAC_CODE_MAP <-
  names(Biostrings::IUPAC_CODE_MAP[nchar(
    Biostrings::IUPAC_CODE_MAP
  ) < 3])

data.table::setkey(rsids, SNP)
data.table::setkey(rsids155, SNP)
bad_ids <- rsids[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids))
bad_ids155 <- rsids155[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids155))
print(bad_ids155)

There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on

Al-Murphy avatar Aug 04 '22 19:08 Al-Murphy

I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:

Ok, well that at least narrows thing down. Thanks!

There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on

That would be great, thanks!

bschilder avatar Aug 04 '22 19:08 bschilder

I randomly picked 3 SNPs from my example code above and checked them in NCBI dbSNP online portal (dbSNP 155) and they all appear to be non-biallelic as described:

rs9824539 rs9556958 rs7854982

I know this is by no means definitive but does seem like it is doing things correctly

Al-Murphy avatar Aug 04 '22 20:08 Al-Murphy

Had a look into all the RS ID's from dbSNP 144 to check the proportion that are non-biallelic across the dbSNP 155 database and 144. I used an approach separate to MSS to ensure the issue isn't there. This approach was given by @hpages (see below).

It appears there is just a substantial number more non-biallelic SNPs in 155. This doesn't seem to be a MSS issue and just reflects the changes in the versions.

I'll post about this in https://github.com/hpages/SNPlocsForge in case anyone else has further insight but I believe the field's focus on biallelic SNPs may need to change as the number of non-biallelic SNPs increase. ~24% is a large proportion of SNPs to exclude.

I'll leave this issue open for now for others to see.

#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt", 
                            required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)

library(BSgenome)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
  pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
  pkgenv <- loadNamespace(pkgname)
  snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
  snps <- snpsById(snplocs, rsids, ifnotfound="drop")
  unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
                  mcols(snps)$RefSNP_id)[rsids])
}

dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
        match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
        match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
        match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
        match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match         N         prop
#1:          both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic  32266661 0.2382168183
#3:      both non-biallelic   2565775 0.0189424855
#4: dbSNP 144 non-biallelic     13116 0.0000968322

Al-Murphy avatar Aug 08 '22 13:08 Al-Murphy

Posted on Biostars about this issue and got some replies too - https://www.biostars.org/p/9534249/#9534468

Generally, seems like people agree that the observed numbers are true

Al-Murphy avatar Aug 12 '22 07:08 Al-Murphy

I'm not conviced that dropping anything marked as multi-allelic in dbSNP is a sensible idea. As I understand it, the reasons for removing multi-allelic SNPs could be the following:

  1. The downstream tool cannot handle a table row defining a SNP with multiple alternate alleles.
  2. The downstream tool assumes all input is bi-allelic and calculates things such as allele frequency as 1 - effect_allele_frq or uses a statistical model that would have assumptions violated by multi-allelic variants.

For point 1, it could well be valid to simply split the row into two rows f.e.g:

POS   REF   ALT
123   A     G
123   A     T

(Although in some cases having duplicated genomic location could be an issue.)

For point 2, bi-allelic vs multi-allelic should only matter within the dataset. If a study sample includes only two alleles at a genomic location, then it doesn't matter that dbSNP has a third allele on record from a different dataset, as this doesn't affect anything within the study.

ilevantis avatar Sep 02 '22 15:09 ilevantis

I think those 2 points highly depend on the downstream analysis use case. I do agree though, I feel dropping non-bi-allelic SNPs is no longer a sensible option given the increasing numbers.

I believe the options of splitting multi-allelic SNPs where the ALT contains more than one alternative allele to multiple rows and removing multi-allelic SNPs where more than one of the alternative alleles are in the sumstats itself are both very sensible. If you would like to have a go at adding functionality to do these to MSS as alternative multi-allelic approaches and submit a pull request that would be great! I think it would really benefit the wider community (I'd be happy to add you to the list of contributors too).

Al-Murphy avatar Sep 02 '22 16:09 Al-Murphy

OpenGWAS report: using dbSNP155

Here's a sampling of 10,863 of the most well-powered GWAS in OpenGWAS.

Absolute number

nonbiallelic

The number of non-biallelic SNPs dropped ranged from 177 to 8.6 million (with a mean of 1.1 million) for a single GWAS.

Screenshot 2022-09-17 at 11 42 33

Percent

As a percentage of the total SNPs in each GWAS to begin with:

plot_zoom_png Screenshot 2022-09-17 at 11 45 54

The average percentage of non-biallelic SNPS was 47%, ranging from ~0 to 63%.

Session info

R Under development (unstable) (2022-02-25 r81808) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS

Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale: [1] C

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] rappdirs_0.3.3 rtracklayer_1.57.0 scattermore_0.8
[4] R.methodsS3_1.8.2 SeuratObject_4.1.1 tidyr_1.2.0
[7] ggplot2_3.3.6 bit64_4.0.5 knitr_1.40
[10] irlba_2.3.5 DelayedArray_0.23.1 R.utils_2.12.0
[13] data.table_1.14.2 rpart_4.1.16 doParallel_1.0.17
[16] KEGGREST_1.37.3 RCurl_1.98-1.8 generics_0.1.3
[19] BiocGenerics_0.43.1 GenomicFeatures_1.49.6 RhpcBLASctl_0.21-247.1
[22] cowplot_1.1.1 RSQLite_2.2.17 RANN_2.6.1
[25] proxy_0.4-27 future_1.28.0 bit_4.0.4
[28] webshot_0.5.3 spatstat.data_2.2-0 xml2_1.3.3
[31] httpuv_1.6.5 SummarizedExperiment_1.27.2 assertthat_0.2.1
[34] orthogene_1.3.2 viridis_0.6.2 gargle_1.2.1
[37] scNLP_0.1.0 xfun_0.32 hms_1.1.2
[40] TSP_1.2-1 babelgene_22.3 evaluate_0.16
[43] promises_1.2.0.1 fansi_1.0.3 restfulr_0.0.15
[46] progress_1.2.2 readxl_1.4.1 dendextend_1.16.0
[49] caTools_1.18.2 dbplyr_2.2.1 igraph_1.3.4
[52] DBI_1.1.3 htmlwidgets_1.5.4 sparsesvd_0.2-1
[55] spatstat.geom_2.4-0 stats4_4.2.0 purrr_0.3.4
[58] ellipsis_0.3.2 dplyr_1.0.10 ggpubr_0.4.0
[61] backports_1.4.1 gprofiler2_0.2.1 aod_1.3.2
[64] biomaRt_2.53.2 deldir_1.0-6 MatrixGenerics_1.9.1
[67] MungeSumstats_1.5.13 vctrs_0.4.1 SingleCellExperiment_1.19.0
[70] Biobase_2.57.1 here_1.0.1 ROCR_1.0-11
[73] abind_1.4-5 cachem_1.0.6 BSgenome_1.65.2
[76] progressr_0.11.0 sctransform_0.3.4 GenomicAlignments_1.33.1
[79] treeio_1.21.2 prettyunits_1.1.1 goftest_1.2-3
[82] cluster_2.1.4 ExperimentHub_2.5.0 ontologyIndex_2.10
[85] ape_5.6-2 lazyeval_0.2.2 crayon_1.5.1
[88] pkgconfig_2.0.3 GenomeInfoDb_1.33.5 seriation_1.3.6
[91] nlme_3.1-159 ewceData_1.5.0 rlang_1.0.5
[94] globals_0.16.1 lifecycle_1.0.2 miniUI_0.1.1.1
[97] registry_0.5-1 filelock_1.0.2 BiocFileCache_2.5.0
[100] AnnotationHub_3.5.0 cellranger_1.1.0 rprojroot_2.0.3
[103] polyclip_1.10-0 matrixStats_0.62.0 lmtest_0.9-40
[106] Matrix_1.4-1 aplot_0.1.6 carData_3.0-5
[109] boot_1.3-28 zoo_1.8-10 ggridges_0.5.3
[112] png_0.1-7 viridisLite_0.4.1 rjson_0.2.21
[115] rootSolve_1.8.2.3 phenomix_0.99.4 bitops_1.0-7
[118] R.oo_1.25.0 KernSmooth_2.23-20 Biostrings_2.65.3
[121] blob_1.2.3 stringr_1.4.1 parallelly_1.32.1
[124] spatstat.random_2.2-0 rstatix_0.7.0 MAGMA.Celltyping_2.0.6
[127] gridGraphics_0.5-1 S4Vectors_0.35.3 ggsignif_0.6.3
[130] scales_1.2.1 memoise_2.0.1 magrittr_2.0.3
[133] plyr_1.8.7 ica_1.0-3 gplots_3.1.3
[136] zlibbioc_1.43.0 compiler_4.2.0 BiocIO_1.7.1
[139] RColorBrewer_1.1-3 lme4_1.1-30 fitdistrplus_1.1-8
[142] homologene_1.4.68.19.3.27 Rsamtools_2.13.4 cli_3.4.0
[145] XVector_0.37.1 listenv_0.8.0 patchwork_1.1.2
[148] pbapply_1.5-0 mgcv_1.8-40 MASS_7.3-58.1
[151] tidyselect_1.1.2 stringi_1.7.8 yaml_2.3.5
[154] ggrepel_0.9.1 GeneOverlap_1.33.0 grid_4.2.0
[157] VariantAnnotation_1.43.3 lmom_2.9 tools_4.2.0
[160] future.apply_1.9.0 parallel_4.2.0 rstudioapi_0.14
[163] RNOmni_1.0.1 piggyback_0.1.4 foreach_1.5.2
[166] gridExtra_2.3 gld_2.6.5 Rtsne_0.16
[169] HGNChelper_0.8.1 digest_0.6.29 BiocManager_1.30.18
[172] rgeos_0.5-9 shiny_1.7.2 Rcpp_1.0.9
[175] GenomicRanges_1.49.1 car_3.1-0 broom_1.0.1
[178] BiocVersion_3.16.0 later_1.3.0 RcppAnnoy_0.0.19
[181] httr_1.4.4 ggdendro_0.1.23 AnnotationDbi_1.59.1
[184] Rdpack_2.4 colorspace_2.0-3 XML_3.99-0.10
[187] fs_1.5.2 tensor_1.5 reticulate_1.26
[190] IRanges_2.31.2 splines_4.2.0 uwot_0.1.14
[193] yulab.utils_0.0.5 expm_0.999-6 tidytree_0.4.0
[196] spatstat.utils_2.3-1 sp_1.5-0 Exact_3.1
[199] ggplotify_0.1.0 plotly_4.10.0 xtable_1.8-4
[202] jsonlite_1.8.0 nloptr_2.0.3 ggtree_3.5.3
[205] heatmaply_1.3.0 ggfun_0.0.7 R6_2.5.1
[208] EWCE_1.5.7 pillar_1.8.1 htmltools_0.5.3
[211] mime_0.12 glue_1.6.2 fastmap_1.1.0
[214] minqa_1.2.4 BiocParallel_1.31.12 class_7.3-20
[217] interactiveDisplayBase_1.35.0 codetools_0.2-18 mvtnorm_1.1-3
[220] utf8_1.2.2 lattice_0.20-45 spatstat.sparse_2.1-1
[223] tibble_3.1.8 pbkrtest_0.5.1 curl_4.3.2
[226] DescTools_0.99.46 leiden_0.4.2 gtools_3.9.3
[229] survival_3.4-0 limma_3.53.6 rmarkdown_2.16
[232] googleAuthR_2.0.0 munsell_0.5.0 e1071_1.7-11
[235] GenomeInfoDbData_1.2.8 iterators_1.0.14 variancePartition_1.27.3
[238] reshape2_1.4.4 gtable_0.3.1 rbibutils_2.2.9
[241] spatstat.core_2.4-4 Seurat_4.1.1

bschilder avatar Sep 17 '22 10:09 bschilder

Hi Murphy! I seem to face the same situation that a huge number of SNPs get removed for being non-biallelic. I was wondering if there is now an argument I can specify to keep the non-biallelic SNPs. Thanks in advance!

Lloyd-LiuSiyi avatar Nov 23 '22 05:11 Lloyd-LiuSiyi

Sorry for my ignorance, non-biallelic SNPs can be kept by specifying the bi_allelic_filter argument.

Lloyd-LiuSiyi avatar Nov 23 '22 05:11 Lloyd-LiuSiyi

Hi!

I'm, sorry for writing in this old issue, but I am looking for a solution for this and saw the conversation in the SNPlocsForge Repo, which was concluded some time ago, but I failed to understand the solution. I have a high number of non bi-allelic SNP dropped when using MungeSumtats, with all up-to-date packages from Bioconductor 3.18? Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?

Thanks for your help and the great work on this package!

Cheers!

cfbeuchel avatar Feb 02 '24 14:02 cfbeuchel

Hey,

using MungeSumtats, with all up-to-date packages from Bioconductor 3.18

Yes, this is the latest release version of MSS.

Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?

That's one solution, another one would be to keep non-bi-allelic SNPs (set bi_allelic_filter=FALSE when running format_sumstats()) - note this affects FRQ calculations if they need to be flipped etc. It really depends on what you plan on doing with the sumstats, some downstream tools like LDSC for example, will remove non-bi-allelic SNPs anyway (leaving just HapMap3 SNPs). Personally, I think you should keep non-bi-allelic SNPs unless you have a good reason to remove them.

Cheers, Alan.

Al-Murphy avatar Feb 02 '24 14:02 Al-Murphy