MungeSumstats
MungeSumstats copied to clipboard
Large number of non-biallelic SNPs
I'm munging a bunch of OpenGWAS data with MSS, and i noticed that the number of variants dropped due to being non-biallelic was quite high (hundreds of thousands-millions). In some cases, up to half the original SNPs are being dropped because they’re non-biallelic.
Reprex
gwas_paths <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
save_dir = here::here("data/GWAS_sumstats"),
nThread = 30, # >30 causes issues with read_vcf_parallel
parallel_across_ids = FALSE,
force_new_vcf = FALSE,
force_new = FALSE,
vcf_download = TRUE,
vcf_dir = here::here("data/VCFs"),
#### Record logs
log_folder_ind = TRUE,
log_mungesumstats_msgs = TRUE,
)
Example 1: ieu-a-8
In this case, a GWAS with 2.4M variants had 1.2M+ non-bilallelic variants removed: mss.txt
Example 2: Vuckovic2020
Here’s another where 5M/44M variant were considered non-biallelic: mss_vuckovic.txt
Literature
- Campbell2016: Notes that even at the higher estimates we’d only expect multi-allelic sites in 3% of SNPs.
- Phillips2020: Found only 0.3% of SNPs in 1KG Phase 3 were tri-allelic.
- Phillips2016: Found only 961 SNPs in 1KG Phase 3 were tetra-allelic.
Possible explanations
- MSS is over-attributing multi-allelic status to variants somehow,
- The updates to Bioc's dbSNP ref contains lots more reported multi-allelic sites that in previous versions.
- Something went awry when making the latest Bioc dbSNP dataset.
- This is the true number of multi-allelic variants and it is far above the estimates from single-cohort studies (e.g. 1KGp3) bc it incorporates info from across many different datasets.
To do
- [x] Double check relevant MSS code. @Al-Murphy
- [x] Compare number of multi-allelic SNPs in old vs new Bioc dbSNP datasets. @Al-Murphy
- [ ] Check the %/# non-ballelic variants that were filtered in previous versions of MSS. @bschilder
- [x] Check the %/# non-ballelic variants that were filtered using the latest version of MSS. @bschilder
@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?
@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?
Sure!
I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:
#MSS showing a lot is non-biallelic dbSNP 155 vs 144
sumstats_dt <- MungeSumstats::formatted_example()
rsids <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
ref_genome = "GRCH38",
dbSNP=144)
rsids155 <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
ref_genome = "GRCH38",
dbSNP=155)
# get chars for SNPs not bi/tri allelic
# or strand ambig from IUPAC_CODE_MAP
nonambig_IUPAC_CODE_MAP <-
names(Biostrings::IUPAC_CODE_MAP[nchar(
Biostrings::IUPAC_CODE_MAP
) < 3])
data.table::setkey(rsids, SNP)
data.table::setkey(rsids155, SNP)
bad_ids <- rsids[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids))
bad_ids155 <- rsids155[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids155))
print(bad_ids155)
There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on
I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:
Ok, well that at least narrows thing down. Thanks!
There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on
That would be great, thanks!
I randomly picked 3 SNPs from my example code above and checked them in NCBI dbSNP online portal (dbSNP 155) and they all appear to be non-biallelic as described:
I know this is by no means definitive but does seem like it is doing things correctly
Had a look into all the RS ID's from dbSNP 144 to check the proportion that are non-biallelic across the dbSNP 155 database and 144. I used an approach separate to MSS to ensure the issue isn't there. This approach was given by @hpages (see below).
It appears there is just a substantial number more non-biallelic SNPs in 155. This doesn't seem to be a MSS issue and just reflects the changes in the versions.
I'll post about this in https://github.com/hpages/SNPlocsForge in case anyone else has further insight but I believe the field's focus on biallelic SNPs may need to change as the number of non-biallelic SNPs increase. ~24% is a large proportion of SNPs to exclude.
I'll leave this issue open for now for others to see.
#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt",
required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)
library(BSgenome)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
pkgenv <- loadNamespace(pkgname)
snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
snps <- snpsById(snplocs, rsids, ifnotfound="drop")
unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
mcols(snps)$RefSNP_id)[rsids])
}
dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match N prop
#1: both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic 32266661 0.2382168183
#3: both non-biallelic 2565775 0.0189424855
#4: dbSNP 144 non-biallelic 13116 0.0000968322
Posted on Biostars about this issue and got some replies too - https://www.biostars.org/p/9534249/#9534468
Generally, seems like people agree that the observed numbers are true
I'm not conviced that dropping anything marked as multi-allelic in dbSNP is a sensible idea. As I understand it, the reasons for removing multi-allelic SNPs could be the following:
- The downstream tool cannot handle a table row defining a SNP with multiple alternate alleles.
- The downstream tool assumes all input is bi-allelic and calculates things such as allele frequency as
1 - effect_allele_frq
or uses a statistical model that would have assumptions violated by multi-allelic variants.
For point 1, it could well be valid to simply split the row into two rows f.e.g:
POS REF ALT
123 A G
123 A T
(Although in some cases having duplicated genomic location could be an issue.)
For point 2, bi-allelic vs multi-allelic should only matter within the dataset. If a study sample includes only two alleles at a genomic location, then it doesn't matter that dbSNP has a third allele on record from a different dataset, as this doesn't affect anything within the study.
I think those 2 points highly depend on the downstream analysis use case. I do agree though, I feel dropping non-bi-allelic SNPs is no longer a sensible option given the increasing numbers.
I believe the options of splitting multi-allelic SNPs where the ALT contains more than one alternative allele to multiple rows and removing multi-allelic SNPs where more than one of the alternative alleles are in the sumstats itself are both very sensible. If you would like to have a go at adding functionality to do these to MSS as alternative multi-allelic approaches and submit a pull request that would be great! I think it would really benefit the wider community (I'd be happy to add you to the list of contributors too).
OpenGWAS report: using dbSNP155
Here's a sampling of 10,863 of the most well-powered GWAS in OpenGWAS.
Absolute number
The number of non-biallelic SNPs dropped ranged from 177 to 8.6 million (with a mean of 1.1 million) for a single GWAS.
Percent
As a percentage of the total SNPs in each GWAS to begin with:
The average percentage of non-biallelic SNPS was 47%, ranging from ~0 to 63%.
Session info
Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale: [1] C
attached base packages: [1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 rtracklayer_1.57.0 scattermore_0.8
[4] R.methodsS3_1.8.2 SeuratObject_4.1.1 tidyr_1.2.0
[7] ggplot2_3.3.6 bit64_4.0.5 knitr_1.40
[10] irlba_2.3.5 DelayedArray_0.23.1 R.utils_2.12.0
[13] data.table_1.14.2 rpart_4.1.16 doParallel_1.0.17
[16] KEGGREST_1.37.3 RCurl_1.98-1.8 generics_0.1.3
[19] BiocGenerics_0.43.1 GenomicFeatures_1.49.6 RhpcBLASctl_0.21-247.1
[22] cowplot_1.1.1 RSQLite_2.2.17 RANN_2.6.1
[25] proxy_0.4-27 future_1.28.0 bit_4.0.4
[28] webshot_0.5.3 spatstat.data_2.2-0 xml2_1.3.3
[31] httpuv_1.6.5 SummarizedExperiment_1.27.2 assertthat_0.2.1
[34] orthogene_1.3.2 viridis_0.6.2 gargle_1.2.1
[37] scNLP_0.1.0 xfun_0.32 hms_1.1.2
[40] TSP_1.2-1 babelgene_22.3 evaluate_0.16
[43] promises_1.2.0.1 fansi_1.0.3 restfulr_0.0.15
[46] progress_1.2.2 readxl_1.4.1 dendextend_1.16.0
[49] caTools_1.18.2 dbplyr_2.2.1 igraph_1.3.4
[52] DBI_1.1.3 htmlwidgets_1.5.4 sparsesvd_0.2-1
[55] spatstat.geom_2.4-0 stats4_4.2.0 purrr_0.3.4
[58] ellipsis_0.3.2 dplyr_1.0.10 ggpubr_0.4.0
[61] backports_1.4.1 gprofiler2_0.2.1 aod_1.3.2
[64] biomaRt_2.53.2 deldir_1.0-6 MatrixGenerics_1.9.1
[67] MungeSumstats_1.5.13 vctrs_0.4.1 SingleCellExperiment_1.19.0
[70] Biobase_2.57.1 here_1.0.1 ROCR_1.0-11
[73] abind_1.4-5 cachem_1.0.6 BSgenome_1.65.2
[76] progressr_0.11.0 sctransform_0.3.4 GenomicAlignments_1.33.1
[79] treeio_1.21.2 prettyunits_1.1.1 goftest_1.2-3
[82] cluster_2.1.4 ExperimentHub_2.5.0 ontologyIndex_2.10
[85] ape_5.6-2 lazyeval_0.2.2 crayon_1.5.1
[88] pkgconfig_2.0.3 GenomeInfoDb_1.33.5 seriation_1.3.6
[91] nlme_3.1-159 ewceData_1.5.0 rlang_1.0.5
[94] globals_0.16.1 lifecycle_1.0.2 miniUI_0.1.1.1
[97] registry_0.5-1 filelock_1.0.2 BiocFileCache_2.5.0
[100] AnnotationHub_3.5.0 cellranger_1.1.0 rprojroot_2.0.3
[103] polyclip_1.10-0 matrixStats_0.62.0 lmtest_0.9-40
[106] Matrix_1.4-1 aplot_0.1.6 carData_3.0-5
[109] boot_1.3-28 zoo_1.8-10 ggridges_0.5.3
[112] png_0.1-7 viridisLite_0.4.1 rjson_0.2.21
[115] rootSolve_1.8.2.3 phenomix_0.99.4 bitops_1.0-7
[118] R.oo_1.25.0 KernSmooth_2.23-20 Biostrings_2.65.3
[121] blob_1.2.3 stringr_1.4.1 parallelly_1.32.1
[124] spatstat.random_2.2-0 rstatix_0.7.0 MAGMA.Celltyping_2.0.6
[127] gridGraphics_0.5-1 S4Vectors_0.35.3 ggsignif_0.6.3
[130] scales_1.2.1 memoise_2.0.1 magrittr_2.0.3
[133] plyr_1.8.7 ica_1.0-3 gplots_3.1.3
[136] zlibbioc_1.43.0 compiler_4.2.0 BiocIO_1.7.1
[139] RColorBrewer_1.1-3 lme4_1.1-30 fitdistrplus_1.1-8
[142] homologene_1.4.68.19.3.27 Rsamtools_2.13.4 cli_3.4.0
[145] XVector_0.37.1 listenv_0.8.0 patchwork_1.1.2
[148] pbapply_1.5-0 mgcv_1.8-40 MASS_7.3-58.1
[151] tidyselect_1.1.2 stringi_1.7.8 yaml_2.3.5
[154] ggrepel_0.9.1 GeneOverlap_1.33.0 grid_4.2.0
[157] VariantAnnotation_1.43.3 lmom_2.9 tools_4.2.0
[160] future.apply_1.9.0 parallel_4.2.0 rstudioapi_0.14
[163] RNOmni_1.0.1 piggyback_0.1.4 foreach_1.5.2
[166] gridExtra_2.3 gld_2.6.5 Rtsne_0.16
[169] HGNChelper_0.8.1 digest_0.6.29 BiocManager_1.30.18
[172] rgeos_0.5-9 shiny_1.7.2 Rcpp_1.0.9
[175] GenomicRanges_1.49.1 car_3.1-0 broom_1.0.1
[178] BiocVersion_3.16.0 later_1.3.0 RcppAnnoy_0.0.19
[181] httr_1.4.4 ggdendro_0.1.23 AnnotationDbi_1.59.1
[184] Rdpack_2.4 colorspace_2.0-3 XML_3.99-0.10
[187] fs_1.5.2 tensor_1.5 reticulate_1.26
[190] IRanges_2.31.2 splines_4.2.0 uwot_0.1.14
[193] yulab.utils_0.0.5 expm_0.999-6 tidytree_0.4.0
[196] spatstat.utils_2.3-1 sp_1.5-0 Exact_3.1
[199] ggplotify_0.1.0 plotly_4.10.0 xtable_1.8-4
[202] jsonlite_1.8.0 nloptr_2.0.3 ggtree_3.5.3
[205] heatmaply_1.3.0 ggfun_0.0.7 R6_2.5.1
[208] EWCE_1.5.7 pillar_1.8.1 htmltools_0.5.3
[211] mime_0.12 glue_1.6.2 fastmap_1.1.0
[214] minqa_1.2.4 BiocParallel_1.31.12 class_7.3-20
[217] interactiveDisplayBase_1.35.0 codetools_0.2-18 mvtnorm_1.1-3
[220] utf8_1.2.2 lattice_0.20-45 spatstat.sparse_2.1-1
[223] tibble_3.1.8 pbkrtest_0.5.1 curl_4.3.2
[226] DescTools_0.99.46 leiden_0.4.2 gtools_3.9.3
[229] survival_3.4-0 limma_3.53.6 rmarkdown_2.16
[232] googleAuthR_2.0.0 munsell_0.5.0 e1071_1.7-11
[235] GenomeInfoDbData_1.2.8 iterators_1.0.14 variancePartition_1.27.3
[238] reshape2_1.4.4 gtable_0.3.1 rbibutils_2.2.9
[241] spatstat.core_2.4-4 Seurat_4.1.1
Hi Murphy! I seem to face the same situation that a huge number of SNPs get removed for being non-biallelic. I was wondering if there is now an argument I can specify to keep the non-biallelic SNPs. Thanks in advance!
Sorry for my ignorance, non-biallelic SNPs can be kept by specifying the bi_allelic_filter
argument.
Hi!
I'm, sorry for writing in this old issue, but I am looking for a solution for this and saw the conversation in the SNPlocsForge Repo, which was concluded some time ago, but I failed to understand the solution. I have a high number of non bi-allelic SNP dropped when using MungeSumtats, with all up-to-date packages from Bioconductor 3.18? Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?
Thanks for your help and the great work on this package!
Cheers!
Hey,
using MungeSumtats, with all up-to-date packages from Bioconductor 3.18
Yes, this is the latest release version of MSS.
Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?
That's one solution, another one would be to keep non-bi-allelic SNPs (set bi_allelic_filter=FALSE
when running format_sumstats()
) - note this affects FRQ calculations if they need to be flipped etc. It really depends on what you plan on doing with the sumstats, some downstream tools like LDSC for example, will remove non-bi-allelic SNPs anyway (leaving just HapMap3 SNPs). Personally, I think you should keep non-bi-allelic SNPs unless you have a good reason to remove them.
Cheers, Alan.