coloc
coloc copied to clipboard
prepare the data structure for coloc.abf and loop for multiple variants running together
Hi, Thanks for creating this package. I tried to run colocalization with GWAS sumstats and mQTL data extracted from GoDMC dataset. However, there're some issues about the data structure to run coloc.abf with multiple snps together in a loop. Here're the details of my datasets and problematic outputs from coloc.abf
• GWAS summary statistics as data frame
- subset the GWAS data based on p value threshold p< 1x10-5
- Select the variables for coloc.abf . then the dataset as below:
'data.frame': 41882 obs. of 8 variables: $ snp : chr "rs573419147" "rs557642992" "rs7454868" "rs574909700" ... $ chr : int 17 17 6 6 6 6 6 6 6 6 ... $ position : int 44327370 44365374 26799828 28832788 28833101 28840885 28843311 28958399 28961100 29266673 ... $ maf.gwas : num 0.787 0.769 0.914 0.922 0.929 ... $ beta.gwas : num 1.07 1.07 1.25 1.25 1.28 ... $ SE : num 0.0149 0.0142 0.0205 0.0173 0.0179 ... $ p.gwas : num 8.07e-06 8.11e-06 2.80e-27 3.86e-38 1.96e-42 ... $ varbeta.gwas: num 0.000223 0.000201 0.000422 0.000299 0.000322 ...
• mQTL data from GoDMC
- Select the variables for coloc.abf , then the dataset as below:
data.frame': 13234 obs. of 8 variables: $ snp.mqtl : chr "rs7074491" "rs7534848" "rs7545236" "rs1783551" ... $ name : chr "chr10:134813720:SNP" "chr1:169530093:SNP" "chr1:169530070:SNP" "chr11:75231212:SNP" ... $ cpg : chr "cg00320980" "cg16054275" "cg16054275" "cg15526825" ... $ beta.mqtl : num -0.349 0.411 0.411 0.496 0.45 ... $ se.mqtl : num 0.0093 0.011 0.011 0.0132 0.012 ... $ samplesize : num 21983 27748 27748 25694 26642 ... $ p.mqtl : num 2.38e-308 3.17e-308 3.17e-308 3.71e-308 4.17e-308 ... $ varbeta.mqtl: num 8.64e-05 1.20e-04 1.20e-04 1.75e-04 1.44e-04 ...
• Merge two datasets into a single one
- eventually, 44 SNPs overlapped between gwas and mqtl datasets
• make lists of lists as required by coloc.abf () without loop function
D1= list(beta=df$beta.gwas, varbeta=df$varbeta.gwas, MAF=df$maf.gwas, type="cc", s=40675/105318, N=105318, snp=df$snp)
The D1 looks as below:
List of 7 $ beta : num [1:44] 1.12 1.05 1.12 1.13 1.12 ... $ varbeta: num [1:44] 1.45e-04 9.29e-05 1.47e-04 1.45e-04 1.47e-04 ... $ MAF : num [1:44] 0.811 0.559 0.812 0.806 0.809 ... $ type : chr "cc" $ s : num 0.386 $ N : num 105318 $ snp : chr [1:44] "rs1104871" "rs11223635" "rs11967852" "rs11967935"
D2 =list(beta=df$beta.mqtl, varbeta=df$varbeta.mqtl, MAF=df$maf.gwas, type="quant", N=27750, snp=df$snp)
List of 6 $ beta : num [1:44] 0.1109 -0.0869 0.1123 0.1096 0.1095 ... $ varbeta: num [1:44] 1.13e-04 7.64e-05 1.09e-04 1.10e-04 1.11e-04 ... $ MAF : num [1:44] 0.811 0.559 0.812 0.806 0.809 ... $ type : chr "quant" $ N : num 27750 $ snp : chr [1:44] "rs1104871" "rs11223635" "rs11967852" "rs11967935" .
coloc.abf(D1,D2)
nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 44 0.000000e+00 6.668867e-72 0.000000e+00 1.000000e+00 3.177685e-52 However, I want to run the analysis on each SNP, and gives ~ different results per SNPs. So I wrote a loop as below Set datalists with loop for coloc.abf () function datalist1<- lapply(1:length(df$snp), function(i) { d1[[i]] = list(snp =df$snp[i],beta=df$beta.gwas[i],varbeta=df$varbeta.gwas[i], MAF=df$maf.gwas[i], type="cc", s=40675/105318, N=105318) })
datalist2 <- lapply(1:length(df$snp), function(i) { d2[[i]] = list(snp =df$snp[i],beta=df$beta.mqtl[i],varbeta=df$varbeta.mqtl[i], MAF=df$maf.gwas[i], type="quant", N=27750) })
res <- lapply(1:length(df$snp),function(i) coloc.abf(datalist1[[i]],datalist2[[i]])$summary) do.call("rbind", res)
However, it seems that the code only run coloc with the same SNP per row
nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
[1,] 1 0 4.108072e-22 0 0 1 [2,] 1 0 7.421940e-20 0 0 1 [3,] 1 0 1.642475e-23 0 0 1 [4,] 1 0 3.604846e-22 0 0 1 [5,] 1 0 6.957833e-22 0 0 1 ....... [40,] 1 0 1.937698e-23 0 0 1 [41,] 1 0 1.613592e-23 0 0 1 [42,] 1 0 1.571302e-23 0 0 1 [43,] 1 0 1.703198e-23 0 0 1 [44,] 1 0 7.755652e-23 0 0 1
Not sure if the issue cause by the data structures (it seemed fine as check_dataset function returned NULL) before running coloc.abf, or the loop function. Could you please provide more details about how to prepare the right datalists to coloc.abf for multiple SNPs and how to run multiple variants together? Many thanks!