MungeSumstats
`cannot find an open port. For manually specifying the port, see ?SnowParamUsing previously downloaded VCF.`
1. Bug description
`import_sumstats`: Works fine for 100s of GWAS, then encounters this error and quickly iterates through all remaining GWAS ids without actually processing them (and, strangely, appends their log files to that of the one that first encountered the error!).
This takes a very long time to reproduce (multiple days of running continuously). And it's not like the GWAS that was being analyzed at the time of the error was particularly large or anything ("only" 11M SNPs).
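One way to keep a single failure from cascading into the remaining ids is to isolate each id and undo any output diversion it left open. The sketch below is base R only and rests on an assumption I have not confirmed: that the logs get appended to the wrong file because a `sink()` diversion is still active after the error. `process_one` and `run_isolated` are hypothetical names, not MungeSumstats functions.

```r
# Hypothetical stand-in for the per-id work (NOT a MungeSumstats function).
# It diverts output to a log file and then fails for one id.
process_one <- function(id, log_file) {
  sink(log_file)                       # start logging for this id
  cat("processing", id, "\n")
  if (id == "bad-id") stop("cannot find an open port")
  sink()                               # normal exit: stop logging
  paste0(id, ".tsv.gz")
}

run_isolated <- function(ids, fun, log_dir = tempdir()) {
  results <- list()
  for (id in ids) {
    sinks_before <- sink.number()      # diversions active before this id
    log_file <- file.path(log_dir, paste0(id, "_log.txt"))
    results[[id]] <- tryCatch(
      list(ok = TRUE, value = fun(id, log_file)),
      error = function(e) list(ok = FALSE, value = conditionMessage(e)),
      finally = {
        # Close only the diversions opened while handling this id, so the
        # next id cannot write into this id's log file.
        while (sink.number() > sinks_before) sink()
      }
    )
  }
  results
}

res <- run_isolated(c("id-1", "bad-id", "id-2"), process_one)
```

Here `res[["bad-id"]]$ok` is `FALSE` with the error message stored next to it, while the ids before and after it are processed normally and each write to their own log file.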
Possible explanations
- Multiple users on our private cloud are accidentally trying to use the same threads at the same time, and `BiocParallel` can't handle this gracefully?
- The virtual machine becomes temporarily disconnected from its dedicated resources. Perhaps a question for @eduff
- `data.table` is trying to run in parallel within each loop of `read_vcf_parallel` (which is also being run in parallel), causing a conflict with the same cores being requested for different tasks at once. Though I don't know why this wouldn't happen far earlier when processing 100s of GWAS.
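To test the nested-parallelism explanation, one option is to cap the number of `data.table` threads so that workers times threads never exceeds the available cores. This is only a sketch: `threads_per_worker` is a hypothetical helper, and the core count of 32 is an invented example value, not our VM's actual spec.

```r
# Hypothetical helper: how many data.table threads each worker may use so
# that n_workers * threads never exceeds the cores available.
threads_per_worker <- function(n_workers,
                               total_cores = parallel::detectCores()) {
  if (is.na(total_cores)) total_cores <- 1L   # detectCores() can return NA
  max(1L, as.integer(floor(total_cores / n_workers)))
}

# Example values only: 30 workers on an assumed 32-core machine.
n_dt <- threads_per_worker(n_workers = 30, total_cores = 32)

# Apply the cap only if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  old <- data.table::setDTthreads(n_dt)   # returns the previous setting
  message("data.table threads now: ", data.table::getDTthreads())
  data.table::setDTthreads(old)           # restore the previous setting
}
```

Note that the setting is per R process, so to have any effect on socket-based workers it would need to be applied inside each worker rather than only in the main session.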
`read_vcf_parallel`:
It seems to occur at `read_vcf_parallel`. This function seems to be rather finicky as it also doesn't like it when I specify >30 threads, though I suspect that's for a different reason (splitting a VCF across too many threads means that if some genome tiles are empty, the whole loop breaks, perhaps at the final re-merging step).
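If empty tiles really are what breaks the >30-thread case (a guess on my part, not something I've verified in the `read_vcf_parallel` source), the re-merging step could drop empty results before binding. A toy sketch in base R, with made-up tile data and a hypothetical `merge_tiles` helper:

```r
# Toy per-tile results: tile 2 is empty (NULL) and tile 4 has zero rows.
tiles <- list(
  data.frame(SNP = c("rs1", "rs2"), P = c(0.01, 0.20)),
  NULL,
  data.frame(SNP = "rs3", P = 0.05),
  data.frame(SNP = character(0), P = numeric(0))
)

# Hypothetical merge step: keep only tiles that actually contain variants,
# then row-bind whatever is left.
merge_tiles <- function(tiles) {
  keep <- Filter(function(x) !is.null(x) && nrow(x) > 0, tiles)
  if (length(keep) == 0) return(data.frame())
  do.call(rbind, keep)
}

merged <- merge_tiles(tiles)
```

With the toy input, `merged` holds the three variants from the two non-empty tiles, and an input made up entirely of empty tiles yields an empty data frame instead of an error.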
Related Issues
`BiocParallel`:
- https://github.com/Bioconductor/BiocParallel/pull/187
- https://github.com/Bioconductor/BiocParallel/issues/85
- https://github.com/Bioconductor/BiocParallel/issues/106
Also, not sure if I'm the only one, but `BiocParallel` can be a bit trickier to use successfully.
Console output
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate columns.
1 sample detected: ubm-a-129
Constructing ScanVcfParam object.
VCF contains: 11,734,353 variant(s) x 1 sample(s)
Reading VCF file: multi-threaded (30 threads)
failed to open the port 11221, trying a new port...
failed to open the port 11596, trying a new port...
failed to open the port 11982, trying a new port...
failed to open the port 11329, trying a new port...
failed to open the port 11700, trying a new port...
cannot find an open port. For manually specifying the port, see ?SnowParamUsing previously downloaded VCF.
Formatted summary statistics will be saved to ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-81/ubm-a-81.tsv.gz
Log data to be saved to ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-81/logs
Saving output messages to:
/shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-81/logs/MungeSumstats_log_msg.txt
Any runtime errors will be saved to:
/shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-81/logs/MungeSumstats_log_output.txt
Messages will not be printed to terminal.
all connections are in useUsing previously downloaded VCF.
Formatted summary statistics will be saved to ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-93/ubm-a-93.tsv.gz
Log data to be saved to ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-93/logs
Saving output messages to:
/shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-93/logs/MungeSumstats_log_msg.txt
Any runtime errors will be saved to:
/shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-93/logs/MungeSumstats_log_output.txt
Messages will not be printed to terminal.
...
...
...
Full logs file: ubm-a-129_log_msg.txt
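The ports tried in the output above all fall in the 11000-11999 range, and the error points to `?SnowParam` for setting the port manually. Below is a rough sketch of picking a free port up front and handing it to `SnowParam` through its `manager.port` argument. `find_free_port` is a hypothetical helper, `serverSocket()` needs R >= 4.0.0, and I'm not assuming that `import_sumstats` offers a way to pass such a param through.

```r
# Hypothetical helper: probe candidate ports by briefly opening a listening
# socket on each, and return the first one that can be opened.
find_free_port <- function(candidates = 11000:11999, max_tries = 50L) {
  n <- min(max_tries, length(candidates))
  for (port in candidates[sample.int(length(candidates), n)]) {
    con <- tryCatch(
      suppressWarnings(serverSocket(port)),   # errors if the port is taken
      error = function(e) NULL
    )
    if (!is.null(con)) {
      close(con)                              # release it again right away
      return(as.integer(port))
    }
  }
  stop("no free port found among the candidates tried")
}

port <- find_free_port()

# Only build the param if BiocParallel is installed; workers = 2 is an
# arbitrary example value.
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  param <- BiocParallel::SnowParam(workers = 2, manager.port = port)
}
```

There is an unavoidable race here: another process can grab the port between the probe and the moment the cluster starts, so this lowers the odds of a clash rather than eliminating it. It also won't help if the real problem is that R has run out of connections, as the later "all connections are in use" message suggests.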
Expected behaviour
Process all sumstats.
2. Reproducible example
Code
meta <- MungeSumstats::find_sumstats(subcategories = c("neurological","Immune","cardio"))
gwas_paths <- MungeSumstats::import_sumstats(
    ids = meta$id[1:400],
    save_dir = here::here("data/GWAS_sumstats"),
    nThread = 30, # >30 causes issues with read_vcf_parallel
    parallel_across_ids = FALSE,
    force_new_vcf = FALSE,
    force_new = FALSE,
    vcf_download = TRUE,
    vcf_dir = here::here("data/VCFs"),
    ### axel will keep trying forever if the URL doesn't exist (or is private)
    # download_method = "axel",
    #### Record logs
    log_folder_ind = TRUE,
    log_mungesumstats_msgs = TRUE
)
3. Session info
R Under development (unstable) (2022-02-25 r81808)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomeInfoDb_1.33.3 IRanges_2.31.0 S4Vectors_0.35.1 BiocGenerics_0.43.1
[5] dplyr_1.0.9 ggplot2_3.3.6 data.table_1.14.2 MungeSumstats_1.5.5
[9] MAGMA.Celltyping_2.0.6
loaded via a namespace (and not attached):
[1] utf8_1.2.2 R.utils_2.12.0
[3] tidyselect_1.1.2 lme4_1.1-30
[5] RSQLite_2.2.15 AnnotationDbi_1.59.1
[7] htmlwidgets_1.5.4 grid_4.2.0
[9] BiocParallel_1.31.10 munsell_0.5.0
[11] codetools_0.2-18 withr_2.5.0
[13] colorspace_2.0-3 Biobase_2.57.1
[15] filelock_1.0.2 knitr_1.39
[17] rstudioapi_0.13 orthogene_1.3.1
[19] SingleCellExperiment_1.19.0 ggsignif_0.6.3
[21] MatrixGenerics_1.9.1 GenomeInfoDbData_1.2.8
[23] bit64_4.0.5 rprojroot_2.0.3
[25] vctrs_0.4.1 treeio_1.21.0
[27] generics_0.1.3 xfun_0.31
[29] BiocFileCache_2.5.0 R6_2.5.1
[31] bitops_1.0-7 cachem_1.0.6
[33] gridGraphics_0.5-1 DelayedArray_0.23.1
[35] assertthat_0.2.1 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
[37] promises_1.2.0.1 BiocIO_1.7.1
[39] scales_1.2.0 gtable_0.3.0
[41] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22 SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.22
[43] rlang_1.0.4 splines_4.2.0
[45] rtracklayer_1.57.0 rstatix_0.7.0
[47] lazyeval_0.2.2 gargle_1.2.0
[49] broom_1.0.0 BiocManager_1.30.18
[51] yaml_2.3.5 reshape2_1.4.4
[53] abind_1.4-5 GenomicFeatures_1.49.5
[55] backports_1.4.1 httpuv_1.6.5
[57] tools_4.2.0 ggplotify_0.1.0
[59] ellipsis_0.3.2 ggdendro_0.1.23
[61] Rcpp_1.0.9 plyr_1.8.7
[63] progress_1.2.2 zlibbioc_1.43.0
[65] purrr_0.3.4 RCurl_1.98-1.8
[67] prettyunits_1.1.1 ggpubr_0.4.0
[69] GenomicFiles_1.33.1 BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
[71] SummarizedExperiment_1.27.1 fs_1.5.2
[73] here_1.0.1 magrittr_2.0.3
[75] matrixStats_0.62.0 hms_1.1.1
[77] patchwork_1.1.1 mime_0.12
[79] evaluate_0.15 xtable_1.8-4
[81] XML_3.99-0.10 EWCE_1.5.5
[83] gridExtra_2.3 compiler_4.2.0
[85] biomaRt_2.53.2 tibble_3.1.8
[87] crayon_1.5.1 minqa_1.2.4
[89] R.oo_1.25.0 htmltools_0.5.3
[91] ggfun_0.0.6 later_1.3.0
[93] tidyr_1.2.0 aplot_0.1.6
[95] DBI_1.1.3 ExperimentHub_2.5.0
[97] gprofiler2_0.2.1 dbplyr_2.2.1
[99] MASS_7.3-58 rappdirs_0.3.3
[101] boot_1.3-28 babelgene_22.3
[103] Matrix_1.4-1 car_3.1-0
[105] cli_3.3.0 R.methodsS3_1.8.2
[107] parallel_4.2.0 SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20
[109] GenomicRanges_1.49.0 pkgconfig_2.0.3
[111] SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20 GenomicAlignments_1.33.1
[113] plotly_4.10.0 xml2_1.3.3
[115] ggtree_3.5.1 XVector_0.37.0
[117] yulab.utils_0.0.5 stringr_1.4.0
[119] VariantAnnotation_1.43.2 digest_0.6.29
[121] Biostrings_2.65.1 rmarkdown_2.14
[123] HGNChelper_0.8.1 tidytree_0.3.9
[125] restfulr_0.0.15 curl_4.3.2
[127] shiny_1.7.2 Rsamtools_2.13.3
[129] rjson_0.2.21 nloptr_2.0.3
[131] lifecycle_1.0.1 nlme_3.1-158
[133] jsonlite_1.8.0 carData_3.0-5
[135] viridisLite_0.4.0 limma_3.53.5
[137] BSgenome_1.65.2 fansi_1.0.3
[139] pillar_1.8.0 lattice_0.20-45
[141] homologene_1.4.68.19.3.27 KEGGREST_1.37.3
[143] fastmap_1.1.0 httr_1.4.3
[145] googleAuthR_2.0.0 interactiveDisplayBase_1.35.0
[147] glue_1.6.2 RNOmni_1.0.0
[149] png_0.1-7 ewceData_1.5.0
[151] BiocVersion_3.16.0 bit_4.0.4
[153] stringi_1.7.8 blob_1.2.3
[155] AnnotationHub_3.5.0 memoise_2.0.1
[157] ape_5.6-2