
Error generating azbatchenv rds file

fermumen opened this issue 3 years ago · 3 comments

I got a bit of a weird error trying to run some code on Azure Batch that was working correctly with regular doParallel. This is the job's stderr:

running '/usr/local/lib/R/bin/R --no-echo --no-restore --no-save --no-environ --no-restore --no-site-file --file=/mnt/batch/tasks/workitems/job20210326153929/job-1/jobpreparation/wd/worker.R --args 10 10 0 pass'
Error in readRDS(paste0(batchJobPreparationDirectory, "/", batchJobEnvironment)) :
  error reading from connection
Execution halted

I've downloaded the job.rds from Azure Blob Storage and indeed I can't read it on my computer either. How could I troubleshoot this?
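One way to narrow this down locally (a sketch, not specific to doAzureParallel; paths and objects are illustrative) is to reproduce the symptom: calling readRDS() on a truncated .rds file typically fails with the same kind of connection read error seen in the job's stderr, which would point at an interrupted upload or download rather than a problem in the R code itself.

```r
# Reproduce the symptom locally: readRDS() on a truncated .rds file
# fails with a connection read error, like the one in the job stderr.
path <- tempfile(fileext = ".rds")
saveRDS(mtcars, path)

# Truncate the file to half its size, simulating an interrupted
# upload or download.
bytes <- readBin(path, "raw", n = file.size(path))
writeBin(bytes[seq_len(length(bytes) %/% 2)], path)

res <- tryCatch(readRDS(path), error = function(e) conditionMessage(e))
print(res)  # a read error message rather than the saved object
```

Comparing the byte size of the downloaded job.rds against the blob's reported size would show whether the file was cut short.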

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] mlflow_1.14.0 AzureBatchUtils_0.1.0 doParallel_1.0.16
[4] iterators_1.0.13 foreach_1.5.1 yardstick_0.0.7
[7] workflows_0.2.1 tune_0.1.3 tidyr_1.1.2
[10] tibble_3.1.0 rsample_0.0.9 recipes_0.1.15
[13] purrr_0.3.4 parsnip_0.1.5 modeldata_0.1.0
[16] infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.4
[19] dials_0.0.9 scales_1.1.1 broom_0.7.5
[22] tidymodels_0.1.2

loaded via a namespace (and not attached):
[1] bitops_1.0-6 lubridate_1.7.9.2 DiceDesign_1.9
[4] httr_1.4.2 tools_4.0.4 backports_1.2.1
[7] utf8_1.1.4 R6_2.5.0 rpart_4.1-15
[10] DBI_1.1.0 colorspace_2.0-0 nnet_7.3-15
[13] withr_2.3.0 tidyselect_1.1.0 processx_3.4.5
[16] curl_4.3 compiler_4.0.4 cli_2.3.1
[19] swagger_3.33.1 forge_0.2.0 askpass_1.1
[22] stringr_1.4.0 digest_0.6.27 ini_0.3.1
[25] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.0
[28] parallelly_1.22.0 lhs_1.1.1 fastmap_1.0.1
[31] rlang_0.4.10 doAzureParallel_0.8.0 rstudioapi_0.13
[34] shiny_1.5.0 generics_0.1.0 hwriter_1.3.2
[37] jsonlite_1.7.2 RCurl_1.98-1.3 magrittr_2.0.1
[40] Matrix_1.3-2 Rcpp_1.0.5 munsell_0.5.0
[43] fansi_0.4.1 GPfit_1.0-8 reticulate_1.18
[46] lifecycle_0.2.0 furrr_0.2.2 stringi_1.5.3
[49] yaml_2.2.1 pROC_1.16.2 snakecase_0.11.0
[52] MASS_7.3-53.1 plyr_1.8.6 grid_4.0.4
[55] listenv_0.8.0 promises_1.1.1 crayon_1.3.4
[58] lattice_0.20-41 splines_4.0.4 zeallot_0.1.0
[61] ps_1.5.0 pillar_1.5.0 ranger_0.12.1
[64] uuid_0.1-4 Rserve_1.8-7 rjson_0.2.20
[67] codetools_0.2-18 glue_1.4.2 rAzureBatch_0.7.0
[70] data.table_1.13.4 vctrs_0.3.5 httpuv_1.5.4
[73] gtable_0.3.0 openssl_1.4.3 future_1.21.0
[76] assertthat_0.2.1 TeachingDemos_2.10 gower_0.2.2
[79] mime_0.9 prodlim_2019.11.13 xtable_1.8-4
[82] later_1.1.0.1 class_7.3-18 survival_3.2-7
[85] timeDate_3043.102 SparkR_3.1.0 lava_1.6.8.1
[88] globals_0.14.0 ellipsis_0.3.1 hwriterPlus_1.0-3
[91] ipred_0.9-9

fermumen avatar Mar 26 '21 16:03 fermumen

I've tried the same code with just a subset of the data (~10%) and it seems to work correctly. Is there a limit on how much data can be uploaded to storage from doAzureParallel?
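The job environment appears to be serialized to an .rds file and uploaded to Blob Storage (the stderr shows readRDS() being called on it), so a very large exported object makes the transfer more likely to fail or truncate. A quick way to estimate the payload before submitting is to serialize the object locally; a sketch, where `big_df` is a hypothetical stand-in for the exported data frame:

```r
# Sketch: estimate the size of the serialized payload before submitting.
# `big_df` is a stand-in for the data frame exported to the workers.
big_df <- data.frame(x = runif(1e5), y = runif(1e5))

payload <- serialize(big_df, connection = NULL)
cat(sprintf("serialized size: %.2f MB\n", length(payload) / 1024^2))
```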

fermumen avatar Mar 30 '21 10:03 fermumen

Hi @fermumen,

Does the foreach loop finish without any errors? Also, are you using the error handling option?
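The error handling option here presumably refers to foreach's `.errorhandling` argument, which controls whether a failing iteration aborts the loop or is passed through. A minimal local illustration, using plain `%do%` so it runs without a Batch pool:

```r
library(foreach)

# With .errorhandling = "pass", a failing iteration returns the error
# object as its result instead of aborting the whole loop.
res <- foreach(i = 1:3, .errorhandling = "pass") %do% {
  if (i == 2) stop("boom") else i
}

sapply(res, inherits, what = "error")  # FALSE  TRUE FALSE
```

With `"stop"` (the default) the first error aborts the loop; with `"remove"` failed iterations are dropped from the result.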

Thanks, Brian

brnleehng avatar Apr 02 '21 17:04 brnleehng

Hi, all the jobs finish with errors, but I think the failure happens in the job preparation stage. I have tried filtering the data frame down to ~60% of its size with different random samples and it works as it should; it's only when I use the full dataset (~900k observations) that it fails. The code I'm running is a tuning grid, which uses %dopar% internally:

library(doAzureParallel)

# Create a 25-node low-priority pool with the required CRAN packages
# (make_azbatch_cluster() comes from the attached AzureBatchUtils package)
cl <- make_azbatch_cluster("rf_pool3",
                           cran_libraries = c("ranger", "tidymodels"),
                           CPU = 4, tasks_per_node = 1,
                           low_priority_nodes = list(min = 25,
                                                     max = 25))
registerDoAzureParallel(cl)

# tune_grid() parallelises over %dopar%, so it runs on the Batch pool
esc_grid_results <- esc_workflow %>%
  tune_grid(resamples,
            grid = esc_grid,
            control = tune::control_grid(verbose = TRUE,
                                         parallel_over = "everything"))

stopCluster(cl)

Maybe I can try to generate a randomised example for you to reproduce.
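A randomised stand-in with the same row count might be enough to reproduce the payload size without sharing the real data. A sketch, with invented column names, that also checks the file reads back cleanly after a local round trip:

```r
# Sketch: synthetic data frame with ~900k rows and invented columns,
# to mimic the size of the real dataset without sharing it.
set.seed(42)
n <- 900000L
fake <- data.frame(
  id   = seq_len(n),
  num1 = rnorm(n),
  num2 = runif(n),
  cat1 = factor(sample(letters[1:5], n, replace = TRUE))
)

# Round-trip through an .rds file, the same storage format the job
# environment uses, and confirm it reads back intact.
path <- tempfile(fileext = ".rds")
saveRDS(fake, path)
identical(nrow(readRDS(path)), n)  # TRUE
```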

fermumen avatar Apr 07 '21 10:04 fermumen