
Warning message: In quit... system call failed: Cannot allocate memory

ctlamb opened this issue 6 years ago · 15 comments

Some of my nodes are failing with this error:

Warning message: In quit(save = "yes", status = workerErrorStatus, runLast = FALSE) : system call failed: Cannot allocate memory

Does this mean I need a CPU with more memory?

ctlamb avatar Nov 06 '18 18:11 ctlamb

Hi @ctlamb

Yes, this means you will need a VM with more memory. I suggest measuring the memory usage of each task so you have a benchmark for choosing an Azure VM size.
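One way to get that benchmark (and the column names in the readout posted later in this thread suggest this is what was used) is the `peakRAM` package from CRAN, which reports elapsed time and peak memory per expression. A minimal sketch with a placeholder workload; swap in one representative task body:

```r
# Sketch: measure the peak memory of a candidate task locally before
# picking an Azure VM size. peakRAM() evaluates each expression and
# returns a data frame with Elapsed_Time_sec, Total_RAM_Used_MiB and
# Peak_RAM_Used_MiB per call.
# install.packages("peakRAM")
library(peakRAM)

mem <- peakRAM(
  x <- rnorm(1e7),  # placeholder workload, ~76 MiB of doubles
  s <- sum(x)
)
print(mem)
```

Size the VM with comfortable headroom above the largest `Peak_RAM_Used_MiB`, since the worker process itself also needs memory.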

Thanks Brian

brnleehng avatar Nov 07 '18 03:11 brnleehng

Excellent, will do. In the meantime, I tried a machine with slightly more memory ("vmSize" = "Standard_E4_v3"), but I am running into the following error after I run foreach (this error doesn't occur with "vmSize" = "Standard_DS12_v2"):

##Error: No automatic parser available for 7b/.

ctlamb avatar Nov 07 '18 14:11 ctlamb

What region are you in? It could be possible that Standard_E4_v3 is not available in your region. Is this happening during makeCluster?

brnleehng avatar Nov 07 '18 20:11 brnleehng

I'm in West US. The error is thrown in foreach.

It looks like my tasks only use a max of 8 GB of RAM, so the 28 GB of RAM in the "Standard_DS12_v2" should've been plenty. Hmm, not sure what's going on here.

Memory usage readout:

```
#  Function_Call                                                                                                          Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1  doAzureParallel::setCredentials(credentials)                                                                                      0.005                0.0               0.0
2  mod<-mod.files$FilePath[bp$model[i]]                                                                                              0.000                0.0               0.0
3  tile<-r.files$FilePath[bp$tile[i]]                                                                                                0.000                0.0               0.0
4  doAzureParallel::getStorageFile(container="occmodels",blobPath=paste0(mod),downloadPath=paste0(mod),overwrite=TRUE)              49.721              190.6             190.6
5  brt<-readRDS(paste0(mod))                                                                                                        10.233              665.8             665.8
6  doAzureParallel::getStorageFile(container="rastertiles",blobPath=paste0(tile),downloadPath=paste0(tile),overwrite=TRUE)         496.358             1996.0            1996.0
7  unzip(paste0(tile),exdir=here::here(),junkpaths=TRUE,overwrite=TRUE)                                                             27.612                0.0               0.0
8  raster_data<-list.files(here::here(),pattern=".tif$",full.names=TRUE)                                                             0.150                0.0               0.0
9  STACK<-raster::stack(raster_data)                                                                                                 2.337                0.3               6.0
10 STACK[["CutBlock_Occurrence"]]<-ratify(STACK[["CutBlock_Occurrence"]])                                                            5.092                0.0            1161.7
11 STACK[["Fire_Occ"]]<-ratify(STACK[["Fire_Occ"]])                                                                                  5.012                0.0            1161.7
12 STACK[["CRDP_LC"]]<-ratify(STACK[["CRDP_LC"]])                                                                                    5.132                0.0            1161.7
13 STACK[["MODIS_LC"]]<-ratify(STACK[["MODIS_LC"]])                                                                                  4.990                0.0            1161.7
14 pred<-dismo::predict(STACK,brt,n.trees=brt$gbm.call$best.trees,type="response")                                               22156.271              387.8            8056.5
15 return(pred)                                                                                                                      0.000                0.0               0.0
```

ctlamb avatar Nov 07 '18 23:11 ctlamb

Are you setting maxTasksPerNode greater than 1 in your cluster configuration?

brnleehng avatar Nov 08 '18 16:11 brnleehng

No, it's set to 1:

```r
clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_DS12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list("min" = 1, "max" = 200),
    "lowPriorityNodes" = list("min" = 0, "max" = 0),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = c("doParallel", "here", "dismo", "gbm", "snow"),
    "github" = c("Azure/doAzureParallel"),
    "bioconductor" = c()
  ),
  "commandLine" = list()
)
```
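For context, a sketch of how a configuration like this is typically wired into foreach (assuming the list above has been saved to `cluster.json` and a `credentials.json` exists; the file names and task body here are placeholders):

```r
# Sketch: provision the Batch pool and register it as the foreach backend.
library(doAzureParallel)

setCredentials("credentials.json")       # placeholder credentials file
cluster <- makeCluster("cluster.json")   # placeholder cluster config file
registerDoAzureParallel(cluster)

results <- foreach(i = 1:10) %dopar% {
  sqrt(i)                                # placeholder task body
}

stopCluster(cluster)
```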

ctlamb avatar Nov 08 '18 16:11 ctlamb

Is there a better/preferred package I could use to measure the memory usage?

ctlamb avatar Nov 08 '18 16:11 ctlamb

Now I'm getting `Error: No automatic parser available for 7b/.` even when I use the D12 machine. Ugh, it's always hard to troubleshoot one issue (memory) when another pops up. Any thoughts? I could start a new thread if it's easier.

ctlamb avatar Nov 08 '18 23:11 ctlamb

I don't have a preferred package for measuring memory usage. Where exactly is this error occurring? Is it happening when foreach is collecting results?

If you have a cluster configuration file and a reproducible sample, I will work on identifying the issue.

brnleehng avatar Nov 09 '18 21:11 brnleehng

This is the same as issue #315. I've spent many an hour pulling my hair out over this issue and I've no idea what's causing it. I've provided a lot of qualitative information in #315 but haven't had time to build a fully reproducible example at the scale which I think is generating the error.

@ctlamb is your workflow using resource files uploaded to Azure storage? My workflow is and I haven't been able to determine whether the 7b error still occurs when not using resource files. I'd like to attempt to rule out whether resource files could be contributing in some way.

simon-tarr avatar Nov 16 '18 16:11 simon-tarr

Yes, I am uploading and downloading data to Azure storage in my workflow. I do wonder if this is an internet issue? My internet speed was recently upgraded and I haven't hit the 7b error since... but that's only based on 5-10 tries so far. Will update if anything changes.

ctlamb avatar Nov 16 '18 16:11 ctlamb

> Yes, I am uploading and downloading data to Azure storage in my workflow. I do wonder if this is an internet issue? My internet speed was recently upgraded and I haven't hit the 7b error since... but that's only based on 5-10 tries so far. Will update if anything changes.

Thanks for the extra information. My latest post at #315 documents the return of the dreaded 7b error.

I considered your idea here as well. However, my university network is a gigabit connection and it's rock solid. My home internet is a 100 Mb fibre connection, which is also super reliable (for the most part).

I wonder if there's a limit to the number of connections Batch/httr can accept from a single IP address? I'm currently running two pools on my laptop (home network) and three on my uni workstation, and they've all been stable all day. If I try to run any more pools than this on either machine, the 7b error returns almost instantly. It's very strange...

simon-tarr avatar Nov 16 '18 16:11 simon-tarr

Are all of your workflows in interactive mode? (Waiting for the job to be done)

Thanks, Brian

brnleehng avatar Nov 16 '18 17:11 brnleehng

Mine is, yes.

simon-tarr avatar Nov 16 '18 17:11 simon-tarr

Any news on the status of this error? It's still happening to me with frustrating regularity.

Thanks!

simon-tarr avatar Dec 10 '18 11:12 simon-tarr