
Plotting uses a huge amount of RAM, causing crashes and Out of Memory errors

MC-Dave opened this issue 1 year ago • 13 comments

Project Robyn

Describe issue

robyn_outputs is consuming a huge amount of RAM during plotting. We need the CSV outputs but have no use for the plots. The RAM spike happens while robyn_outputs is carrying out plotting, e.g. after it prints "Plotting X selected models on Y cores", and it is causing failures in our production systems. We have not found any way to disable the plot outputs while preserving the CSV outputs; the "export" parameter only lets one enable or disable both CSV outputs and plotting together.

Note: we currently limit our instances to use only 1/5th of the system's total cores. So if it has 20 cores, we only use 4.

On a system with 192GB of RAM and 48 cores, we use 9 cores. During training the system only uses ~2% of available RAM. Right at the end of the process, just after it prints "Plotting X selected models on Y cores" it jumps to 100% utilization and crashes the execution.
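For reference, the core-limiting convention described above can be sketched in R (the 1/5 ratio is our own convention, not a Robyn default; this assumes base R's parallel package):

```r
# Use at most 1/5th of the machine's logical cores (our convention).
# floor() + max() guard against ending up with 0 cores on small machines.
n_cores <- max(1, floor(parallel::detectCores() / 5))
# e.g. pass n_cores to robyn_run(..., cores = n_cores)
```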

Provide reproducible example

The issue is transient. Re-running a failed execution with the exact same inputs and datasets will succeed sometimes and fail others.

Environment & Robyn version

ROBYN VERSION R@fb3688a9ee9fe3a7836e6fea1ad386080a3fb00c Installed via remotes::install_github("facebookexperimental/Robyn/R@fb3688a9ee9fe3a7836e6fea1ad386080a3fb00c")

R Version 4.3.2

MC-Dave avatar Nov 10 '23 15:11 MC-Dave

Note: We only encounter this issue when running refresh jobs. When running a full train @ 5 trials, 5000 iterations, we never encounter the error. When running a refresh @ 5 trials, 5000 iterations, the issue occurs intermittently.

MC-Dave avatar Nov 14 '23 15:11 MC-Dave

Same issue here, cannot figure out the pattern of when it succeeds or fails.

ToddMinerTech avatar Nov 17 '23 15:11 ToddMinerTech

Sorry for the late reply. You can set the argument plot_pareto = FALSE in robyn_outputs() to deactivate the PNGs; see ?robyn_outputs. We'll look into the root cause in the future, but probably not very soon.
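For anyone landing here, a minimal sketch of the workaround (plot_pareto and export per ?robyn_outputs; the other arguments are the usual collect objects and are shown as placeholders):

```r
# Keep the CSV exports but skip the one-pager PNGs
OutputCollect <- robyn_outputs(
  InputCollect, OutputModels,
  plot_pareto = FALSE, # skips the per-model one-pager plots
  export = TRUE        # still writes the CSV outputs to plot_folder
)
```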

gufengzhou avatar Nov 30 '23 06:11 gufengzhou

I'm also being affected by this issue.

Worth noting I already use plot_pareto = FALSE when calling Robyn::robyn_refresh @gufengzhou
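For context, this is roughly how we call it (a sketch; the argument names other than plot_pareto are illustrative, and I'm assuming robyn_refresh() forwards plot_pareto to its internal robyn_outputs() call):

```r
RobynRefresh <- Robyn::robyn_refresh(
  json_file     = "RobynModel-2_131_3.json", # chain JSON from the previous build
  dt_input      = dt_input,                  # refreshed input data
  dt_holidays   = dt_holidays,
  refresh_steps = 4,                         # matches the 4-day rolling window in the logs
  plot_pareto   = FALSE                      # still runs out of memory for us
)
```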

Logs:

>>> Recreating model 2_131_3
Imported JSON file succesfully: RobynModel-2_131_3.json
>> Running feature engineering...
Input data has 760 days in total: 2020-11-01 to 2022-11-30
Refresh #3 model is built on rolling window of 700 day: 2020-12-13 to 2022-11-12
Rolling window moving forward: 4 days
>>> Calculating response curves for all models' media variables (14)...
Successfully recreated model ID: 2_131_3
>>> Building refresh model #4 in manual mode
>>> New bounds freedom: 0.57%
>> Running feature engineering...
Input data has 760 days in total: 2020-11-01 to 2022-11-30
Refresh #4 model is built on rolling window of 700 day: 2020-12-17 to 2022-11-16
Rolling window moving forward: 4 days
Fitting time series with all available data...
Using geometric adstocking with 53 hyperparameters (52 to iterate + 1 fixed) on 7 cores
>>> Starting 3 trials with 1000 iterations each using TwoPointsDE nevergrad algorithm...
  Running trial 1 of 3

  |
  |======================================================================|  99%

  Finished in 1.09 mins
  Running trial 2 of 3

  |
  |======================================================================|  99%

  Finished in 1.13 mins
  Running trial 3 of 3

  |
  |======================================================================|  99%

  Finished in 1.29 mins
>>> Running Pareto calculations for 3000 models on auto fronts...
Killed

richin13 avatar Feb 01 '24 14:02 richin13

Hello! Any updates here? This continues to be an issue, and increasing the task resources is not feasible (the OP is seeing the same error on a 192GB RAM system).

richin13 avatar May 02 '24 13:05 richin13

I've been using our standard dataset and can't really reproduce this issue, although I've heard from multiple sources that Windows users are seeing it more frequently. The outputs/plotting functions do consume more memory, regardless of refresh. I'm trying to test it on a larger dataset and will report back.

gufengzhou avatar May 03 '24 08:05 gufengzhou

So far I couldn't reproduce the issue. I'm on a Mac M1 Pro. I just tested with 15 media variables, using weibull adstock for more hyperparameters, ran it at 5k * 4 iterations, then refreshed it at 2k * 4 iterations. It ran through.

@richin13 what machine/ system are you using?

gufengzhou avatar May 07 '24 03:05 gufengzhou

@gufengzhou we're running in AWS ECS Fargate, but I was able to reproduce it on my local system (Ubuntu Linux 22.04, 16 GB of RAM, 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 4).

We're running Robyn 3.10.3, though, and I'm working on upgrading to 3.10.5 to see if that resolves it (maybe you're running 3.10.5, which would confirm the leak is fixed?)

richin13 avatar May 07 '24 15:05 richin13

I'm running on the latest 3.10.7. Please try and let me know.

gufengzhou avatar May 08 '24 02:05 gufengzhou

A bit of a tangent, but are you planning on cutting the 3.10.7 release any time soon? This is a prod system, so I'd be hesitant to install the version on master. We usually rely on whatever is published on the GitHub releases page, as we assume those are considered stable, but I'm not seeing 3.10.7 there.

richin13 avatar May 08 '24 17:05 richin13

@gufengzhou it seems like bumping to 3.10.5 fixes the memory leak, as the process no longer gets killed. However, I'm now getting a different error later in the process:

>>> Calculating clusters for model selection using Pareto fronts...
Couldn't automatically create clusters: Error: empty cluster: try a better set of initial centers
Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "NULL"
In addition: Warning messages:
1: In robyn_chain(json_file) :
  Can't replicate chain-like results if you don't follow Robyn's chain structure
2: In prophet_decomp(dt_transform, dt_holidays = InputCollect$dt_holidays,  :
  Currently, there's a known issue with prophet that may crash this use case.
 Read more here: https://github.com/facebookexperimental/Robyn/issues/472
3: In hyper_collector(InputCollect, hyper_in = InputCollect$hyperparameters,  :
  Provided train_size but ts_validation = FALSE. Time series validation inactive.
Error in clusterCollect$data : $ operator is invalid for atomic vectors
Calls: main ... same_src -> same_src.data.frame -> is.data.frame -> select
Execution halted

My guess is that, since these models were created with 3.10.3, they can no longer be refreshed using 3.10.5? Is that the case? Is there anything we can do to those models so they can be refreshed using 3.10.5? Thanks

richin13 avatar May 08 '24 18:05 richin13

Good to know the memory issue is gone. I'd strongly recommend updating to 3.10.7, which has been stable for most use cases and includes numerous fixes for refresh, incl. the chain error, the ts_validation error, etc. We'll push this version to the GitHub releases page as well as CRAN in a few weeks.
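For anyone following along, updating from GitHub would look something like this (the @v3.10.7 tag name is an assumption; pinning a commit SHA, as in the original report, also works):

```r
# Install the recommended version from GitHub (tag name assumed)
remotes::install_github("facebookexperimental/Robyn/R@v3.10.7")
# or pin a specific commit, as in the original report:
# remotes::install_github("facebookexperimental/Robyn/R@<commit-sha>")
```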

@laresbernardo FYI, the latest versions apparently no longer cause the memory issue on AWS.

gufengzhou avatar May 09 '24 03:05 gufengzhou

Will do, thanks!

richin13 avatar May 09 '24 13:05 richin13