
Best practices for tuning configs to avoid OOM issues

Jeffwan opened this issue 2 years ago · 3 comments

I created an issue (https://github.com/ludwig-ai/ludwig/issues/2154) in the past and got some clues from it. Now that I've gotten my hands dirty, I still find it hard.

I tried to read files remotely and noticed that the Ray dataset consumes a lot of memory, which can lead to OOM issues. Currently, each trial consumes 1 CPU, so the total number of available CPUs determines the parallelism. However, in this case it's hard to estimate the memory usage, and the compute pod (I use Kubernetes) is easily killed due to OOM.

It seems we can limit the parallelism by setting hyperopt.executor.max_concurrent_trials, but the side effect is that the remaining CPUs would sit idle?
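
For reference, this is the shape of the setting I mean (the value of 2 is just an illustrative example, not a recommendation):

hyperopt:
  executor:
    max_concurrent_trials: 2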

I am using CPUs to process and train on tabular data. Is there an easy formula to estimate the required resources?

Here's one case where I hit the OOM issue. Resources: 8 CPU cores, 8 GB memory. This is an example that reads rotten_tomatoes.csv (about 10 MB).

[screenshots attached]

Jeffwan · Sep 15 '22

Hey @Jeffwan, do you happen to know where in the Ludwig process the error occurred? In the screenshots I didn't see any stack trace from the Ludwig side.

8 CPUs and 8 GiB per node is pretty low. I typically run with about 3-4 GiB of RAM per CPU core. If you do need to run with this configuration of pod resources, you can try restricting the number of trials per node by increasing cpu_resources_per_trial to something like 4, so every trial has about half the RAM on the cluster to use. In Ludwig, this can be specified in the config as:

hyperopt:
  executor:
    cpu_resources_per_trial: 4
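
To spell out the arithmetic using the numbers from your setup: with 8 CPUs per node and cpu_resources_per_trial: 4, Ray Tune can schedule at most 8 / 4 = 2 trials per node, so each trial gets roughly 8 GiB / 2 = 4 GiB of RAM to work with.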

We are working on auto-detecting the appropriate resource requests for the job as part of some work we're doing on right-sizing jobs to Ray clusters. I'll loop in some of the folks working on that project for visibility.

@ShreyaR @geoffreyangus @jeffreyftang @jeffkinnison

tgaddair · Sep 16 '22

do you happen to know where in the Ludwig process the error occurred?

It failed in the ray tune phase, the Ludwig process can not allocate more memory. I gradually increase the limit to 3 GiB per worker and eventually succeed. (I use super small dataset, I didn't expect ray dataset in this case consume that many memory, dataset is 10MiB, it normally just consumes ~5x memory of the dataset in other cases we worked on) auto-detecting feature would be super helpful.
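
In case it helps anyone else on Kubernetes, a per-worker container resources block like the sketch below is what I mean by "3 GiB per worker" (the exact values, including the CPU request, are illustrative):

# Illustrative Kubernetes container resources for a Ray worker pod
resources:
  requests:
    cpu: "1"        # illustrative; match your per-trial CPU allocation
    memory: 3Gi
  limits:
    memory: 3Gi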

Jeffwan · Sep 17 '22

Actually, I ran into this issue as well; it's from Ray's side.

xxivk · Sep 12 '23