
Memory usage during training

Open RafaD5 opened this issue 4 years ago • 13 comments

Hi,

I've trained several models with mode="Perform", and when the training gets to a certain point the Python process is killed because of memory usage (I'm using a computer with 16 GB). What I do is rerun the script and change the model_name to the name of the model just created in order to resume training. A couple of times I've had to repeat this process twice. It is not caused by a single model but by data from previous, already-trained models that is not released from memory. (screenshot of memory usage attached)
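A rough sketch of that resume workflow, assuming results_path is the constructor argument the poster refers to as the model name (mljar-supervised reloads an existing results directory and carries on from the models already saved there); the data loading is a placeholder:

```python
from supervised.automl import AutoML
import pandas as pd

# Placeholder data loading; substitute your own training set.
X = pd.read_csv("train.csv")
y = X.pop("target")

# Point results_path at the directory left behind by the crashed run
# so AutoML picks up the already-trained models instead of starting over.
automl = AutoML(mode="Perform", results_path="AutoML_1")
automl.fit(X, y)
```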

RafaD5 · Apr 20 '21 02:04

Hey @RafaD5! Looks like a bug. I'm pretty sure that the data between different folds and models should be cleared. Do you observe the same behavior in Compete mode? You can set validation_strategy={"validation_type": "kfold", "k_folds": 5, "shuffle": True} to get the same CV as in Perform mode.
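A minimal sketch of that suggestion; the data loading is a placeholder, and the validation_strategy dictionary is taken verbatim from the comment above:

```python
from supervised.automl import AutoML
import pandas as pd

X = pd.read_csv("train.csv")  # placeholder dataset
y = X.pop("target")

automl = AutoML(
    mode="Compete",
    validation_strategy={
        "validation_type": "kfold",
        "k_folds": 5,
        "shuffle": True,
    },
)
automl.fit(X, y)
```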

pplonski · Apr 20 '21 07:04

@pplonski I have run into the same situation several times. There is a memory leak.

xuzhang5788 · Apr 23 '21 05:04

@xuzhang5788 was it in Perform mode or another one?

pplonski · Apr 23 '21 05:04

Compete and Optuna modes. In my case, within one notebook, memory accumulates after I run several automl.fit calls until the kernel gets killed. I have to restart the kernel for every new training.

xuzhang5788 · Apr 23 '21 05:04

@xuzhang5788 thank you, I will work on it. Any help appreciated! :)

pplonski · Apr 23 '21 06:04

@RafaD5 @xuzhang5788 I made a few changes:

  • The data is no longer stored in files during AutoML; I just keep a copy in memory. It looks like the save/load of data to files created copies that were never cleared. Keeping the data directly in RAM should be faster and use less memory because nothing leaks during save/load.
  • I added explicit del statements on datasets everywhere, followed by gc.collect() (see the sketch after this list).
  • I opened an issue for LightGBM because it looks like it does not release memory after training and del.
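For illustration, the cleanup pattern in the second bullet looks roughly like this; it is a generic sketch, not the actual mljar-supervised source:

```python
import gc

def train_fold(model, X_train, y_train, X_validation, y_validation):
    model.fit(X_train, y_train)
    score = model.score(X_validation, y_validation)

    # Drop this function's references to the fold data and ask the garbage
    # collector to reclaim them before the next fold allocates its copies.
    # (Objects still referenced by the caller are, of course, kept alive.)
    del X_train, y_train, X_validation, y_validation
    gc.collect()

    return score
```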

All changes are in the dev branch. You can install it with:

pip install -q -U git+https://github.com/mljar/mljar-supervised.git@dev

I'm looking for your feedback! Thank you!

pplonski · Apr 29 '21 11:04

It does not look like it improved much. I can still see memory being consumed gradually.

xuzhang5788 · May 03 '21 01:05

@xuzhang5788 yes, it is not fixed 100%. It should be slightly better and perhaps no longer cause crashes. It looks like algorithms from outside the sklearn package don't release memory properly.

I will try to run ML training in separate processes; maybe that will help, but on the other hand I don't want to make the code over-complex.
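For context, the separate-process idea usually looks something like the sketch below: whatever a training run allocates (including buffers held by native libraries) is returned to the operating system when the child process exits. This is a generic pattern, not code from mljar-supervised, and the worker body is a placeholder:

```python
import multiprocessing as mp

def _train_one_model(params, data_path, queue):
    # Hypothetical worker: load data, train, evaluate. Everything allocated
    # here is freed by the OS when this process terminates, leaks included.
    score = 0.42  # placeholder for a real training + evaluation call
    queue.put(score)

def train_in_subprocess(params, data_path):
    queue = mp.Queue()
    worker = mp.Process(target=_train_one_model, args=(params, data_path, queue))
    worker.start()
    score = queue.get()  # read the result before joining to avoid blocking
    worker.join()
    return score

if __name__ == "__main__":  # guard required with the spawn start method (Windows)
    print(train_in_subprocess({"algorithm": "lightgbm"}, "train.csv"))
```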

pplonski · May 04 '21 07:05

This is still an issue, correct? I'm curious since I've been tinkering with mljar for Numerai competitions. I seem to run out of memory: a run will go for 14 hours overnight and I wake up to a stalled computer (I have 64 GB).

brickfrog · May 30 '21 15:05

@BrickFrog yes, it is still an issue.

pplonski · May 31 '21 06:05

@BrickFrog have you used a custom eval_metric when running AutoML on Numerai data? It is possible to pass a custom eval_metric, such as a Sharpe ratio, to be optimized. There is also a Spearman correlation built in as an eval_metric in MLJAR. Sorry if you couldn't find it in the docs; please open a GitHub issue and I will fix the docs.

It is also possible to set up a custom validation strategy by passing predefined train/validation indices for each fold; a sketch follows below.
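A rough sketch of both suggestions. The "spearman" metric name and the shape of the custom-validation API (a "custom" validation_type plus a cv argument to fit holding (train_indices, validation_indices) pairs) are my reading of the library and should be checked against the current docs; the data path and fold indices are placeholders:

```python
import numpy as np
import pandas as pd
from supervised.automl import AutoML

X = pd.read_csv("numerai_training_data.csv")  # placeholder path
y = X.pop("target")

# Hypothetical era-based folds: each tuple is (train_indices, validation_indices).
folds = [
    (np.arange(0, 60_000), np.arange(60_000, 80_000)),
    (np.arange(0, 80_000), np.arange(80_000, 100_000)),
]

automl = AutoML(
    mode="Compete",
    eval_metric="spearman",  # built-in rank-correlation metric mentioned above
    validation_strategy={"validation_type": "custom"},
)
automl.fit(X, y, cv=folds)
```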

I have a plan to add tutorial/examples how mljar-supervised can be used with numerai data.

What's more, we are working on a visual notebook. It will be a desktop application for data science where the user can click together a solution without heavy coding. I would add blocks for Numerai there (get the latest data, upload a submission). I'm attaching a screenshot of a very early development version. (screenshot attached)

pplonski · May 31 '21 07:05

while using "Compete"mode similar issues is still being faced While using "AutoML_class_obj = AutoML(data=data,mode ="Compete",eval_metric = "r2") using in compete mode with around 9998 training samples/records.Either it is getting crashed or it goes on with too many python programs running in task manager. 1.UserWarning:MiniBatchKMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can prevent it by setting batch_size >= 9216 or by setting the environment variable OMP_NUM_THREADS=1 2.OSError: [WinError 1455] The paging file is too small for this operation to complete

sumanttyagi · Nov 10 '21 05:11

@sumanttyagi thank you for reporting. I understand that you are on Windows. Could you please post the full code with a data sample to reproduce the issue? Is that possible?

pplonski · Nov 10 '21 09:11