Memory usage during training
Hi,
I've trained several models with mode="Perform", and when the training gets to a certain point the Python process is killed because of the memory usage (I'm using a computer with 16 GB of RAM).
What I do is rerun the script and change the model_name to the name of the model that was just created, in order to resume training. A couple of times I've had to repeat this process twice.
It is not caused by a single model, but by data from previous, already trained models that is not released from memory.

Hey @RafaD5! This looks like a bug. I'm pretty sure that data between different folds and models should be cleared. Do you observe the same behavior in Compete mode? You can set validation_strategy={"validation_type": "kfold", "k_folds": 5, "shuffle": True} to get the same CV as in Perform mode.
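For reference, a minimal sketch of that setup (the toy dataset below is only a placeholder to make the snippet runnable):

```python
# Compete mode with the same 5-fold CV configuration that Perform mode uses.
from sklearn.datasets import make_classification
from supervised import AutoML

# Placeholder data; replace with your own dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

automl = AutoML(
    mode="Compete",
    validation_strategy={
        "validation_type": "kfold",
        "k_folds": 5,
        "shuffle": True,
    },
)
automl.fit(X, y)
```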
@pplonski I met the same situation several times. There is a memory leak.
@xuzhang5788 was it for Perform mode or another mode?
Compete and Optuna modes. In my case, in one notebook, the memory accumulated after I ran automl.fit several times, until the kernel got killed. I have to restart my kernel for every new training.
@xuzhang5788 thank you, I will work on it. Any help appreciated! :)
@RafaD5 @xuzhang5788 I made a few changes:
- The data is no longer stored in files during AutoML; I just keep a copy of it in memory. It looks like some data copies were created during save/load to files and never cleared. Keeping the data directly in RAM should be faster and use less memory, because there are no leaks during save/load.
- I added direct `del` statements on datasets and `gc.collect()` calls everywhere (the same pattern, applied at the notebook level, is sketched below).
- I opened an issue for LightGBM, because it looks like it does not release memory after training and `del`.
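The same pattern can also be applied at the notebook level between consecutive runs; a rough sketch (random data used only as a placeholder):

```python
import gc

import numpy as np
from supervised import AutoML

# Placeholder data; replace with your own dataset.
X = np.random.rand(10_000, 20)
y = np.random.rand(10_000)

automl = AutoML(mode="Perform")
automl.fit(X, y)

# Drop references to the large objects once training is done and ask the
# garbage collector to reclaim the memory before the next run.
del automl, X, y
gc.collect()
```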
All changes are in the dev branch. You can install it:
pip install -q -U git+https://github.com/mljar/mljar-supervised.git@dev
I'm looking for your feedback! Thank you!
It doesn't look like it improved a lot. I can still see the memory being occupied gradually.
@xuzhang5788 yes, it is not fixed 100%. It should be slightly better and maybe not cause crashes. It looks like algorithms that are not from the sklearn package don't release memory properly.
I will try to run the ML training in separate processes; maybe that will help, but on the other hand I don't want to make the code over-complex.
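Roughly, the idea is something like this (a sketch only, with placeholder data and results_path, not what is implemented in the package):

```python
# Run each AutoML training in a child process so the OS reclaims all of
# its memory when the process exits.
import multiprocessing as mp

import numpy as np


def train_one(results_path):
    from supervised import AutoML  # import inside the child process

    # Placeholder data; replace with your own dataset.
    X = np.random.rand(5_000, 20)
    y = np.random.rand(5_000)
    automl = AutoML(mode="Compete", results_path=results_path)
    automl.fit(X, y)


if __name__ == "__main__":
    for i in range(3):
        p = mp.Process(target=train_one, args=(f"AutoML_run_{i}",))
        p.start()
        p.join()  # memory held by the child is released here
```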
This is still an issue, correct? I'm curious since I've been tinkering with mljar for numerai competitions. I seem to run out of memory: the training would run for 14 hours overnight, and I would wake up to a stalled computer (I have 64 GB).
@BrickFrog yes, it is still an issue.
@BrickFrog have you used a custom eval_metric when running AutoML on numerai data? It is possible to pass a custom eval_metric, like the Sharpe ratio, to be optimized. There is also Spearman correlation built in as an eval_metric in MLJAR. Sorry if you couldn't find it in the docs. Please open a GitHub issue and I will fix the docs.
It is also possible to set up a custom validation strategy by passing predefined train/validation indices for each fold.
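Roughly like this (a sketch; the exact custom-metric signature, the "custom" validation keys, and the cv argument of fit are assumptions to double-check against the docs):

```python
import numpy as np
from scipy.stats import spearmanr
from supervised import AutoML

# Placeholder data; replace with your own dataset.
X = np.random.rand(2_000, 10)
y = np.random.rand(2_000)


# Assumed shape of a custom metric: a callable returning a single float
# (check the docs for the exact signature and the min/max convention).
def my_spearman(y_true, y_predicted, sample_weight=None):
    return spearmanr(y_true, y_predicted).correlation


# Assumed way of passing predefined train/validation indices per fold.
folds = [
    (np.arange(0, 1_500), np.arange(1_500, 2_000)),
    (np.arange(500, 2_000), np.arange(0, 500)),
]

automl = AutoML(
    eval_metric=my_spearman,
    validation_strategy={"validation_type": "custom"},
)
automl.fit(X, y, cv=folds)
```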
I plan to add a tutorial/examples on how mljar-supervised can be used with numerai data.
What is more, we are working on a visual-notebook. It will be a desktop application for data science where the user can click out a solution without heavy coding. I'm attaching a screenshot (a very early development version). I would add blocks for numerai there (get the latest data, upload a submission).

while using "Compete"mode similar issues is still being faced While using "AutoML_class_obj = AutoML(data=data,mode ="Compete",eval_metric = "r2") using in compete mode with around 9998 training samples/records.Either it is getting crashed or it goes on with too many python programs running in task manager. 1.UserWarning:MiniBatchKMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can prevent it by setting batch_size >= 9216 or by setting the environment variable OMP_NUM_THREADS=1 2.OSError: [WinError 1455] The paging file is too small for this operation to complete
@sumanttyagi thank you for reporting. I understand that you are on Windows. Could you please post the full code with a data sample so the issue can be reproduced? Is that possible?
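In the meantime, a possible workaround sketch based on the warnings above: set the environment variable before any ML libraries are imported and limit parallelism (treat the exact values as examples):

```python
import os

# Set OMP_NUM_THREADS before importing any ML libraries, as the
# MiniBatchKMeans warning suggests.
os.environ["OMP_NUM_THREADS"] = "1"

from supervised import AutoML  # imported after setting the variable

# Limiting n_jobs reduces the number of parallel Python processes.
automl = AutoML(mode="Compete", eval_metric="r2", n_jobs=1)
# automl.fit(X, y)  # X, y: your training data
```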