OOM during model training because of coalesce

Open REASY opened this issue 5 years ago • 7 comments

With a large DataFrame (a repartitioned one), LightGBM fails to train the model because of OOM, even when plenty of memory is available. Basically, the repartitioning of the DataFrame is ignored because of this line:

val df = dataset.toDF().coalesce(numWorkers)

https://github.com/Azure/mmlspark/blob/6a6d57f40ecd25a23efae29b2d18671647dbdb3f/src/lightgbm/src/main/scala/LightGBMBase.scala#L45

REASY • May 23 '19 08:05
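A rough way to see why the coalesce matters: the sketch below is plain Scala arithmetic, not Spark code. It estimates how many rows end up in each partition once a repartitioned DataFrame is coalesced down to numWorkers partitions. The row count, partition count, and worker count are hypothetical values chosen for illustration, and the even-split assumption ignores data skew.

```scala
// Simplified model, not Spark code: estimates the rows per partition
// before and after coalescing a DataFrame down to numWorkers partitions.
object CoalesceEstimate {
  /** Ceiling of totalRows / partitions: the largest partition under an even split. */
  def rowsPerPartition(totalRows: Long, partitions: Int): Long = {
    require(partitions > 0, "partitions must be positive")
    (totalRows + partitions - 1) / partitions
  }

  def main(args: Array[String]): Unit = {
    val totalRows = 100000000L // hypothetical dataset size
    val repartitioned = 2000   // hypothetical df.repartition(2000)
    val numWorkers = 20        // hypothetical value passed to coalesce

    val before = rowsPerPartition(totalRows, repartitioned)
    val after = rowsPerPartition(totalRows, numWorkers)
    println(s"rows per partition before coalesce: $before")
    println(s"rows per partition after coalesce:  $after")
    println(s"growth factor: ${after / before}x")
  }
}
```

With these hypothetical numbers, each partition grows from 50,000 rows to 5,000,000 rows, so the memory needed per task rises by roughly the same factor regardless of how finely the input was repartitioned.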

@REASY you could try batch training, which splits the dataset and trains on one part at a time:

https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMParams.scala#L164

val numBatches = new IntParam(this, "numBatches", "If greater than 0, splits data into separate batches during training")
setDefault(numBatches -> 0)

The coalesce is done because LightGBM must have all partitions in memory on the workers during training. Splitting the dataset or increasing the size of the cluster are two possible solutions.

imatiach-msft • May 29 '19 04:05
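The idea behind batch training, as described above, can be sketched in plain Scala: split the data into numBatches parts and train on one part at a time, carrying the model forward between parts. This is only a conceptual sketch and not the library's implementation. The names `split`, `trainInBatches`, and `RunningMean` are hypothetical, the toy "model" is just a running mean, and the real code may split the data differently.

```scala
// Conceptual sketch only: not mmlspark code. All names here are hypothetical.
object BatchTrainingSketch {
  /** Toy stand-in for a model: a running count and sum. */
  final case class RunningMean(count: Long, sum: Double) {
    def mean: Double = if (count == 0) 0.0 else sum / count
  }

  /** Splits data into at most numBatches chunks; numBatches <= 0 means no splitting. */
  def split[A](data: Seq[A], numBatches: Int): Seq[Seq[A]] =
    if (numBatches <= 0 || data.isEmpty) Seq(data)
    else {
      val size = math.max(1, math.ceil(data.size.toDouble / numBatches).toInt)
      data.grouped(size).toList
    }

  /** Trains on one batch at a time, passing the previous model into each step. */
  def trainInBatches[A, M](data: Seq[A], numBatches: Int)(
      step: (Option[M], Seq[A]) => M): Option[M] =
    split(data, numBatches).foldLeft(Option.empty[M]) { (prev, batch) =>
      Some(step(prev, batch))
    }

  /** Example step: continue the running mean from the previous batch. */
  def meanStep(prev: Option[RunningMean], batch: Seq[Double]): RunningMean = {
    val base = prev.getOrElse(RunningMean(0L, 0.0))
    RunningMean(base.count + batch.size, base.sum + batch.sum)
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 10).map(_.toDouble)
    val model = trainInBatches[Double, RunningMean](data, 4)(meanStep)
    println(model.map(_.mean))
  }
}
```

Only one batch needs to be held in memory per step, which is why this approach trades extra training passes for a smaller peak memory footprint.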

@imatiach-msft thanks for the help! I tried to find that method, but it seems the functionality is new and not available in mmlspark v0.17. Do you know whether there is a snapshot version or a nightly build of master?

REASY • Jun 03 '19 07:06

Hi, @imatiach-msft. Is there any update on this? Thanks!

REASY • Jul 02 '19 15:07

Hi @REASY, sorry for not replying sooner. Yes, there are builds with each PR. You can get last night's build from master here:

--packages com.microsoft.ml.spark:mmlspark_2.11:0.17.dev25 and --repositories https://mmlspark.azureedge.net/maven

Also, I added some improvements that may reduce memory usage significantly in some cases, such as when the dataset contains columns other than the features and label columns (which was also causing OOM for one internal data scientist).

imatiach-msft • Jul 03 '19 04:07

Great, will try, thanks a lot @imatiach-msft!

REASY • Jul 03 '19 05:07

@REASY thanks, please let me know if you still see OOM with the new build.

imatiach-msft • Jul 03 '19 15:07

You can try a larger repartition number, e.g. df = df.repartition(num). It solved my problem.

alexab612 • Apr 28 '24 09:04