SynapseML OOM during model training because of coalesce

With a large DataFrame (one that has already been repartitioned), LightGBM fails to train the model with an OOM error even when plenty of memory is available. The repartitioning is effectively discarded by this line:

val df = dataset.toDF().coalesce(numWorkers)

https://github.com/Azure/mmlspark/blob/6a6d57f40ecd25a23efae29b2d18671647dbdb3f/src/lightgbm/src/main/scala/LightGBMBase.scala#L45

May 23 '19 08:05 REASY

@REASY you could try to use batch training which would split the dataset and train on each part at a time:

https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMParams.scala#L164

val numBatches = new IntParam(this, "numBatches", "If greater than 0, splits data into separate batches during training")
setDefault(numBatches -> 0)

The coalesce is done because LightGBM must have all partitions in memory on the workers during training. Splitting the dataset or increasing the size of the cluster are two possible solutions.
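To make the trade-off concrete, here is a rough back-of-the-envelope sketch in plain Scala (no Spark needed). It assumes rows are spread evenly across workers and batches, which real data will only approximate. The object name, the function name and the row counts are hypothetical and not part of mmlspark.

```scala
object WorkerMemorySketch {
  /** Approximate number of rows each worker holds at once, assuming an even split. */
  def rowsPerWorker(totalRows: Long, numWorkers: Int, numBatches: Int = 0): Long = {
    require(totalRows >= 0 && numWorkers > 0 && numBatches >= 0)
    // numBatches == 0 mirrors the default of the parameter: no batching, all rows at once.
    val batches = math.max(numBatches, 1)
    // Ceiling division: rows in one batch, then that batch's share per worker.
    val rowsPerBatch = (totalRows + batches - 1) / batches
    (rowsPerBatch + numWorkers - 1) / numWorkers
  }

  def main(args: Array[String]): Unit = {
    val totalRows = 100000000L // hypothetical dataset size

    // Baseline: everything coalesced onto 4 workers.
    println(rowsPerWorker(totalRows, numWorkers = 4))
    // Option 1: split the dataset into 5 batches, same cluster.
    println(rowsPerWorker(totalRows, numWorkers = 4, numBatches = 5))
    // Option 2: no batching, but a cluster 5 times larger.
    println(rowsPerWorker(totalRows, numWorkers = 20))
  }
}
```

Under these assumptions both options cut the per-worker row count by the same factor. Batching does this without adding hardware, at the cost of training on the batches one after another.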

May 29 '19 04:05 imatiach-msft

@imatiach-msft thanks for the help! I tried to find that method, but it seems the functionality is new and not available in mmlspark v0.17. Do you know whether there is a snapshot version or nightly build of master?

Jun 03 '19 07:06 REASY

Hi, @imatiach-msft. Is there any update on this? Thanks!

Jul 02 '19 15:07 REASY

Hi @REASY, sorry for not replying sooner. Yes, there are builds with each PR. You can get last night's build from master with:

--packages com.microsoft.ml.spark:mmlspark_2.11:0.17.dev25 and --repositories https://mmlspark.azureedge.net/maven

Also, I added some improvements that may reduce memory usage significantly in some cases, such as when the dataset contains columns other than the features and label columns (which was also causing OOM for one internal data scientist).

Jul 03 '19 04:07 imatiach-msft

Great, will try, thanks a lot @imatiach-msft !

Jul 03 '19 05:07 REASY

@REASY thanks, please let me know if you are still seeing OOM with the new build

Jul 03 '19 15:07 imatiach-msft

You can try a larger repartition number, e.g. df = df.repartition(num). It solved my problem.

Apr 28 '24 09:04 alexab612
