SynapseML
OOM during model training because of coalesce
With a big (repartitioned) DataFrame, LightGBM is not able to train the model due to OOM, even when there is plenty of memory. The repartitioning is effectively ignored because of this line:
val df = dataset.toDF().coalesce(numWorkers)
https://github.com/Azure/mmlspark/blob/6a6d57f40ecd25a23efae29b2d18671647dbdb3f/src/lightgbm/src/main/scala/LightGBMBase.scala#L45
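For illustration, here is a minimal sketch of the user-side scenario (the `LightGBMRegressor` estimator, import path, DataFrame, and column names are assumptions for the example): no matter how the DataFrame is repartitioned beforehand, the line above coalesces it back down to `numWorkers` partitions.

```python
from mmlspark import LightGBMRegressor  # import path may differ between versions

# Repartitioning into many partitions does not help: LightGBMBase coalesces
# the DataFrame back down to one partition per worker before training, so
# each worker must hold roughly dataset_size / num_workers rows in memory.
train_df = big_df.repartition(1024)
model = (LightGBMRegressor()
         .setFeaturesCol("features")
         .setLabelCol("label")
         .fit(train_df))
```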
@REASY you could try batch training, which splits the dataset and trains on one part at a time:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMParams.scala#L164
val numBatches = new IntParam(this, "numBatches",
  "If greater than 0, splits data into separate batches during training")
setDefault(numBatches -> 0)
The coalesce is done because LightGBM must have all partitions in memory on the workers for training; splitting the dataset into batches or increasing the size of the cluster are two possible solutions.
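Continuing the sketch above, setting the batch parameter from PySpark could look like this, assuming a build that exposes it (`setNumBatches` is the assumed auto-generated setter for `numBatches`):

```python
# Split training into 4 sequential batches so that only part of the dataset
# has to be materialized in worker memory at a time.
model = (LightGBMRegressor()
         .setFeaturesCol("features")
         .setLabelCol("label")
         .setNumBatches(4)
         .fit(train_df))
```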
@imatiach-msft thanks for the help! I tried to find that parameter, but it seems the functionality is new and not available in mmlspark v0.17. Do you know if there is a snapshot version or a nightly build of master?
Hi, @imatiach-msft. Is there any update on this? Thanks!
Hi @REASY, sorry for not replying sooner. Yes, there are builds with each PR; you can get last night's build from master with:
--packages com.microsoft.ml.spark:mmlspark_2.11:0.17.dev25 and --repositories https://mmlspark.azureedge.net/maven
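For example, those coordinates could be wired into a PySpark session like this (just one way to pass the flags above; the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Resolve the nightly mmlspark build from the custom Maven repository.
spark = (SparkSession.builder
         .appName("lightgbm-batch-training")
         .config("spark.jars.packages",
                 "com.microsoft.ml.spark:mmlspark_2.11:0.17.dev25")
         .config("spark.jars.repositories",
                 "https://mmlspark.azureedge.net/maven")
         .getOrCreate())
```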
Also, I added some improvements that may reduce memory significantly in some cases, such as when there are columns other than the features and label column in the dataset (which for one internal data scientist was also causing OOM).
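For builds without that improvement, a possible workaround (a sketch; the column names "features" and "label" are assumptions) is to drop the extra columns before fitting:

```python
# Keep only the columns the trainer needs, so the coalesced partitions do not
# carry unused payload columns into worker memory.
train_df = df.select("features", "label")
```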
Great, will try, thanks a lot @imatiach-msft !
@REASY thanks, please let me know if you are still seeing OOM with the new build
You can also try a larger repartition number:
df = df.repartition(num)
It solved my problem.