
LightGBM task encountered empty partition, for best performance ensure no partitions empty

Open · Shafi2016 opened this issue 3 years ago • 1 comment

Hi, this issue was reported earlier by other people as well, but I could not find a solution. The code is given below. I have a large dataset.

MMLSpark version: com.microsoft.azure:synapseml_2.12:0.9.5
System version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
PySpark version: 3.3.0

from pyspark.ml.feature import VectorAssembler

# Assemble all columns except the last (the label) into a single feature vector
Features_col = df2.schema.names[0:-1]
assembler = VectorAssembler(inputCols=Features_col, outputCol="features")
assembler.setHandleInvalid("skip").transform(df2).show()
stages = [assembler]
df5 = assembler.transform(df2)
# test3 = assembler.transform(test2)
train, test = df5.randomSplit([0.85, 0.15], seed=786)

from synapse.ml.automl import *
from synapse.ml.train import *
import sklearn.metrics as metrics
labelCol = "target"
lgbmc = LightGBMClassifier(boostingType='dart',
                           objective= 'binary',
                           metric= 'auc',
                           isUnbalance= True,
                           numIterations= 300)

smlmodels = [lgbmc]
mmlmodels = [TrainClassifier(model=model, labelCol= labelCol) for model in smlmodels]

paramBuilder = (HyperparamBuilder()
.addHyperparam(lgbmc, lgbmc.learningRate, RangeHyperParam(0.01, 0.5))
.addHyperparam(lgbmc, lgbmc.maxDepth, DiscreteHyperParam([1,30]))
.addHyperparam(lgbmc, lgbmc.numLeaves, DiscreteHyperParam([10,200]))
.addHyperparam(lgbmc, lgbmc.featureFraction, RangeHyperParam(0.1, 1.0))
.addHyperparam(lgbmc, lgbmc.baggingFraction, RangeHyperParam(0.1, 1.0))
.addHyperparam(lgbmc, lgbmc.baggingFreq, RangeHyperParam(0, 3))
)

searchSpace = paramBuilder.build()

randomSpace = RandomSpace(searchSpace)

bestModel = TuneHyperparameters(evaluationMetric="AUC", models=mmlmodels, numFolds=2, 
                                numRuns=len(mmlmodels) * 2, parallelism=1, 
                                paramSpace=randomSpace.space(), seed=0).fit(train)

(Screenshot of the error output: OutOfMemoryError: Java heap space)

AB#1865176

Shafi2016 · Jul 08 '22

@Shafi2016 it looks like you have some empty partitions in your spark dataframe. That is just a warning though. The problem is that you are running out of memory, based on the error "OutOfMemoryError: Java heap space". You can try to fix the warning by calling repartition on your dataset, but even if it fixes the warning, the problem may remain. To solve the problem, you can try to increase the size of your cluster or downsample the dataset. We are also working on a streaming solution at the moment which would reduce memory usage even more.
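For reference, a minimal sketch of the two workarounds mentioned above, assuming `train` is the training DataFrame from the snippet; the partition count and sample fraction are illustrative values, not recommendations:

# Repartition so data is spread across all partitions (the count here is only an
# example; something on the order of 2-4x the total executor cores is a common choice)
train = train.repartition(32)

# Optionally verify that no partition is empty by counting rows per partition
partition_sizes = train.rdd.glom().map(len).collect()
print(partition_sizes)

# If the job still runs out of memory, downsample before fitting
# (fraction and seed are illustrative)
train_small = train.sample(withReplacement=False, fraction=0.5, seed=786)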

imatiach-msft · Jul 08 '22

The low-memory streaming solution is a WIP, so closing this issue for now. Please reopen if needed.

svotaw · Sep 12 '22