SynapseML
LightGBM task encountered empty partition, for best performance ensure no partitions empty
Hi, this issue was reported earlier by other people as well, but I could not find a solution. The code is given below. I have a large dataset.
MMLSpark version: com.microsoft.azure:synapseml_2.12:0.9.5
Python version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
PySpark version: 3.3.0
from pyspark.ml.feature import VectorAssembler

# Assemble every column except the label into a single feature vector
Features_col = df2.schema.names[0:-1]
assembler = VectorAssembler(inputCols=Features_col, outputCol="features")
assembler.setHandleInvalid("skip").transform(df2).show()
stages = [assembler]
df5 = assembler.transform(df2)
# test3 = assembler.transform(test2)
train, test = df5.randomSplit([0.85, 0.15], seed=786)
from synapse.ml.automl import *
from synapse.ml.train import *
from synapse.ml.lightgbm import LightGBMClassifier
import sklearn.metrics as metrics

labelCol = "target"

# Base LightGBM classifier
lgbmc = LightGBMClassifier(boostingType='dart',
                           objective='binary',
                           metric='auc',
                           isUnbalance=True,
                           numIterations=300)
smlmodels = [lgbmc]
mmlmodels = [TrainClassifier(model=model, labelCol=labelCol) for model in smlmodels]

# Hyperparameter search space
paramBuilder = (HyperparamBuilder()
    .addHyperparam(lgbmc, lgbmc.learningRate, RangeHyperParam(0.01, 0.5))
    .addHyperparam(lgbmc, lgbmc.maxDepth, DiscreteHyperParam([1, 30]))
    .addHyperparam(lgbmc, lgbmc.numLeaves, DiscreteHyperParam([10, 200]))
    .addHyperparam(lgbmc, lgbmc.featureFraction, RangeHyperParam(0.1, 1.0))
    .addHyperparam(lgbmc, lgbmc.baggingFraction, RangeHyperParam(0.1, 1.0))
    .addHyperparam(lgbmc, lgbmc.baggingFreq, RangeHyperParam(0, 3)))
searchSpace = paramBuilder.build()
randomSpace = RandomSpace(searchSpace)

# Random search with 2-fold cross-validation
bestModel = TuneHyperparameters(evaluationMetric="AUC", models=mmlmodels, numFolds=2,
                                numRuns=len(mmlmodels) * 2, parallelism=1,
                                paramSpace=randomSpace.space(), seed=0).fit(train)

AB#1865176
@Shafi2016 it looks like you have some empty partitions in your Spark dataframe. That is just a warning, though. The real problem is that you are running out of memory, based on the error "OutOfMemoryError: Java heap space". You can try to fix the warning by calling repartition on your dataset, but even if that fixes the warning, the memory problem may remain. To solve it, you can try to increase the size of your cluster or downsample the dataset. We are also working on a streaming solution that would reduce memory usage even more.
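A minimal sketch of those two workarounds, reusing train, mmlmodels, and randomSpace from the snippet above; the partition count (200) and sample fraction (0.2) are illustrative values to tune for your cluster, not recommendations:

# Repartition to even out the data and remove empty partitions
train_repart = train.repartition(200)

# Optionally downsample to reduce memory pressure while debugging
train_small = train_repart.sample(fraction=0.2, seed=786)

# Fit on the repartitioned (or downsampled) data
bestModel = TuneHyperparameters(evaluationMetric="AUC", models=mmlmodels, numFolds=2,
                                numRuns=len(mmlmodels) * 2, parallelism=1,
                                paramSpace=randomSpace.space(), seed=0).fit(train_repart)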
The low-memory streaming solution is a WIP, so closing this issue for now. Please reopen if needed.