LightGBM numBatches and earlyStoppingRound conflict
SynapseML version
1.0.13
System information
- Language version: Python 3.9
Describe the problem
From glancing at the LightGBM code, I believe there is a conflict between the numBatches parameter and earlyStoppingRound. If you set both of these parameters, I think you may hit early stopping in the first batch and then never train on the remaining batches.
This would be very suboptimal. earlyStoppingRound is intended to improve generalization; it would be tragic if it caused training to see only a small fraction of the data, greatly reducing generalization.
I think that when both of these parameters are present, early stopping should apply separately to each batch, i.e., when training stops making progress on the current batch, it should continue to the next batch.
Code to reproduce issue
from synapse.ml.lightgbm import LightGBMRegressor

model = LightGBMRegressor(
    featuresCol="featureVector",
    labelCol="label",
    predictionCol="prediction",
    validationIndicatorCol="validation",
    numIterations=800,
    numBatches=10,
    earlyStoppingRound=1,
)
model.fit(training_data) # may stop in the first batch, without seeing 90% of the training data
Other info / logs
No response
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [ ] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [x] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [x] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/synapse: Azure Synapse integrations
- [ ] integrations/azureml: Azure ML integrations
- [ ] integrations/databricks: Databricks integrations
Hey @DavidWAbrahams :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
Hi @DavidWAbrahams, thank you for using SynapseML. I looked into this, and the code shows that earlyStoppingRound only reduces the work done within a batch, while numBatches still drives the job through every batch; there is no path where an early stop in the first batch prevents later batches from running. If anything still looks off in your run, let me know the configuration and logs so we can keep digging.
Some code pointers that might help (a simplified, illustrative sketch of this control flow follows the list):
- lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/LightGBMBase.scala:43-59 Multi-batch training always iterates through every split. The foldLeft never exits early; even when the previous batch sets early stopping, the loop still moves on to the next (datasetBatch, batchIndex) pair.
- lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/LightGBMBase.scala:249-287 & TrainUtils.scala:16-28 Before each new batch, the booster from the prior batch is serialized via setModelString(...). When the next batch starts, getGeneralParams injects that model string, and TrainUtils.createBooster merges it into the new booster. Each batch therefore resumes training from the accumulated model instead of starting from scratch.
- lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/BasePartitionTask.scala:34-46 Every batch constructs a fresh PartitionTaskTrainingState with iteration = 0 and clean “best” trackers. Early stopping counters start over at the beginning of each batch, so hitting the criterion in batch 0 does not suppress training in batch 1.
- lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/TrainUtils.scala:140-164, BasePartitionTask.scala:450-457, booster/LightGBMBooster.scala:445-451 Early stopping only halts additional boosting iterations inside the current batch. The booster is trimmed to the best iteration and shipped back to the driver, but the outer batch loop keeps running. The saved booster records its bestIteration, so subsequent batches keep training from that point while respecting the early-stopped iteration count.
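To make that control flow concrete, here is a minimal, runnable Python sketch of the same pattern. It is illustrative only: the actual implementation is the Scala cited above (LightGBMBase.scala, BasePartitionTask.scala, TrainUtils.scala), and every name in the sketch (fit_with_batches, train_one_batch, Booster, simulate_validation_score) is a hypothetical stand-in rather than a SynapseML API.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Booster:
    # Stand-in for a trained booster; model_string is the serialized model
    # carried from batch to batch.
    model_string: str
    iterations_run: int


def simulate_validation_score(batch: List[float], iteration: int) -> float:
    # Toy validation metric, purely so the sketch runs end to end: it improves
    # for a while and then plateaus, which eventually triggers early stopping.
    return max(1.0, sum(batch) / (iteration + 1))


def train_one_batch(batch: List[float],
                    init_model_string: Optional[str],
                    num_iterations: int,
                    early_stopping_round: int) -> Booster:
    # Early-stopping state is created fresh for every batch (analogous to the
    # clean PartitionTaskTrainingState): the iteration counter and the "best"
    # trackers start over, so stopping early in batch 0 cannot suppress batch 1.
    prefix = init_model_string or "empty"
    best_score = float("inf")
    rounds_without_improvement = 0
    for iteration in range(num_iterations):
        score = simulate_validation_score(batch, iteration)
        if score < best_score:
            best_score = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
        if rounds_without_improvement >= early_stopping_round:
            # Early stopping only halts further boosting iterations *within this batch*.
            return Booster(f"{prefix}+{iteration + 1}iters", iteration + 1)
    return Booster(f"{prefix}+{num_iterations}iters", num_iterations)


def fit_with_batches(batches: List[List[float]],
                     num_iterations: int,
                     early_stopping_round: int) -> Optional[str]:
    model_string: Optional[str] = None
    for batch in batches:
        # The outer loop never exits early: even if the previous batch stopped
        # after a few iterations, the next batch still trains, resuming from the
        # accumulated model string (analogous to the foldLeft plus the
        # setModelString handoff between batches).
        booster = train_one_batch(batch, model_string, num_iterations, early_stopping_round)
        model_string = booster.model_string
    return model_string

Running fit_with_batches([[1.0] * 10] * 10, num_iterations=800, early_stopping_round=1) stops each batch after roughly ten iterations, yet the returned model string shows that all ten batches contributed, which matches the behavior described in the bullets above.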