LightGBM [dask] [python-package] Early stopping causes DaskLGBMClassifier to hang

Description

I am training a LightGBM model on a distributed cluster using the Dask interface. When I set the early_stopping_rounds parameter, the training job hangs indefinitely as soon as the early stopping condition appears to be met.

For example, here are the logs for one of the machines in the cluster where early_stopping_rounds is set to a value of 4:

2024-03-04T16:58:39.986-05:00 | [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:39.986-05:00 | [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:45.000-05:00 | [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:45.000-05:00 | [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:45.000-05:00 | [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:50.001-05:00 | [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:50.002-05:00 | [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:50.002-05:00 | [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:56.011-05:00 | [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:56.011-05:00 | [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:58:56.011-05:00 | [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:59:00.013-05:00 | [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:59:00.013-05:00 | [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00 | [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00 | [LightGBM] [Info] Finished linking network in 382.644965 seconds

It seems that in distributed training with Dask, if any individual worker in the cluster hits the early stopping condition, the entire job hangs indefinitely, with no error and no warning.
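To make that hypothesis concrete, here is a toy sketch of how one participant leaving a synchronized loop can stall the rest. It is only an illustration of the suspected failure pattern: it uses Python threads and `threading.Barrier`, not LightGBM or Dask, and the function name `run_workers` and all of its parameters are made up for this example. A timeout stands in for the indefinite hang so the script terminates.

```python
import threading


def run_workers(n_workers=3, n_rounds=5, stop_round=None, timeout=0.5):
    """Toy model: every round, all workers must meet at a barrier.

    stop_round maps a worker rank to the round at which that worker
    decides on its own to stop (a stand-in for local early stopping).
    """
    if stop_round is None:
        stop_round = {0: 2}  # hypothetical: worker 0 stops at round 2

    barrier = threading.Barrier(n_workers)
    outcome = {}

    def worker(rank):
        for rnd in range(n_rounds):
            if stop_round.get(rank) == rnd:
                # This worker leaves without telling the others.
                outcome[rank] = ("stopped_early", rnd)
                return
            try:
                # Stand-in for a collective step that needs every worker.
                barrier.wait(timeout=timeout)
            except threading.BrokenBarrierError:
                # Without the timeout this wait would never return.
                outcome[rank] = ("stalled", rnd)
                return
        outcome[rank] = ("finished", n_rounds)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    for rank, result in sorted(run_workers().items()):
        print(rank, result)
```

In this toy setup the workers that keep going are stuck at the same round where the other one left, which matches the symptom of a silent hang. Whether the actual hang has this cause is not confirmed anywhere in this report.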

Reproducible example

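import lightgbm

# client, X_train, y_train, train_sample_weights, eval_set, eval_names and
# args are created elsewhere and are not shown in this report.
hyperparameters = {
    "num_estimators": 225,
    "early_stopping_rounds": 4,  # the setting that triggers the hang
    "max_depth": 8
}
lightgbm_trainer = lightgbm.DaskLGBMClassifier(
    client=client, silent=False, **hyperparameters
)
# Log evaluation results after every boosting round.
callbacks = [
    lightgbm.log_evaluation(period=1),
]
info_level_verbosity = 1
lightgbm_trainer.fit(
    X=X_train,
    y=y_train,
    sample_weight=train_sample_weights,
    eval_set=eval_set,
    eval_names=eval_names,
    eval_metric=args.eval_metric,
    callbacks=callbacks,
    verbose=info_level_verbosity,
)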

Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM

lightgbm==3.3.5

Linux via the python:3.9.16-bullseye Docker image.

Additional Comments

Mar 04 '24 23:03 tristers-at-square

Thanks for using LightGBM.

Early stopping is not currently supported in the Dask interface. You can subscribe to #3712 to be notified when that work is picked up. We'd also welcome a contribution if you'd like to contribute it!

If you're using lightgbm.dask, please upgrade to at least LightGBM 4.0 (and preferably to the latest version, v4.3.0). There have been 2+ years of improvements and bug fixes since v3.3.5.

Mar 05 '24 14:03 jameslamb

Ah okay. In that case, the dask interface should throw a warning if early_stopping_rounds is passed in. Maybe even an error since passing it in seems to actually cause issues.

Will also give the new version a shot, thanks!

Mar 05 '24 17:03 tristers-at-square
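
Until the Dask interface warns or errors on its own, the check suggested above can be done on the user side before constructing the estimator. The following is a minimal sketch, not part of LightGBM: the helper name `strip_early_stopping` and its `strict` flag are hypothetical, and the alias list is taken from the `early_stopping_round` entry in the LightGBM parameter docs, so it is worth re-checking against the installed version.

```python
import warnings

# Aliases documented for LightGBM's early_stopping_round parameter
# (assumption: verify against the parameter docs for your version).
EARLY_STOPPING_ALIASES = (
    "early_stopping_round",
    "early_stopping_rounds",
    "early_stopping",
    "n_iter_no_change",
)


def strip_early_stopping(params, strict=False):
    """Return a copy of params without early stopping settings.

    Hypothetical helper: warns (or raises if strict=True) when an early
    stopping alias is set to a positive value, since early stopping is
    not supported in the Dask interface.
    """
    found = sorted(
        key for key in EARLY_STOPPING_ALIASES
        if params.get(key) is not None and params[key] > 0
    )
    if not found:
        return dict(params)

    message = (
        "Early stopping is not supported in the Dask interface; "
        "ignoring: " + ", ".join(found)
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message, UserWarning)
    return {key: value for key, value in params.items() if key not in found}


if __name__ == "__main__":
    hyperparameters = {
        "num_estimators": 225,
        "early_stopping_rounds": 4,
        "max_depth": 8,
    }
    print(strip_early_stopping(hyperparameters))
```

The cleaned dictionary can then be passed as `**hyperparameters` to `DaskLGBMClassifier`; with `strict=True` the job fails fast instead of hanging later.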