
[python-package] Segmentation fault with CUDA version in Python interface (core dumped)

Open · leedchou opened this issue 5 months ago · 12 comments

Description

I installed lightgbm 4.3.0.0, the CUDA version. After the data was loaded and transferred to the GPU, execution simply stopped. The log is below. GPU memory is about 12 GB, while the data is about 6 GB.

[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Total Bins 2438
[LightGBM] [Info] Number of data points in the train set: 35322835, number of used features: 48
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
Segmentation fault (core dumped)

Reproducible example

params = {
    'task': 'train',
    'objective': focal_loss_obj,
    'max_bin': 63,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda'
}

gbm = lgb.train(
    params,
    train_set=lgb_train,
    valid_sets=(lgb_train, lgb_eval),
    valid_names=('fit', 'eval'),
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
    feval=gmean_score
)

Environment info

LightGBM version or commit hash: 4.3.0.0

Command(s) you used to install LightGBM:

pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm
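
For reference, a minimal smoke test (not part of the original report) to confirm that a CUDA-enabled build imports and trains at all, assuming a CUDA-capable GPU is available:

import lightgbm as lgb
import numpy as np

# Tiny random binary problem; one boosting round is enough to exercise the GPU path.
rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.integers(0, 2, size=1000)

print(lgb.__version__)
booster = lgb.train(
    {"objective": "binary", "device": "cuda"},
    lgb.Dataset(X, label=y),
    num_boost_round=1,
)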

Additional Comments

leedchou · Feb 05 '24 09:02

Thanks for using LightGBM, and for the well-formatted report.

We'd be happy to help, but there are some things you can do to narrow down the issue further and reduce the effort that'll be required to find the root cause.

  • Can you please provide the code for focal_loss_obj() and gmean_score()?
  • If you use LightGBM's built-in loss function and metrics, does LightGBM still segfault? If not, then the issue might be somewhere in your implementations of those functions.
  • Alternatively... if you can't share the dataset you're using, can you try the exact same parameters, loss function, metrics, etc. with a public dataset, like those available from scikit-learn via sklearn.datasets? And report what happens? (A rough sketch of these two checks follows this list.)
  • Can you try removing parameters from params one by one, reducing them to the smallest set of non-default values that still produces the problem? For example, if you remove bagging_fraction and feature_fraction and still see a segfault, that's very helpful, because it tells us the issue is not related to subsampling of rows and columns inside LightGBM.
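
To make those two checks concrete, here is a rough sketch (assumptions: synthetic data from sklearn.datasets.make_classification standing in for the private dataset, and LightGBM's built-in multiclass objective and metric replacing focal_loss_obj and gmean_score):

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic 4-class data, loosely shaped like the report (48 features).
X, y = make_classification(n_samples=100_000, n_features=48, n_informative=20,
                           n_classes=4, random_state=42)
X_fit, X_eval, y_fit, y_eval = train_test_split(X, y, random_state=42)
lgb_train = lgb.Dataset(X_fit, label=y_fit)
lgb_eval = lgb.Dataset(X_eval, label=y_eval, reference=lgb_train)

params = {
    "objective": "multiclass",  # built-in objective instead of focal_loss_obj
    "metric": "multi_logloss",  # built-in metric instead of gmean_score
    "num_class": 4,
    "device": "cuda",
}
gbm = lgb.train(params, lgb_train, valid_sets=[lgb_eval], num_boost_round=50)

If this runs cleanly, the segfault likely involves the custom objective/metric or the specific dataset.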

jameslamb · Feb 05 '24 15:02

Also note that I've reformatted your original post slightly to make the difference between code, your own words, and text printed by code clearer. You can click ... -> Edit in GitHub to see what that looks like in raw markdown form.

If you're unsure how I did that, please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.

jameslamb · Feb 05 '24 15:02

Thank you @jameslamb for the kind reply. I took New Year's leave and got back to work today. There is one important thing I forgot to post here: when I reduced the number of data points in the training set to a smaller number (e.g. 100,000), it worked. So maybe it is a data problem?

leedchou · Feb 19 '24 03:02

Please provide the details I asked for at https://github.com/microsoft/LightGBM/issues/6300#issuecomment-1927231076 to help us eliminate possible causes.

jameslamb · Feb 19 '24 05:02

Please provide the details I asked for at #6300 (comment) to help us eliminate possible causes.

Hi @jameslamb, I re-ran my code with nothing changed except the training data, which I replaced with the iris data from sklearn.datasets.load_iris. Surprisingly, it worked.
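
For concreteness, that substitution looks roughly like this (a sketch, with two assumptions beyond "nothing changed": iris has only 3 classes, so num_class drops to 3, and focal_loss_lgb is the objective implementation posted later in this thread):

import lightgbm as lgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
lgb_train = lgb.Dataset(X, label=y)

params = {
    "objective": lambda p, d: focal_loss_lgb(p, d, alpha=[1, 1, 1], num_class=3),
    "num_class": 3,
    "device": "cuda",
}
gbm = lgb.train(params, lgb_train, num_boost_round=10)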

leedchou · Feb 20 '24 09:02

Seems related to the CUDA version. I will investigate this.

shiyu1994 · Feb 20 '24 09:02

@leedchou Could you provide the implementation of focal_loss_obj?

shiyu1994 · Feb 20 '24 10:02

In addition, if you could provide a minimal example for reproducing the error, that would be very helpful.

shiyu1994 · Feb 20 '24 10:02

Thank you @shiyu1994, I'd love to show you the implementation of focal_loss_obj and an example. It would be great if I could get your email address so I can send you an example by email.

leedchou · Feb 21 '24 02:02

@leedchou Thanks. You may send that to my personal email [email protected]. It would also be great if you could post the example here for clear and open discussion.

shiyu1994 · Feb 21 '24 02:02

It would also be great if you could post the example here for clear and open discussion.

Please do this, @leedchou, so that everyone finding this discussion from search in the future can learn from it and so that others can contribute to helping.

jameslamb · Feb 21 '24 02:02

Ok, I'll post it here @shiyu1994 .

focal_loss_obj:

import numpy as np
from scipy import special


def focal_loss_lgb(y_pred, dtrain, alpha, gamma=2, num_class=4):
    target = dtrain.get_label()
    grad = np.zeros((len(target), num_class), dtype=float)
    hess = np.zeros((len(target), num_class), dtype=float)

    y_true = np.eye(num_class)[target.astype('int')]  # one-hot encoding of the labels
    y_pred = y_pred.reshape(len(target), num_class, order='F')
    softmax_p = special.softmax(y_pred, axis=-1)

    # pt is the predicted probability of the true class for each row
    pt = softmax_p[y_true == 1]

    for c in range(num_class):
        pc = softmax_p[:, c]
        pos = y_true[:, c] == 1  # rows whose true class is c
        neg = ~pos
        grad[pos, c] = (gamma * np.power(1 - pt[pos], gamma - 1) * pt[pos] * np.log(pt[pos])
                        - np.power(1 - pt[pos], gamma)) * (1 - pc[pos])
        grad[neg, c] = (gamma * np.power(1 - pt[neg], gamma - 1) * pt[neg] * np.log(pt[neg])
                        - np.power(1 - pt[neg], gamma)) * (0 - pc[neg])
        hess[pos, c] = (-4 * (1 - pt[pos]) * pt[pos] * np.log(pt[pos])
                        + np.power(1 - pt[pos], 2) * (2 * np.log(pt[pos]) + 5)) * pt[pos] * (1 - pt[pos])
        hess[neg, c] = (pt[neg] * np.power(pc[neg], 2)
                        * (-2 * pt[neg] * np.log(pt[neg]) + 2 * (1 - pt[neg]) * np.log(pt[neg])
                           + 4 * (1 - pt[neg]))
                        - pc[neg] * (1 - pc[neg]) * (1 - pt[neg])
                        * (2 * pt[neg] * np.log(pt[neg]) - (1 - pt[neg])))

    # scale gradients and Hessians by the per-row class weight alpha
    alpha = np.array([alpha[i] for i in target.astype('int')])[:, np.newaxis]
    grad = alpha * grad
    hess = alpha * hess

    return grad.flatten('F'), hess.flatten('F')
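
Since segfaults with custom objectives often trace back to gradients of the wrong shape or dtype, a quick CPU-side sanity check of the objective's output (a debugging sketch, not part of the original report) can rule that out before CUDA is involved:

import lightgbm as lgb
import numpy as np

# grad and hess must each be 1-D, finite, and of length n_rows * num_class.
n, k = 1000, 4
rng = np.random.default_rng(42)
dtrain = lgb.Dataset(rng.random((n, 10)), label=rng.integers(0, k, size=n))
dtrain.construct()  # materialize the Dataset so get_label() works

raw_pred = rng.normal(size=n * k)  # fake raw scores, flattened in Fortran order
grad, hess = focal_loss_lgb(raw_pred, dtrain, alpha=[1, 1, 1, 1])
assert grad.shape == hess.shape == (n * k,)
assert np.isfinite(grad).all() and np.isfinite(hess).all()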

training example:

class_weights = [1, 1, 1, 1]
focal_loss_obj = lambda x, y: focal_loss_lgb(x, y, alpha=class_weights, gamma=2, num_class=4)
gmean_score = lambda x, y: gmean_metric(x, y, num_class=4)  # gmean_metric not shown; see the hypothetical sketch below

params = {
    'objective': focal_loss_obj,
    'task': 'train',
    'max_bin': 255,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda',
    # 'gpu_platform_id': 0,
    # 'gpu_device_id': 0,
}
eval_result = {}
gbm = lgb.train(params,
                train_set=lgb_train,
                valid_sets=(lgb_train, lgb_eval),
                valid_names=('fit', 'eval'),
                num_boost_round=10000,
                callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
                feval=gmean_score)
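
gmean_metric is not shown in the thread; for completeness, one plausible stand-in (purely hypothetical, not the author's code) computes the geometric mean of per-class recall:

import numpy as np
from scipy import stats

def gmean_metric(y_pred, dtrain, num_class=4):
    # Hypothetical reconstruction: geometric mean of per-class recall.
    y_true = dtrain.get_label().astype(int)
    y_pred = np.asarray(y_pred)
    # Handle both prediction layouts LightGBM may pass to feval:
    # a 2-D (n_samples, num_class) array, or a 1-D Fortran-ordered array.
    if y_pred.ndim == 1:
        y_pred = y_pred.reshape(len(y_true), num_class, order='F')
    pred_class = y_pred.argmax(axis=1)
    recalls = [np.mean(pred_class[y_true == c] == c) for c in range(num_class)]
    # feval contract: (name, value, is_higher_better)
    return 'gmean', float(stats.gmean(recalls)), True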

leedchou · Feb 21 '24 09:02