[python-package] Segmentation fault with CUDA version in Python interface (core dumped)
Description
I installed lightgbm-4.3.0.0, CUDA version. After the data was loaded and transferred to the GPU, execution just stopped. Below is the log. GPU memory is about 12 GB while the data is about 6 GB.
[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Total Bins 2438
[LightGBM] [Info] Number of data points in the train set: 35322835, number of used features: 48
[LightGBM] [Warning] Using customized objective with cuda. This requires copying gradients from CPU to GPU, which can be slow.
[LightGBM] [Info] Using self-defined objective function
Segmentation fault (core dumped)
Reproducible example
import lightgbm as lgb

# focal_loss_obj and gmean_score are a custom objective and a custom evaluation metric
# (defined elsewhere; the focal loss implementation is posted later in this thread)
params = {
    'task': 'train',
    'objective': focal_loss_obj,
    'max_bin': 63,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda'
}
eval_result = {}
gbm = lgb.train(
    params,
    train_set=lgb_train,
    valid_sets=(lgb_train, lgb_eval),
    valid_names=('fit', 'eval'),
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
    feval=gmean_score
)
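For completeness: `lgb_train` and `lgb_eval` above are LightGBM Dataset objects. A rough sketch of how they can be constructed, where `X_train`, `y_train`, `X_eval`, `y_eval` are placeholders for the actual data, which is not shown here:

# Placeholder arrays standing in for the real training and validation data
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_eval = lgb.Dataset(X_eval, label=y_eval, reference=lgb_train)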
Environment info
LightGBM version or commit hash: 4.3.0.0
Command(s) you used to install LightGBM
pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm
Additional Comments
Thanks for using LightGBM, and for the well-formatted report.
We'd be happy to help, but there are some things you can do to narrow down the issue further and reduce the effort that'll be required to find the root cause.
- Can you please provide the code for `focal_loss_obj()` and `gmean_score()`?
- If you use a LightGBM built-in loss function and metric, does LightGBM still segfault? If not, then the issue might be somewhere in your implementations of those functions.
- Alternatively... if you can't share the dataset you're using, can you try with the exact same parameters, loss function, metrics, etc. but a public dataset, like those available from scikit-learn via `sklearn.datasets`, and report what happens? (Something like the sketch right after this list.)
- Can you try removing parameters from `params` one by one and try to reduce it to the smallest set of non-default values that still produces the problem? For example, if you remove `bagging_fraction` and `feature_fraction` and still see a segfault, that's very helpful because it tells us the issue is not related to subsampling of rows and columns inside LightGBM.
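For reference, a rough sketch of that kind of minimal check, using the built-in multiclass objective and the iris data from `sklearn.datasets` (all variable names are illustrative, and this assumes a CUDA-enabled build is installed):

import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_fit, X_eval, y_fit, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)

lgb_train = lgb.Dataset(X_fit, label=y_fit)
lgb_eval = lgb.Dataset(X_eval, label=y_eval, reference=lgb_train)

params = {
    'objective': 'multiclass',  # built-in loss instead of the custom focal loss
    'num_class': 3,             # iris has 3 classes
    'device': 'cuda',
    'random_state': 42,
}

# If this trains without a segfault, the problem is more likely in the
# custom objective / metric path than in CUDA training itself.
gbm = lgb.train(
    params,
    train_set=lgb_train,
    valid_sets=[lgb_eval],
    num_boost_round=50,
    callbacks=[lgb.early_stopping(stopping_rounds=10)],
)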
Also note that I've reformatted your original post slightly to make the difference between code, your own words, and text printed by code clearer. You can click ... -> Edit
in GitHub to see what that looks like in raw markdown form.
If you're unsure how I did that, please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.
Thank you @jameslamb for the kind reply. I took new year's leave and got back to work today. There is one important thing I forgot to mention here: when I reduced the number of data points in the training set to a smaller number (e.g. 100,000), it worked. So maybe it is a data problem?
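(For clarity, a sketch of the subsampling meant here; `X` and `y` stand in for the full training data, which is not shown.)

# Hypothetical illustration: build the training Dataset from only the first 100,000 rows
lgb_train_small = lgb.Dataset(X[:100_000], label=y[:100_000])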
Please provide the details I asked for at https://github.com/microsoft/LightGBM/issues/6300#issuecomment-1927231076 to help us eliminate possible causes.
Hi @jameslamb, I re-ran my code with nothing changed but the training data, which I replaced with the iris data from `sklearn.datasets.load_iris()`. Surprisingly, it worked.
Seems related to the CUDA version. I will investigate this.
@leedchou Could you provide the implementation of `focal_loss_obj`?
In addition, if you could provide a minimal example for reproducing the error, that would be very helpful.
Thank you @shiyu1994, I'd love to show you the implementation of `focal_loss_obj` and an example. It would be great if I could get your email address, so I can send you an example by email.
@leedchou Thanks. You may send that to my personal email [email protected]. It would also be great if you could post the example here for clear and open discussion.
It would also be great if you could post the example here for clear and open discussion.
Please do this, @leedchou, so that everyone finding this discussion from search in the future can learn from it and so that others can contribute to helping.
Ok, I'll post it here, @shiyu1994.
focal_loss_obj:
import numpy as np
from scipy import special


def focal_loss_lgb(y_pred, dtrain, alpha, gamma=2, num_class=4):
    target = dtrain.get_label()
    grad = np.zeros((len(target), num_class), dtype=float)
    hess = np.zeros((len(target), num_class), dtype=float)
    y_true = np.eye(num_class)[target.astype('int')]  # one-hot encoding of the labels
    y_pred = y_pred.reshape(len(target), num_class, order='F')
    softmax_p = special.softmax(y_pred, axis=-1)
    # pt: predicted probability of each sample's true class, shape (n_samples,)
    pt = softmax_p[y_true == 1]
    for c in range(num_class):
        pc = softmax_p[:, c]
        pos = y_true[:, c] == 1  # samples whose true class is c
        neg = y_true[:, c] == 0  # samples whose true class is not c
        grad[pos, c] = (gamma * np.power(1 - pt[pos], gamma - 1) * pt[pos] * np.log(pt[pos])
                        - np.power(1 - pt[pos], gamma)) * (1 - pc[pos])
        grad[neg, c] = (gamma * np.power(1 - pt[neg], gamma - 1) * pt[neg] * np.log(pt[neg])
                        - np.power(1 - pt[neg], gamma)) * (0 - pc[neg])
        hess[pos, c] = (-4 * (1 - pt[pos]) * pt[pos] * np.log(pt[pos])
                        + np.power(1 - pt[pos], 2) * (2 * np.log(pt[pos]) + 5)) * pt[pos] * (1 - pt[pos])
        hess[neg, c] = (pt[neg] * np.power(pc[neg], 2)
                        * (-2 * pt[neg] * np.log(pt[neg]) + 2 * (1 - pt[neg]) * np.log(pt[neg]) + 4 * (1 - pt[neg]))
                        - pc[neg] * (1 - pc[neg]) * (1 - pt[neg]) * (2 * pt[neg] * np.log(pt[neg]) - (1 - pt[neg])))
    # per-sample class weights
    alpha = np.array([alpha[i] for i in target.astype('int')])[:, np.newaxis]
    grad = alpha * grad
    hess = alpha * hess
    return grad.flatten('F'), hess.flatten('F')
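A quick, illustrative sanity check of the objective's output shape on synthetic data (the random arrays and names below are placeholders, not the real workload):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 5))
y_syn = rng.integers(0, 4, size=200)
dtrain_syn = lgb.Dataset(X_syn, label=y_syn)

# Raw (pre-softmax) scores flattened to 1-D; focal_loss_lgb reshapes them with order='F'
raw_pred = rng.normal(size=200 * 4)
grad, hess = focal_loss_lgb(raw_pred, dtrain_syn, alpha=[1, 1, 1, 1], gamma=2, num_class=4)

# One gradient/hessian value per (sample, class) pair is expected for multiclass
assert grad.shape == (200 * 4,)
assert hess.shape == (200 * 4,)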
train example:
class_weights = [1, 1, 1, 1]
focal_loss_obj = lambda x, y: focal_loss_lgb(x, y, alpha=class_weights, gamma=2, num_class=4)
# gmean_metric is a custom evaluation metric (its definition is not included in this post)
gmean_score = lambda x, y: gmean_metric(x, y, num_class=4)
params = {
    'objective': focal_loss_obj,
    'task': 'train',
    'max_bin': 255,
    'num_leaves': 255,
    'min_data_in_leaf': 20,
    'max_depth': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'num_class': 4,
    'n_jobs': -1,
    'random_state': 42,
    'boosting_type': 'gbdt',
    'device': 'cuda',
    # 'gpu_platform_id': 0,
    # 'gpu_device_id': 0,
}
eval_result = {}
gbm = lgb.train(
    params,
    train_set=lgb_train,
    valid_sets=(lgb_train, lgb_eval),
    valid_names=('fit', 'eval'),
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.record_evaluation(eval_result)],
    feval=gmean_score
)