
[CUDA] Crash when using device_type=cuda

Open asenzz opened this issue 7 months ago • 3 comments

Description

I'm trying to use LightGBM on a multi-GPU NVIDIA V100 system. When device_type is set to cuda I get a segmentation fault; with device_type=gpu it works fine. I'm using a build from the latest checkout of LightGBM's master branch.

(gdb) where
#0  0x00000ede46d7cfc9 in LightGBM::CUDARegressionObjectiveInterface<LightGBM::RegressionL2loss>::Init(LightGBM::Metadata const&, int) () from /usr/local/lib/lib_lightgbm.so
#1  0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
#2  0x00000ede4643b33f in LightGBM::Booster::Booster (this=0xedd2d9c2800, train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"...) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:183
#3  LGBM_BoosterCreate (train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"..., out=0x7ffc34df6c28) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:1944
#4  0x00000ede996a988d in svr::kernel::kernel_gbm<double>::init (this=this@entry=0xedd3263fcd0, X_t=..., Y=...) at /usr/include/c++/14/bits/basic_string.h:227
#5  0x00000ede99825dd2 in _ZN3svr9datamodel9OnlineSVR4tuneEv._omp_fn.0(void) () at /mnt/faststore/repo/tempus-core/SVRRoot/OnlineSVR/src/onlinesvr_tune_fast.cpp:145


(gdb) list
69      in ./nptl/pthread_mutex_trylock.c
(gdb) up
#1  0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
213           objective_fun_->Init(train_data_->metadata(), train_data_->num_data());
(gdb) list -10
198         boosting_->MergeFrom(other->boosting_.get());
199       }
200
201       ~Booster() {
202       }
203
204       void CreateObjectiveAndMetrics() {
205         // create objective function
206         objective_fun_.reset(ObjectiveFunction::CreateObjectiveFunction(config_.objective,
207                                                                         config_));


The parameters string is built as follows:

s << "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=" LGBM_MAXBIN " num_leaves=256 min_data_in_leaf=100 learning_rate=" << PROPS.get_k_learn_rate() << " num_iterations=" << PROPS.get_k_epochs() <<
        " feature_fraction=0.8 bagging_fraction=0.8 bagging_freq=5 metric=l2 save_binary=true use_missing=false force_col_wise=true num_threads=" << C_n_cpu << " device_type=cuda num_gpu=" << common::gpu_handler_1::get().get_gpu_devices_count();

Reproducible example

Environment info

LightGBM version or commit hash:

Command(s) you used to install LightGBM


[20250705-05:27:46] zarko@tempus:/mnt/faststore/repo/tempus-core/build$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.10
Release:        24.10
Codename:       oracular


nvidia-smi 
Sat Jul  5 05:28:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-FHHL-16GB           On  |   00000000:03:00.0 Off |                    0 |
| N/A   36C    P0             24W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-FHHL-16GB           On  |   00000000:04:00.0 Off |                    0 |
| N/A   35C    P0             22W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-FHHL-16GB           On  |   00000000:05:00.0 Off |                    0 |
| N/A   34C    P0             23W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-FHHL-16GB           On  |   00000000:82:00.0 Off |                    0 |
| N/A   34C    P0             25W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Additional Comments

asenzz avatar Jul 05 '25 05:07 asenzz

Thanks for using LightGBM.

Could you please provide the information that the issue template asks for?

It's difficult to help if we cannot reproduce the issue.

  • version of LightGBM
  • exact commands you used to install it
  • minimal, reproducible example (code we could use to try to reproduce the error)

If you haven't seen it before, please also review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax. It has some advice for formatting text on GitHub.

jameslamb avatar Jul 07 '25 03:07 jameslamb

Version:

cat VERSION.txt
4.6.0.99

Last commit:

commit e7c6c4371b5d725902a09a80b4d6c36e432a4381 (HEAD -> master, origin/master, origin/HEAD)
Author: Nick Miller [email protected]
Date:   Fri Jun 20 20:59:13 2025 -0700

[ci] [R-package] Add period after specified linter names in `nolint` comments (#6950)

On branch master
Your branch is up to date with 'origin/master'.

Install commands (CMake configuration):


BUILD_CLI                        ON
BUILD_CPP_TEST                   OFF
BUILD_STATIC_LIB                 OFF
Boost_FILESYSTEM_LIBRARY_RELEA   /usr/local/lib/libboost_filesystem.so.1.87.0
Boost_INCLUDE_DIR                /usr/local/include
Boost_SYSTEM_LIBRARY_RELEASE     /usr/local/lib/libboost_system.so.1.87.0
CMAKE_BUILD_TYPE                 Release
CMAKE_CUDA_ARCHITECTURES         70
CMAKE_CXX_COMPILER_LAUNCHER      ccache
CMAKE_INSTALL_PREFIX             /usr/local
ENABLED_SANITIZERS
INSTALL_HEADERS                  ON
USE_CUDA                         ON
USE_DEBUG                        OFF
USE_GPU                          ON
USE_HOMEBREW_FALLBACK            ON
USE_MPI                          ON
USE_OPENMP                       ON
USE_SANITIZER                    OFF
USE_SWIG                         OFF
USE_TIMETAG                      OFF
__BUILD_FOR_PYTHON               OFF
__BUILD_FOR_R                    OFF
__INTEGRATE_OPENCL               OFF

sudo make install

Minimal reproducible example:

#define LGBM_MAXBIN "255"

constexpr char C_lgbm_dataset_parameters[] = "max_bin=" LGBM_MAXBIN " use_missing=false save_binary=true";

std::string get_gbm_parameters(const uint16_t gpu_id)
{
    std::stringstream s;
    s << "objective=regression tree_learner=data num_leaves=256 early_stopping_round=200 seed=123 learning_rate=" << PROPS.get_k_learn_rate() << " num_iterations=" << PROPS.get_k_epochs()
        << " metric=l2 force_col_wise=true num_threads=" << C_n_cpu << " device_type=gpu " << C_lgbm_dataset_parameters;
#ifndef NDEBUG
    s << " verbosity=1 ";
#endif
    return s.str();
}

int main(const int argc, const char **argv)
{
    const uint32_t n_samples_2 = 4000000;
    const uint32_t n_manifold_features = 160;
    lg_errchk(LGBM_SetMaxThreads(C_n_cpu));
    DatasetHandle train_dataset;
    lg_errchk(LGBM_DatasetCreateFromMat(manifold_features_t.mem, C_API_DTYPE_FLOAT32, n_samples_2, n_manifold_features,
        1 /* is_row_major: row-major order */, C_lgbm_dataset_parameters, nullptr, &train_dataset));

    lg_errchk(LGBM_DatasetSetField(train_dataset, "label", manifold_labels.mem, n_samples_2, C_API_DTYPE_FLOAT32));

    lg_errchk(LGBM_RegisterLogCallback(lgbm_log));

    BoosterHandle booster;
    common::gpu_context ctx;
    const auto gbm_parameters = get_gbm_parameters(ctx.phy_id());
    lg_errchk(LGBM_BoosterCreate(train_dataset, gbm_parameters.c_str(), &booster));
    int update_finished = 0;
    auto iter = PROPS.get_k_epochs() + 1;
    assert(iter);
    while (update_finished == 0 && --iter) lg_errchk(LGBM_BoosterUpdateOneIter(booster, &update_finished));

    int64_t model_size = 0;
    LGBM_BoosterSaveModelToString(booster, 0, 0, C_API_FEATURE_IMPORTANCE_SPLIT, 0, &model_size, nullptr);
    std::vector<char> model_str(model_size);
    lg_errchk(LGBM_BoosterSaveModelToString(booster, 0, 0, C_API_FEATURE_IMPORTANCE_SPLIT, model_size, &model_size, model_str.data()));
    lg_errchk(LGBM_BoosterFree(booster));
    lg_errchk(LGBM_DatasetFree(train_dataset));
    return 0;
}

asenzz avatar Jul 07 '25 06:07 asenzz

Another issue I noticed in the same context: when I set device_type=gpu and gpu_device_id=X, the program always uses the first of the 4 GPUs available on the system. I tried it on two different servers (4 x NVIDIA V100 and 4 x A100) with the same result.

asenzz avatar Oct 01 '25 17:10 asenzz