
[CUDA] Crash when using device_type=cuda

Open asenzz opened this issue 7 months ago • 3 comments

Description

I'm trying to use LightGBM on a multi-GPU NVIDIA V100 system. When device_type is set to cuda I get a segmentation fault; with device_type=gpu it works fine. I'm using a build from the latest checkout of LightGBM's master branch.

(gdb) where
#0  0x00000ede46d7cfc9 in LightGBM::CUDARegressionObjectiveInterface<LightGBM::RegressionL2loss>::Init(LightGBM::Metadata const&, int) () from /usr/local/lib/lib_lightgbm.so
#1  0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
#2  0x00000ede4643b33f in LightGBM::Booster::Booster (this=0xedd2d9c2800, train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"...) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:183
#3  LGBM_BoosterCreate (train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"..., out=0x7ffc34df6c28) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:1944
#4  0x00000ede996a988d in svr::kernel::kernel_gbm<double>::init (this=this@entry=0xedd3263fcd0, X_t=..., Y=...) at /usr/include/c++/14/bits/basic_string.h:227
#5  0x00000ede99825dd2 in _ZN3svr9datamodel9OnlineSVR4tuneEv._omp_fn.0(void) () at /mnt/faststore/repo/tempus-core/SVRRoot/OnlineSVR/src/onlinesvr_tune_fast.cpp:145


(gdb) list
69      in ./nptl/pthread_mutex_trylock.c
(gdb) up
#1  0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
213           objective_fun_->Init(train_data_->metadata(), train_data_->num_data());
(gdb) list -10
198         boosting_->MergeFrom(other->boosting_.get());
199       }
200
201       ~Booster() {
202       }
203
204       void CreateObjectiveAndMetrics() {
205         // create objective function
206         objective_fun_.reset(ObjectiveFunction::CreateObjectiveFunction(config_.objective,
207                                                                         config_));


The parameters string is built as follows:

s << "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=" LGBM_MAXBIN " num_leaves=256 min_data_in_leaf=100 learning_rate=" << PROPS.get_k_learn_rate() << " num_iterations=" << PROPS.get_k_epochs() <<
        " feature_fraction=0.8 bagging_fraction=0.8 bagging_freq=5 metric=l2 save_binary=true use_missing=false force_col_wise=true num_threads=" << C_n_cpu << " device_type=cuda num_gpu=" << common::gpu_handler_1::get().get_gpu_devices_count();

Reproducible example

Environment info

LightGBM version or commit hash:

Command(s) you used to install LightGBM


[20250705-05:27:46] zarko@tempus:/mnt/faststore/repo/tempus-core/build$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.10
Release:        24.10
Codename:       oracular


nvidia-smi 
Sat Jul  5 05:28:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-FHHL-16GB           On  |   00000000:03:00.0 Off |                    0 |
| N/A   36C    P0             24W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-FHHL-16GB           On  |   00000000:04:00.0 Off |                    0 |
| N/A   35C    P0             22W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-FHHL-16GB           On  |   00000000:05:00.0 Off |                    0 |
| N/A   34C    P0             23W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-FHHL-16GB           On  |   00000000:82:00.0 Off |                    0 |
| N/A   34C    P0             25W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Additional Comments

asenzz avatar Jul 05 '25 05:07 asenzz

Thanks for using LightGBM.

Could you please provide the information that the issue template asks for?

It's difficult to help if we cannot reproduce the issue.

  • version of LightGBM
  • exact commands you used to install it
  • minimal, reproducible example (code we could use to try to reproduce the error)

If you haven't seen it before, please also review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax. It has some advice for formatting text on GitHub.

jameslamb avatar Jul 07 '25 03:07 jameslamb

Version:

cat VERSION.txt
4.6.0.99

Last commit:

commit e7c6c4371b5d725902a09a80b4d6c36e432a4381 (HEAD -> master, origin/master, origin/HEAD)
Author: Nick Miller [email protected]
Date:   Fri Jun 20 20:59:13 2025 -0700

[ci] [R-package] Add period after specified linter names in `nolint` comments (#6950)

On branch master
Your branch is up to date with 'origin/master'.

Install commands (CMake configuration):


BUILD_CLI                        ON
BUILD_CPP_TEST                   OFF
BUILD_STATIC_LIB                 OFF
Boost_FILESYSTEM_LIBRARY_RELEA   /usr/local/lib/libboost_filesystem.so.1.87.0
Boost_INCLUDE_DIR                /usr/local/include
Boost_SYSTEM_LIBRARY_RELEASE     /usr/local/lib/libboost_system.so.1.87.0
CMAKE_BUILD_TYPE                 Release
CMAKE_CUDA_ARCHITECTURES         70
CMAKE_CXX_COMPILER_LAUNCHER      ccache
CMAKE_INSTALL_PREFIX             /usr/local
ENABLED_SANITIZERS
INSTALL_HEADERS                  ON
USE_CUDA                         ON
USE_DEBUG                        OFF
USE_GPU                          ON
USE_HOMEBREW_FALLBACK            ON
USE_MPI                          ON
USE_OPENMP                       ON
USE_SANITIZER                    OFF
USE_SWIG                         OFF
USE_TIMETAG                      OFF
__BUILD_FOR_PYTHON               OFF
__BUILD_FOR_R                    OFF
__INTEGRATE_OPENCL               OFF

sudo make install

Minimal reproducible example:

#define LGBM_MAXBIN "255"

constexpr char C_lgbm_dataset_parameters[] = "max_bin=" LGBM_MAXBIN " use_missing=false save_binary=true";

std::string get_gbm_parameters(const uint16_t gpu_id)
{
    std::stringstream s;
    s << "objective=regression tree_learner=data num_leaves=256 early_stopping_round=200 seed=123 learning_rate=" << PROPS.get_k_learn_rate() << " num_iterations=" << PROPS.get_k_epochs()
        << " metric=l2 force_col_wise=true num_threads=" << C_n_cpu << " device_type=gpu " << C_lgbm_dataset_parameters;
#ifndef NDEBUG
    s << " verbosity=1 ";
#endif
    return s.str();
}

int main(const int argc, const char **argv)
{
    const uint32_t n_samples_2 = 4000000;
    const uint32_t n_manifold_features = 160;
    lg_errchk(LGBM_SetMaxThreads(C_n_cpu));
    DatasetHandle train_dataset;
    lg_errchk(LGBM_DatasetCreateFromMat(manifold_features_t.mem, C_API_DTYPE_FLOAT32, n_samples_2, n_manifold_features,
        1 /* is_row_major: row-major order */, C_lgbm_dataset_parameters, nullptr, &train_dataset));

    lg_errchk(LGBM_DatasetSetField(train_dataset, "label", manifold_labels.mem, n_samples_2, C_API_DTYPE_FLOAT32));

    lg_errchk(LGBM_RegisterLogCallback(lgbm_log));

    BoosterHandle booster;
    common::gpu_context ctx;
    const auto gbm_parameters = get_gbm_parameters(ctx.phy_id());
    lg_errchk(LGBM_BoosterCreate(train_dataset, gbm_parameters.c_str(), &booster));
    int update_finished = 0;
    auto iter = PROPS.get_k_epochs() + 1;
    assert(iter);
    while (update_finished == 0 && --iter) lg_errchk(LGBM_BoosterUpdateOneIter(booster, &update_finished));

    int64_t model_size = 0;
    LGBM_BoosterSaveModelToString(booster, 0, 0, C_API_FEATURE_IMPORTANCE_SPLIT, 0, &model_size, nullptr);
    std::vector<char> model_str(model_size);
    lg_errchk(LGBM_BoosterSaveModelToString(booster, 0, 0, C_API_FEATURE_IMPORTANCE_SPLIT, model_size, &model_size, model_str.data()));
    lg_errchk(LGBM_BoosterFree(booster));
    lg_errchk(LGBM_DatasetFree(train_dataset));
    return 0;
}

asenzz avatar Jul 07 '25 06:07 asenzz

Another issue I noticed in the same context: when I set device_type=gpu and gpu_device_id=X, the program always uses the first of the 4 GPUs available on the system. I tried it on two different servers (4 x NVIDIA V100 and 4 x A100) with the same result.

asenzz avatar Oct 01 '25 17:10 asenzz