LightGBM
[Distributed][C-api]failed: Error in LGBM_BoosterUpdateOneIter: Socket recv error, Connection reset by peer (code: 104)
Description
I get errors when using 16 machines, each with 16 cores and 32 GB of memory.
Dataset: 13 million sample points, each with 26 features; the total data size is 2.4 GB.
The training task is binary classification.
Environment info
LightGBM version or commit hash: latest
Command(s) you used to install LightGBM: Follow the official website tutorial
Code
int result = LGBM_NetworkInit(machines.c_str(), port, listenTimeOut, numMachines);
if (result == -1) {
  LOG(FATAL) << "LGBM_NetworkInit failed with error: " << LGBM_GetLastError();
}
int iter = 0;
while (true) {
  // Data structure
  std::vector<double> selectedData;
  auto& inputData = kDistributedData[rank];
  for (size_t i = 0; i < context.getNumMasterVertices(); ++i) {
    auto offset = i * decReg.size();
    for (size_t j = 0; j < decReg.size(); ++j) {
      if (decReg[j] >= 0) {
        selectedData.push_back(inputData[offset + j]);
      }
    }
  }
  // Create Data file
  // Create Dataset
  DatasetHandle dataset_handle;
  auto sampleNum = context.getNumMasterVertices();
  // pre_partition=true: the training data are already split across machines,
  // so each machine loads only its own partition
  std::string datasetParam = "pre_partition=true bin_construct_sample_cnt=" + folly::to<std::string>(sampleNum);
  auto localResult = LGBM_DatasetCreateFromFile(data_name_stream.str().c_str(), datasetParam.c_str(), nullptr, &dataset_handle);
  if (localResult == -1) {
    std::string error_msg = "Error in LGBM_DatasetCreateFromFile: " + std::string(LGBM_GetLastError());
    throw std::runtime_error(error_msg);
  }
  // Create Booster
  BoosterHandle boosterHandle;
  // LGBM parameter settings
  LgbmParameters lgbmParams;
  lgbmParams.setJsonParameter(*lgbmParam_);
  if (numMachines > 1) {
    lgbmParams.setParameter("tree_learner", "data_parallel");
    lgbmParams.setParameter("num_machines", std::to_string(numMachines));
  }
  lgbmParams.setParameter("seed", std::to_string(seed_));
  lgbmParams.setParameter("num_threads", std::to_string(workerVCores_));
  if (rank == 0) {
    std::cout << "lgbmParams: " << lgbmParams.toString() << std::endl;
  }
  localResult = LGBM_BoosterCreate(dataset_handle, lgbmParams.toString().c_str(), &boosterHandle);
  if (localResult == -1) {
    std::string error_msg = "Error in LGBM_BoosterCreate: " + std::string(LGBM_GetLastError());
    throw std::runtime_error(error_msg);
  }
  auto startProgram = std::chrono::system_clock::now();
  // Train LGBM model
  for (int i = 0; i < epoch_; ++i) {
    int is_finished = 0;
    int ret = LGBM_BoosterUpdateOneIter(boosterHandle, &is_finished);
    if (ret != 0) {
      std::string error_msg = "Error in LGBM_BoosterUpdateOneIter: " + std::string(LGBM_GetLastError());
      throw std::runtime_error(error_msg);
    }
    double numResult = 0.0;
    int outLen = 0;
    if (i % logInterval_ == 0) {
      ret = LGBM_BoosterGetEval(boosterHandle, 0, &outLen, &numResult);
      if (ret == 0) {
        std::cout << i << " iteration test Binary_logloss: " << std::setprecision(6) << numResult << std::endl;
      }
    }
    if (is_finished) break;
  }
  // ... rest of the while (true) loop is omitted here
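The snippet above stops inside the while (true) loop, so the teardown of the handles is not shown. For reference, below is a minimal sketch, not the reporter's actual code, of the cleanup a worker would perform with the LightGBM C API. LGBM_BoosterFree, LGBM_DatasetFree and LGBM_NetworkFree are C API functions from LightGBM/c_api.h; where each worker tears down its network connection matters here, because once any one worker closes its sockets or exits early, the remaining workers see "Connection reset by peer" inside LGBM_BoosterUpdateOneIter.

#include <LightGBM/c_api.h>
#include <iostream>

// Hypothetical helper: free the handles created in the snippet above.
// Every machine must stay in the network until all of them have finished
// training; tearing down early on one worker breaks the others.
void cleanupLightGbm(BoosterHandle boosterHandle, DatasetHandle dataset_handle) {
  if (LGBM_BoosterFree(boosterHandle) != 0) {
    std::cerr << "LGBM_BoosterFree: " << LGBM_GetLastError() << std::endl;
  }
  if (LGBM_DatasetFree(dataset_handle) != 0) {
    std::cerr << "LGBM_DatasetFree: " << LGBM_GetLastError() << std::endl;
  }
  // Release the distributed-training network last, once training is done.
  if (LGBM_NetworkFree() != 0) {
    std::cerr << "LGBM_NetworkFree: " << LGBM_GetLastError() << std::endl;
  }
}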
Additional Comments
With 16 machines the error appears with high probability; with 8 machines it has not appeared so far.
Thanks for using LightGBM.
"LightGBM version or commit hash: latest"
"Command(s) you used to install LightGBM: Follow the official website tutorial"
These are not acceptable answers. "latest" is not a recognized git reference in this project. Please show the result of running this command on your clone of LightGBM:
git rev-parse HEAD
"Follow the official website tutorial" is also not very informative. This project's documentation describes dozens of different ways to build and install the library. Please share the exact commands you used, as the form asked for. For example, I can't tell from what you've provided whether you're using CPU-based or GPU-based training.
It'll also help if you provide the following information:
- how are you invoking LightGBM? Your own C/C++ program, I guess?
- what parameters (exact values) are you passing to LightGBM?
- what operating system are you using? is it the same on all machines in the network?
Other things you might try:
- can you observe resource utilization across the different worker machines? for example, maybe some of the training processes are running out of memory and being OOMKilled? LightGBM distributed training cannot currently tolerate any worker processes being lost (partially described in #3775)
I'm sorry for not providing all the details. Here are my answers.
1. This is the shell I used to download and install LightGBM:
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
mkdir build
cd build
cmake ..
make -j4
make install
As you can see, this should install the latest version of LightGBM.
2. I use LightGBM on the CPU, and the parameters are as follows:
LgbmParameters() {
  // Initialize default parameters
  parameters_["tree_learner"] = "data";
  parameters_["num_machines"] = "16";
  parameters_["num_threads"] = "7";
  parameters_["task"] = "train";
  parameters_["boosting_type"] = "gbdt";
  parameters_["objective"] = "binary";
  parameters_["metric"] = "binary_logloss";
  parameters_["num_leaves"] = "10";
  parameters_["learning_rate"] = "0.1";
  parameters_["is_unbalance"] = "true";
  parameters_["verbose"] = "1";
  parameters_["max_depth"] = "5";
}
I splice these parameters into a char* and pass it to the LGBM_BoosterCreate interface (a sketch of this splicing is shown at the end of this reply).
3. I call LightGBM through its C API; the calling sequence is shown in the C++ code at the top of this issue.
4. I am using CentOS, and I can ensure that all machines are identical because they all run in Docker containers. I can also guarantee that nothing is being OOMKilled: each machine has 16 cores and 32 GB of memory, and the error can still occur under those conditions.
In the setup described above the error occurs with high probability, but there are also runs that complete successfully.
Finally, thank you for your reply. If you have any further questions, please feel free to contact me.
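As mentioned in answer 2 above, the parameters are spliced into a single char* for LGBM_BoosterCreate. For illustration only: the LgbmParameters class is the reporter's own code, so the following is an assumption about what its toString() does, not the actual implementation. LGBM_BoosterCreate expects a single space-separated string of key=value pairs, which could be produced like this:

#include <map>
#include <string>

// Hypothetical sketch of joining a parameter map into the
// "key1=value1 key2=value2 ..." string that LGBM_BoosterCreate expects.
std::string toParameterString(const std::map<std::string, std::string>& parameters) {
  std::string result;
  for (const auto& kv : parameters) {
    if (!result.empty()) {
      result += ' ';
    }
    result += kv.first + '=' + kv.second;
  }
  return result;
}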