LightGBM
[Distributed][C-api]failed: Error in LGBM_BoosterUpdateOneIter: Socket recv error, Connection reset by peer (code: 104)
Description
I get errors when using 16 machines, each with 16 cores and 32 GB of memory.
Dataset: 13 million sample points, each with 26 features; the total data size is 2.4 GB.
The training task is binary classification.
Environment info
LightGBM version or commit hash: latest
Command(s) you used to install LightGBM: Follow the official website tutorial
Code
int result = LGBM_NetworkInit(machines.c_str(), port, listenTimeOut, numMachines);
if (result == -1) {
  LOG(FATAL) << "LGBM_NetworkInit failed with error: " << LGBM_GetLastError();
}
int iter = 0;
while (true) {
  // Data structure
  std::vector<double> selectedData;
  auto& inputData = kDistributedData[rank];
  for (size_t i = 0; i < context.getNumMasterVertices(); ++i) {
    auto offset = i * decReg.size();
    for (size_t j = 0; j < decReg.size(); ++j) {
      if (decReg[j] >= 0) {
        selectedData.push_back(inputData[offset + j]);
      }
    }
  }
  // Create Data file
  // Create Dataset
  DatasetHandle dataset_handle;
  auto sampleNum = context.getNumMasterVertices();
  // pre_partition=true: the training data are already split across machines,
  // so each machine loads only its own partition
  std::string datasetParam = "pre_partition=true bin_construct_sample_cnt=" + folly::to<std::string>(sampleNum);
  auto localResult = LGBM_DatasetCreateFromFile(data_name_stream.str().c_str(), datasetParam.c_str(), nullptr, &dataset_handle);
  if (localResult == -1) {
    std::string error_msg = "Error in LGBM_DatasetCreateFromFile: " + std::string(LGBM_GetLastError());
    throw std::runtime_error(error_msg);
  }
  // Create Booster
  BoosterHandle boosterHandle;
  // LGBM parameter settings
  LgbmParameters lgbmParams;
  lgbmParams.setJsonParameter(*lgbmParam_);
  if (numMachines > 1) {
    lgbmParams.setParameter("tree_learner", "data_parallel");
    lgbmParams.setParameter("num_machines", std::to_string(numMachines));
  }
  lgbmParams.setParameter("seed", std::to_string(seed_));
  lgbmParams.setParameter("num_threads", std::to_string(workerVCores_));
  if (rank == 0) {
    std::cout << "lgbmParams: " << lgbmParams.toString() << std::endl;
  }
  localResult = LGBM_BoosterCreate(dataset_handle, lgbmParams.toString().c_str(), &boosterHandle);
  if (localResult == -1) {
    std::string error_msg = "Error in LGBM_BoosterCreate: " + std::string(LGBM_GetLastError());
    throw std::runtime_error(error_msg);
  }
  auto startProgram = std::chrono::system_clock::now();
  // Train LGBM model
  for (int i = 0; i < epoch_; ++i) {
    int is_finished = 0;
    int ret = LGBM_BoosterUpdateOneIter(boosterHandle, &is_finished);
    if (ret != 0) {
      std::string error_msg = "Error in LGBM_BoosterUpdateOneIter: " + std::string(LGBM_GetLastError());
      throw std::runtime_error(error_msg);
    }
    double numResult = 0.0;
    int outLen = 0;
    if (i % logInterval_ == 0) {
      ret = LGBM_BoosterGetEval(boosterHandle, 0, &outLen, &numResult);
      if (ret == 0) {
        std::cout << i << " iteration test Binary_logloss: " << std::setprecision(6) << numResult << std::endl;
      }
    }
    if (is_finished) break;
  }
  // ... rest of the while (true) loop is omitted here
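The snippet above stops inside the while (true) loop, so the teardown of the handles is not shown. For reference, below is a minimal sketch, not the reporter's actual code, of the cleanup a worker would perform with the LightGBM C API. LGBM_BoosterFree, LGBM_DatasetFree and LGBM_NetworkFree are C API functions from LightGBM/c_api.h; where each worker tears down its network connection matters here, because once any one worker closes its sockets or exits early, the remaining workers see "Connection reset by peer" inside LGBM_BoosterUpdateOneIter.

#include <LightGBM/c_api.h>
#include <iostream>

// Hypothetical helper: free the handles created in the snippet above.
// Every machine must stay in the network until all of them have finished
// training; tearing down early on one worker breaks the others.
void cleanupLightGbm(BoosterHandle boosterHandle, DatasetHandle dataset_handle) {
  if (LGBM_BoosterFree(boosterHandle) != 0) {
    std::cerr << "LGBM_BoosterFree: " << LGBM_GetLastError() << std::endl;
  }
  if (LGBM_DatasetFree(dataset_handle) != 0) {
    std::cerr << "LGBM_DatasetFree: " << LGBM_GetLastError() << std::endl;
  }
  // Release the distributed-training network last, once training is done.
  if (LGBM_NetworkFree() != 0) {
    std::cerr << "LGBM_NetworkFree: " << LGBM_GetLastError() << std::endl;
  }
}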
Additional Comments
With 16 machines the error appears with high probability; with 8 machines it has not appeared so far.
Thanks for using LightGBM.
"LightGBM version or commit hash: latest"
"Command(s) you used to install LightGBM: Follow the official website tutorial"
These are not acceptable answers. "latest" is not a recognized git reference in this project. Please show the result of running this command on your clone of LightGBM:
git rev-parse HEAD
"Follow the official website tutorial" is also not very informative. This project's documentation describes dozens of different ways to build and install the library. Please share the exact commands you used, as the form asked for. For example, I can't tell from what you've provided whether you're using CPU-based or GPU-based training.
It'll also help if you provide the following information:
- how are you invoking LightGBM? Your own C/C++ program, I guess?
- what parameters (exact values) are you passing to LightGBM?
- what operating system are you using? is it the same on all machines in the network?
Other things you might try:
- can you observe resource utilization across the different worker machines? for example, maybe some of the training processes are running out of memory and being OOMKilled? LightGBM distributed training cannot currently tolerate any worker processes being lost (partially described in #3775)
I'm sorry for not providing all the details. Here are my answers.
1. This is the shell I used to download and install LightGBM:
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
mkdir build
cd build
cmake ..
make -j4
make install
As you can see, this should install the latest version of LightGBM.
2. I use LightGBM on the CPU, and the parameters are as follows:
LgbmParameters() {
  // Initialize default parameters
  parameters_["tree_learner"] = "data";
  parameters_["num_machines"] = "16";
  parameters_["num_threads"] = "7";
  parameters_["task"] = "train";
  parameters_["boosting_type"] = "gbdt";
  parameters_["objective"] = "binary";
  parameters_["metric"] = "binary_logloss";
  parameters_["num_leaves"] = "10";
  parameters_["learning_rate"] = "0.1";
  parameters_["is_unbalance"] = "true";
  parameters_["verbose"] = "1";
  parameters_["max_depth"] = "5";
}
I splice these parameters into a char* and pass it to the LGBM_BoosterCreate interface (a sketch of this splicing is shown at the end of this reply).
3. I call LightGBM through its C API; the calling sequence is shown in the C++ code at the top of this issue.
4. I am using CentOS, and I can ensure that all machines are identical because they all run in Docker containers. I can also guarantee that nothing is being OOMKilled: each machine has 16 cores and 32 GB of memory, and the error can still occur under those conditions.
In the setup described above the error occurs with high probability, but there are also runs that complete successfully.
Finally, thank you for your reply. If you have any further questions, please feel free to contact me.
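As mentioned in answer 2 above, the parameters are spliced into a single char* for LGBM_BoosterCreate. For illustration only: the LgbmParameters class is the reporter's own code, so the following is an assumption about what its toString() does, not the actual implementation. LGBM_BoosterCreate expects a single space-separated string of key=value pairs, which could be produced like this:

#include <map>
#include <string>

// Hypothetical sketch of joining a parameter map into the
// "key1=value1 key2=value2 ..." string that LGBM_BoosterCreate expects.
std::string toParameterString(const std::map<std::string, std::string>& parameters) {
  std::string result;
  for (const auto& kv : parameters) {
    if (!result.empty()) {
      result += ' ';
    }
    result += kv.first + '=' + kv.second;
  }
  return result;
}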