
The training results of CUDA and CPU differ with the same dataset and parameters using LightGBM 4.1.0

Open w158rk opened this issue 1 year ago • 12 comments

Description

The training results of CUDA and CPU differ with the same dataset and parameters

Reproducible example

import numpy as np
import lightgbm as lgb

# Simulate a linear model with Gaussian noise and nonnegative sample weights.
N, k = int(1e7), int(1e1)
np.random.seed(0)
X = np.random.normal(0, 1, (N, k))
beta = np.random.normal(0, 1, k)
epsilon = np.random.normal(0, 10, N)
Y = X.dot(beta) + epsilon
W = np.abs(np.random.normal(0, 1, N))
train_set = lgb.Dataset(X, label=Y, weight=W)

params = {
    "objective": "regression",
    "max_bin": 63,
    "num_leaves": 63,
    "learning_rate": 0.1,
    "force_row_wise": True,
    "verbose": 1,
    "deterministic": True,
}
params_gpu = params.copy()
params_gpu.update({"device_type": "cuda"})
params_cpu = params.copy()

# Train the same configuration on CUDA and on CPU.
model_gpu = lgb.train(params_gpu, train_set, num_boost_round=100)
model_cpu = lgb.train(params_cpu, train_set, num_boost_round=100)

# Compare predictions element-wise.
y_pred_gpu = model_gpu.predict(X)
y_pred_cpu = model_cpu.predict(X)
print(y_pred_gpu)
print(y_pred_cpu)
y_dif = np.abs(y_pred_gpu - y_pred_cpu)
print(np.max(y_dif), np.mean(y_dif))

The output is as follows:

[LightGBM] [Warning] Although "deterministic" is set, the results ran by GPU may be non-deterministic.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] Although "deterministic" is set, the results ran by GPU may be non-deterministic.
[LightGBM] [Info] Total Bins 630
[LightGBM] [Info] Number of data points in the train set: 10000000, number of used features: 10
[LightGBM] [Info] Start training from score -0.001554

[LightGBM] [Warning] Although "deterministic" is set, the results ran by GPU may be non-deterministic.
[LightGBM] [Info] Total Bins 630
[LightGBM] [Info] Number of data points in the train set: 10000000, number of used features: 10
[LightGBM] [Info] Start training from score -0.001554

[ 0.29451256 -1.27554319 -4.77001593 ...  0.43338232 -3.89348605
 -0.54135903]
[ 0.32018665 -1.26717548 -4.67964188 ...  0.4488524  -3.88445162
 -0.5272684 ]
2.458828499401017 0.030497075739206972
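
To localize where the two models diverge, one option is to diff their text dumps; a short sketch continuing the script above, using Booster.model_to_string() from the LightGBM Python API:

# Find the first line at which the two model dumps differ; this points to
# the first tree and split where CUDA and CPU training part ways.
gpu_dump = model_gpu.model_to_string().splitlines()
cpu_dump = model_cpu.model_to_string().splitlines()
for i, (g, c) in enumerate(zip(gpu_dump, cpu_dump)):
    if g != c:
        print("first differing line %d:" % i)
        print("  gpu:", g)
        print("  cpu:", c)
        break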

Environment info

LightGBM version or commit hash: v4.1.0

Command(s) you used to install LightGBM

cmake -DUSE_CUDA=1 -DUSE_CPU=1 ..

The GPU device is an A100.

Additional Comments

w158rk avatar Nov 22 '23 08:11 w158rk

@shiyu1994 can you please answer this one?

jameslamb avatar Nov 22 '23 13:11 jameslamb

Any response?

w158rk avatar Dec 20 '23 05:12 w158rk

Thanks for providing the example.

The CUDA and CPU versions may have minor differences in implementation, but in general these differences do not lead to a big difference in performance. We will put more effort into making the two versions as consistent as possible.
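
One common source of this kind of divergence is floating-point summation order: parallel reductions on the GPU can accumulate gradient histograms in a different order than the sequential CPU code, and floating-point addition is not associative. A minimal illustration in plain Python:

# Floating-point addition is not associative, so different accumulation
# orders can yield slightly different sums (and hence different splits).
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False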

Do you observe a difference between the performance metrics of the two versions?
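
For example (a sketch reusing the variables from the reproducible example above; weighted_rmse is an illustrative helper, not a LightGBM function):

def weighted_rmse(y_true, y_pred, w):
    # Weighted RMSE, using the same sample weights W as in training.
    return np.sqrt(np.average((y_true - y_pred) ** 2, weights=w))

print("CUDA weighted RMSE:", weighted_rmse(Y, y_pred_gpu, W))
print("CPU  weighted RMSE:", weighted_rmse(Y, y_pred_cpu, W))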

shiyu1994 avatar Dec 20 '23 15:12 shiyu1994

Performance: we were expecting CUDA to improve performance, but the results showed that in some cases the CUDA version was slower. The reason might be that our problem size is too small for an A100.

Deterministic property: is there any way I can get exactly the same model from the CPU and GPU versions? @shiyu1994

w158rk avatar Jan 18 '24 02:01 w158rk

@w158rk What's the sample size and feature number of your datasets?

For now, the implementation of the CUDA version still has some minor differences compared with the CPU version. You may try the older GPU version with device_type=gpu to see if it produces results consistent with the CPU.
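
For instance, continuing the script above (a sketch; params_ocl and model_ocl are illustrative names, and it assumes LightGBM was built with the OpenCL backend, i.e. -DUSE_GPU=1):

# Train with the older OpenCL-based device and compare against the CPU model.
params_ocl = params.copy()
params_ocl.update({"device_type": "gpu"})
model_ocl = lgb.train(params_ocl, train_set, num_boost_round=100)
y_pred_ocl = model_ocl.predict(X)
print(np.max(np.abs(y_pred_ocl - y_pred_cpu)))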

shiyu1994 avatar Feb 05 '24 08:02 shiyu1994

The same as in the example: N, k = int(1e7), int(1e1)

OK, I'll try. Do you have any plans for ensuring the consistency of the CUDA and CPU versions? Can I expect this feature to be implemented in the near future?

w158rk avatar Feb 05 '24 08:02 w158rk

@w158rk Sure, we plan to ensure the consistency soon, perhaps within the next one or two releases.

However, it still seems unreasonable to me that the CUDA version should be slower than the CPU version in your example. I'll try your example.

shiyu1994 avatar Feb 05 '24 08:02 shiyu1994

That's great! BTW, this example is just to illustrate the problem; the actual data size and model structure are different. We just want to make sure that it is practical to use the CUDA version, and we'll select the right version based on performance in our actual applications.

w158rk avatar Feb 05 '24 08:02 w158rk

Sorry, I meant to say unreasonable; I've just corrected it. I'll profile the CUDA and CPU versions with your scripts.

shiyu1994 avatar Feb 05 '24 08:02 shiyu1994

Excluding the dataset construction time, I ran the following code with 1 A100 GPU. The CUDA version is about 6 times faster than the CPU version.

import numpy as np
import lightgbm as lgb
from time import time

N, k = int(1e7), int(1e1)
np.random.seed(0)
X = np.random.normal(0, 1, (N, k))
beta = np.random.normal(0, 1, k)
epsilon = np.random.normal(0, 10, N)
Y = X.dot(beta) + epsilon
W = np.abs(np.random.normal(0, 1, N))
# Construct the Dataset up front so that its construction time is excluded
# from both timings below.
train_set = lgb.Dataset(X, label=Y, weight=W, params={"max_bin": 63, "device_type": "cuda"})
train_set.construct()

params = {
    "objective": "regression",
    "num_leaves": 63,
    "learning_rate": 0.1,
    "force_row_wise": True,
    "verbose": 2,
    "deterministic": True,
    "num_threads": 16,
}
params_gpu = params.copy()
params_gpu.update({"device_type": "cuda"})
params_cpu = params.copy()

gpu_start = time()
model_gpu = lgb.train(params_gpu, train_set, num_boost_round=100)
print("finished gpu in %f" % (time() - gpu_start))
cpu_start = time()
model_cpu = lgb.train(params_cpu, train_set, num_boost_round=100)
print("finished cpu in %f" % (time() - cpu_start))

y_pred_gpu = model_gpu.predict(X)
y_pred_cpu = model_cpu.predict(X)
print(y_pred_gpu)
print(y_pred_cpu)
y_dif = np.abs(y_pred_gpu - y_pred_cpu)
print(np.max(y_dif), np.mean(y_dif))

The output is as follows:

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] Although "deterministic" is set, the results ran by GPU may be non-deterministic.
[LightGBM] [Warning] Although "deterministic" is set, the results ran by GPU may be non-deterministic.
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Info] Total Bins 630
[LightGBM] [Info] Number of data points in the train set: 10000000, number of used features: 10
[LightGBM] [Debug] Adding init score = -0.001554
[LightGBM] [Info] Start training from score -0.001554
finished gpu in 2.202397
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Info] Total Bins 630
[LightGBM] [Info] Number of data points in the train set: 10000000, number of used features: 10
[LightGBM] [Info] Start training from score -0.001554
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 18
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 16
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 16
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 18
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 20
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 21
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 19
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 20
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 17
finished cpu in 12.829864
[ 0.29451256 -1.27554319 -4.77001595 ...  0.43338231 -3.89348603
 -0.54135903]
[ 0.32018665 -1.26717548 -4.67964188 ...  0.4488524  -3.88445162
 -0.5272684 ]
2.4588284718378937 0.030497075864870184

shiyu1994 avatar Feb 05 '24 09:02 shiyu1994

It's true, this example runs faster with the CUDA version on my server as well. I'll let you know if I find an example with unexpected performance results. Thanks!

w158rk avatar Feb 05 '24 09:02 w158rk

Thanks. That would be very helpful!

shiyu1994 avatar Feb 05 '24 09:02 shiyu1994