LightGBM
Median wrongly computed
import pandas as pd
import lightgbm as lgb

# Toy dataset with an outlier; the median of Y is 3
data = {"X": [1, 2, 3, 4, 5],
        "Y": [1, 2, 3, 4, 100]}
df = pd.DataFrame(data)
X_train, y_train = df[["X"]], df["Y"]

# Create a LightGBM dataset from the training data
train_data = lgb.Dataset(X_train, label=y_train)

# Set the parameters for the LightGBM model
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression_l1',  # Use MAE for the objective
    'metric': 'mae',
    'num_leaves': 2,
    'min_data_in_leaf': 10,  # More samples per leaf than the dataset has, so no split is possible
    'learning_rate': 1,
    'feature_fraction': 1.0,
    'num_iterations': 1,  # Only use one tree
}

# Train the LightGBM model
model = lgb.train(params, train_data)

# Make predictions on the training data
y_pred = model.predict(X_train)

# Output the predictions
print("Predictions:")
print(y_pred)
Given the above code, the tree cannot split because min_data_in_leaf exceeds the dataset size, so the model should predict a constant equal to the median of Y (3). However, the output is:
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 5, number of used features: 0
[LightGBM] [Info] Start training from score 3.500000
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
Predictions:
[3.5 3.5 3.5 3.5 3.5]
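As a quick sanity check on the expected value, using only the data from the snippet above, NumPy's median agrees with 3, not 3.5:

```python
import numpy as np

# Targets from the example above; with five values the median is the
# middle element of the sorted array, sorted_y[2] = 3.
y = np.array([1, 2, 3, 4, 100])
med = np.median(y)
print(med)  # -> 3.0
```

The 3.5 that LightGBM reports is consistent with interpolating halfway between the sorted values at positions 2 and 3 (i.e. between 3 and 4), which is exactly what an off-by-one percentile position would produce.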
I found the issue. It's this line:
https://github.com/microsoft/LightGBM/blob/ac37bf8a1dcc981dedadd9943e338e10ec072c01/src/objective/regression_objective.hpp#L27
With cnt_data = 5 and alpha = 0.5, the position of the median should be 2, but this formula gives 2.5.
The correct formula should be:
const double float_pos = static_cast<double>(cnt_data - 1) * alpha;
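To illustrate the difference, here is a small Python sketch (not the actual C++ implementation; it assumes the usual linear interpolation between the two neighboring sorted values, which matches the 3.5 seen in the log) comparing the two position formulas on this dataset:

```python
import numpy as np

sorted_y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # sorted targets
cnt_data, alpha = len(sorted_y), 0.5

def percentile_at(float_pos, values):
    # Linear interpolation between the neighboring sorted values
    left = int(np.floor(float_pos))
    right = int(np.ceil(float_pos))
    frac = float_pos - left
    return values[left] * (1 - frac) + values[right] * frac

# Current formula: position cnt_data * alpha = 2.5, which interpolates
# between sorted_y[2] = 3 and sorted_y[3] = 4.
current = percentile_at(cnt_data * alpha, sorted_y)
print(current)  # -> 3.5

# Proposed formula: position (cnt_data - 1) * alpha = 2.0, which lands
# exactly on the true median sorted_y[2] = 3.
proposed = percentile_at((cnt_data - 1) * alpha, sorted_y)
print(proposed)  # -> 3.0
```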
Is this intentional, or is it a bug?