Bootstrap Confidence Intervals for XGBoost regression (Python)

Open Shafi2016 opened this issue 4 years ago • 5 comments

I want to construct bootstrap confidence intervals for XGBoost regression in Python. I based my code on this example (https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/#comment-528118). Question: I am getting a one-bin histogram. I get only a single value for the score across the n_iterations bootstrap runs. I think the problem is related to the way I compute the RMSE. I tried computing the RMSE in several different ways, but I could not solve the problem. How can we solve it?

image

import numpy
from pandas import read_csv
from sklearn.datasets import load_boston
from sklearn.utils import resample
from matplotlib import pyplot
from xgboost import XGBRegressor
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

load dataset

boston_dataset = load_boston()

df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

df['MEDV'] = boston_dataset.target
values1 = df.values

configure bootstrap

n_iterations = 1000
n_size = int(len(df) * 0.50)

run bootstrap

stats = list()

def rmse_calculator(predicted, actual):
    assert len(predicted) == len(actual)
    return np.sqrt(np.mean(np.power(predicted - actual, 2)))

for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values1, n_samples=n_size)
    test = numpy.array([x for x in values1 if x.tolist() not in train.tolist()])

    model = XGBRegressor()  ## Final for the papers

    X_train = train[:,:-1]
    y_train = train[:,-1]
    X_test = test[:,:-1]
    y_test = test[:,-1]

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)  # make predictions

    score = rmse_calculator(y_test, predictions)

    #score = mean_squared_error(y_test, predictions) ** 0.5
    yt = np.asarray(y_test)
    y_pred = np.asarray(predictions)
    score = np.sqrt(mean_squared_error(yt, y_pred))
    print(score)
    stats.append(score)

plot scores

pyplot.hist(stats)
pyplot.show()

confidence intervals

alpha = 0.95
p = ((1.0-alpha)/2.0) * 100
lower = max(0.0, numpy.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, numpy.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))

Shafi2016 avatar Apr 02 '20 14:04 Shafi2016
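A note on the interval code above: clamping with max(0.0, ...) and min(1.0, ...) suits accuracy, which lies in [0, 1], but RMSE is not bounded above by 1. The flattened listing also does not preserve indentation, so it is worth confirming that fitting, scoring, and stats.append(score) all run inside the loop. What follows is only a minimal sketch of the same out-of-bag bootstrap, using row indices rather than row comparisons and an unclamped percentile interval. The helper names bootstrap_rmse and percentile_interval, the synthetic data, and the LeastSquares stand-in model are assumptions for illustration; any object with fit and predict methods, such as XGBRegressor, could presumably be passed as make_model instead.

```python
import numpy as np


class LeastSquares:
    """Hypothetical stand-in model with fit/predict (not part of the thread)."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])  # add an intercept column
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


def bootstrap_rmse(make_model, X, y, n_iterations=200, frac=0.5, seed=0):
    """Collect one out-of-bag RMSE per bootstrap iteration."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_size = int(n * frac)
    scores = []
    for _ in range(n_iterations):
        # Sample row indices with replacement for the training set.
        train_idx = rng.integers(0, n, size=n_size)
        # Rows that were never drawn form the out-of-bag test set.
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[train_idx] = False
        # Fit a fresh model and score it inside the loop.
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[oob_mask])
        scores.append(float(np.sqrt(np.mean((y[oob_mask] - pred) ** 2))))
    return np.array(scores)


def percentile_interval(scores, alpha=0.95):
    """Percentile interval without clamping to [0, 1]."""
    lower_p = (1.0 - alpha) / 2.0 * 100
    upper_p = (alpha + (1.0 - alpha) / 2.0) * 100
    return np.percentile(scores, lower_p), np.percentile(scores, upper_p)


# Synthetic regression data (an assumption; not the Boston data set).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + data_rng.normal(scale=3.0, size=200)

scores = bootstrap_rmse(LeastSquares, X, y)
lower, upper = percentile_interval(scores)
print('%.0f%% interval for RMSE: %.3f to %.3f' % (95, lower, upper))
```

Because each iteration draws different rows, the collected scores differ from one another, and the interval is reported in the units of the target rather than as a percentage.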

Try plotting the data to confirm there is a distribution. Perhaps there is not.

If there is, try changing the number of bins in the histogram plot.

jbrownlee avatar Apr 02 '20 19:04 jbrownlee
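One way to act on this suggestion before touching the plot is to inspect the collected scores numerically. A minimal sketch, assuming stats is the list filled by the bootstrap loop; summarize_scores is a hypothetical helper name, and the example values are made up:

```python
import numpy as np


def summarize_scores(stats, bins=15):
    """Report how many scores were collected and how spread out they are."""
    scores = np.asarray(stats, dtype=float)
    counts, _edges = np.histogram(scores, bins=bins)
    return {
        "n": int(scores.size),
        "n_unique": int(np.unique(scores).size),
        "min": float(scores.min()),
        "max": float(scores.max()),
        "nonempty_bins": int(np.count_nonzero(counts)),
    }


# Made-up examples: one repeated score versus scores that actually vary.
constant = summarize_scores([4.2] * 100)
varied = summarize_scores(np.random.default_rng(0).normal(4.2, 0.3, size=100))
print(constant)
print(varied)
```

If n or n_unique is 1, no choice of bins can show more than one bar, and the loop that fills stats is the place to look. Otherwise, a call such as pyplot.hist(stats, bins=15) should show the spread.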

Thanks a lot. Yes, I tried changing the number of bins, but it did not work:

sns.distplot(stats, hist=True, kde=False, bins=int(30/2), color = 'blue', hist_kws={'edgecolor':'black'})

I also ran the same procedure with XGBClassifier on this data (https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv). It works fine.

image

import numpy
from pandas import read_csv
from sklearn.utils import resample
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

load dataset

data = read_csv('pima-indians-diabetes.data.csv', header=None)
values = data.values

configure bootstrap

n_iterations = 100
n_size = int(len(data) * 0.50)

run bootstrap

stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)
    test = numpy.array([x for x in values if x.tolist() not in train.tolist()])
    # fit model
    model = XGBClassifier()
    model.fit(train[:,:-1], train[:,-1])
    # evaluate model
    predictions = model.predict(test[:,:-1])
    score = accuracy_score(test[:,-1], predictions)
    print(score)
    stats.append(score)

plot scores

pyplot.hist(stats)
pyplot.show()

confidence intervals

alpha = 0.95
p = ((1.0-alpha)/2.0) * 100
lower = max(0.0, numpy.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, numpy.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))

Shafi2016 avatar Apr 02 '20 20:04 Shafi2016

I also plotted the histogram of the predictions (XGBoost regression). It seems fine:

image

Shafi2016 avatar Apr 02 '20 21:04 Shafi2016

Hi, I get this error with the classifier: 'continuous is not supported'. How can I solve it?

yahmadyar95 avatar May 16 '22 09:05 yahmadyar95
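In scikit-learn, this message is raised by classification metrics such as accuracy_score when the target is continuous, for example if a regression target like MEDV from the first listing is used with the classifier example. Whether that is what happened here is an assumption, since the comment includes neither code nor a traceback. A rough numpy-only check of whether a target column looks like class labels; has_fractional_values is a hypothetical helper and the sample arrays are made up:

```python
import numpy as np


def has_fractional_values(y):
    """True if any target value is not a whole number (looks continuous)."""
    y = np.asarray(y, dtype=float)
    return bool(np.any(y != np.floor(y)))


y_regression = np.array([24.0, 21.6, 34.7, 33.4])  # made-up continuous target
y_labels = np.array([0, 1, 1, 0])                   # made-up class labels

print(has_fractional_values(y_regression))
print(has_fractional_values(y_labels))

# One possible fix, if a classifier is really wanted: derive class labels
# from the continuous target, here by thresholding at the median.
y_binned = (y_regression > np.median(y_regression)).astype(int)
print(y_binned)
```

If the target is genuinely continuous, a regressor scored with a regression metric such as RMSE is the usual alternative to binning.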

Hi Dmlc/Xgboost,

Thanks for asking.

I’m eager to help, but I just don’t have the capacity to debug code for you.

I am happy to make some suggestions:

  • Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
  • Consider cutting the problem back to just one or a few simple examples.
  • Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
  • Consider posting your question and code to StackOverflow.

Regards,

Jason Brownlee, Ph.D. Making Developers Awesome at Machine Learning

Do you need help with machine learning? Visit: MachineLearningMastery.com http://machinelearningmastery.com/

jbrownlee avatar May 16 '22 23:05 jbrownlee