
Enhancement request: use xgboost as base learner

Open ivan-marroquin opened this issue 3 years ago • 18 comments

Hi all,

I have Python 3.6.5 with xgboost 1.1.0 and ngboost 0.3.10

So, when I train an NGBRegressor with xgboost as the base learner, I get the following warning message:

c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py:445: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption

which may be the source of the poor result shown in the left-hand plot of the attached image.

Is it possible to use xgboost as a base learner? Please advise.

The source code is as follows:

import numpy as np
import xgboost as xgb
import ngboost
import matplotlib.pyplot as plt
import multiprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    x, y = load_boston(return_X_y=True)
    y = y.astype(np.float32)
    x = ((x - np.mean(x, axis=0)) / np.std(x, axis=0)).astype(np.float32)

    x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

    # Using xgboost with ngboost
    learner = xgb.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1, objective='reg:squarederror',
                               booster='gbtree', tree_method='exact', n_jobs=cpu_count, learning_rate=0.05,
                               gamma=0.15, reg_alpha=0.20, reg_lambda=0.50, random_state=1969)

    ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, Base=learner,
                                 natural_gradient=True, n_estimators=1, learning_rate=0.01, verbose=False,
                                 random_state=1969)

    ngb_1.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)

    y_preds_1 = ngb_1.predict(x_validation)
    median_abs_error_1 = median_absolute_error(y_validation, y_preds_1)

    # Using only ngboost
    learner = DecisionTreeRegressor(max_depth=6, criterion='friedman_mse', min_impurity_decrease=0, random_state=1969)

    ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, natural_gradient=True,
                                 n_estimators=300, learning_rate=0.01, verbose=False, random_state=1969)

    ngb_2.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)

    y_preds_2 = ngb_2.predict(x_validation)
    median_abs_error_2 = median_absolute_error(y_validation, y_preds_2)

    # Generate plot to compare results
    fig, ax = plt.subplots(nrows=1, ncols=2)

    ax[0].plot(range(0, len(y_validation)), y_validation, '-k')
    ax[0].plot(range(0, len(y_validation)), y_preds_1, '--r')
    ax[0].set_title("XGBOOST + NGBOOST: \n MedianAbsError {:.4f}".format(median_abs_error_1))

    ax[1].plot(range(0, len(y_validation)), y_validation, '-k')
    ax[1].plot(range(0, len(y_validation)), y_preds_2, '--r')
    ax[1].set_title("NGBOOST \n MedianAbsError {:.4f}".format(median_abs_error_2))

    plt.show()

comparison_xgboost-ngboost_against_only_ngboost.zip

ivan-marroquin avatar Apr 28 '21 16:04 ivan-marroquin

You would want to make at least two changes to your code:

  1. The base learner needs to be a Python constructor, so that each boosting stage gets its own model. In your code it is a pre-instantiated object which gets repurposed/refit (i.e. modified) at every subsequent boosting stage, so in effect your whole boosted model is no more expressive than a single base learner.

  2. Ideally you want your xgboost base learner to have n_estimators=1 and the NGBoost model to have n_estimators=300, not the other way around (a minimal sketch follows below).
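
A minimal sketch of what that second change might look like, reusing the x_train/x_validation split from the code above. The hyperparameter values are only illustrative, and this passes an instance rather than a constructor, which relies on recent ngboost releases cloning the Base object at every stage:

import ngboost
import xgboost as xgb

# single-tree xgboost model used as the weak learner at each NGBoost stage
learner = xgb.XGBRegressor(max_depth=6, n_estimators=1, objective='reg:squarederror', random_state=1969)

# NGBoost itself carries the 300 boosting stages
ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, Base=learner,
                           natural_gradient=True, n_estimators=300, learning_rate=0.01, random_state=1969)

ngb.fit(x_train, y_train)
y_preds = ngb.predict(x_validation)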

This is an interesting experiment and I would love to see how it works out! Thanks for giving it a shot and sharing the results!

avati avatar Apr 28 '21 16:04 avati

Hi @avati

Thanks for your prompt answer. I made the change to the code, setting xgboost n_estimators=1 and NGBoost n_estimators=300. Unfortunately, I still get the same result.

By any chance, do you have a Python code example on how to change the xgboost model to be more like a Python constructor?

Ivan

ivan-marroquin avatar Apr 28 '21 18:04 ivan-marroquin

Here's one way. Instead of:

learner = xgb.XGBRegressor(...)

do:

learner = lambda args: xgb.XGBRegressor(args)

avati avatar Apr 28 '21 18:04 avati

Hi @avati

Thanks for the suggestion. Before pursuing more work with xgboost, I tried the following code:

#_________________
from sklearn.ensemble import GradientBoostingRegressor

learner = GradientBoostingRegressor(loss='ls', learning_rate=0.05, n_estimators=1, criterion='mse',
                                    max_depth=6, min_impurity_decrease=0, random_state=1969)

ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, Base=learner,
                           natural_gradient=True, n_estimators=300, learning_rate=0.01, verbose=False,
                           random_state=1969)

ngb.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)

y_preds = ngb.predict(x_validation)
#_________________

It gave a reasonable result which could be improved by playing with the hyperparameters.

This shows NGBoost's strength in accepting learners from the scikit-learn library.

On the other hand, xgboost (although I am using its scikit-learn API) does not seem to work well with NGBoost, as you explained. Could it be that xgboost's scikit-learn API is missing something required by NGBoost?

Do you have more suggestions?

Ivan

ivan-marroquin avatar Apr 28 '21 19:04 ivan-marroquin

The same suggestion as in my previous comment: define the learner with a 'lambda' as shown, whether it is for XGB or GBR.

avati avatar May 02 '21 06:05 avati

Hi @avati

Thanks for the suggestion. I tried the command with lambda, and got this message:

Cannot clone object '<function at 0x000001F05A98A840>' (type <class 'function'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method

I am pretty sure that I am missing something on how to implement this approach. Could you provide a more detailed code example?

Ivan

ivan-marroquin avatar May 03 '21 14:05 ivan-marroquin

I also want to use LightGBM as a base learner and have the same issue as @ivan-marroquin. Could you provide some advice?

caiquanyou avatar Jul 21 '21 01:07 caiquanyou

Hi @caiquanyou

I think I found a way to run xgboost with ngboost (and perhaps it applies to lightgbm as well). I found this publication: https://www.researchgate.net/publication/349528379_Reliable_Evapotranspiration_Predictions_with_a_Probabilistic_Machine_Learning_Framework

and the code source used in this publication can be found at: https://codeocean.com/capsule/5244281/tree/v1

To make it work with xgboost, you need to set its number of estimators (along with the number of trees used in ngboost). I have xgboost 1.1.0 and ngboost 0.3.10.

I used the toy example from ngboost (adapted to work with xgboost):

import numpy as np
import ngboost
import xgboost
import multiprocessing
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    x, y = load_boston(return_X_y=True)

    mean_scaler = np.mean(x, axis=0)
    std_scaler = np.std(x, axis=0)
    x = (x - mean_scaler) / std_scaler

    x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

    # using only ngboost
    ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE,
                                 natural_gradient=True, n_estimators=300, learning_rate=0.01,
                                 verbose=False, random_state=1969)

    ngb_1.fit(x_train, y_train)

    y_preds_ngboost = ngb_1.predict(x_validation)

    # using xgboost with ngboost
    learner = xgboost.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1, objective='reg:squarederror',
                                   booster='gbtree', tree_method='exact', n_jobs=cpu_count, learning_rate=0.05,
                                   gamma=0.15, reg_alpha=0.20, reg_lambda=0.50, random_state=1969)

    ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE, Base=learner,
                                 natural_gradient=True, n_estimators=300, learning_rate=0.01,
                                 verbose=False, random_state=1969)

    ngb_2.fit(x_train, y_train)

    y_preds_hyboost = ngb_2.predict(x_validation)

    fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 5))

    ax[0].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[0].plot(range(0, len(x_validation)), y_preds_ngboost, '--r', label='ngboost')
    ax[0].set_title("NGBOOST: validation & prediction")
    ax[0].legend()

    ax[1].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[1].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[1].set_title("HYBOOST: validation & prediction")
    ax[1].legend()

    ax[2].plot(range(0, len(x_validation)), y_preds_ngboost, '-k', label='ngboost')
    ax[2].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[2].set_title("NGBOOST - HYBOOST: prediction")
    ax[2].legend()

    plt.show()

Note that xgboost will raise the following warning message:

Warning (from warnings module): File "C:\Temp\Python\Python3.6.5\lib\site-packages\xgboost\core.py", line 445: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption

I don't know whether this issue influences the quality of the result. Let me know what you find on your side.

Hope this helps,

Ivan

ivan-marroquin avatar Jul 30 '21 14:07 ivan-marroquin

That warning shouldn't influence the predictions, but it will increase the RAM consumption of the computation. I'd be interested in hearing more experiences with using other packages as the Base learner.

thomasaarholt avatar Aug 06 '21 07:08 thomasaarholt

In case it's useful, I've written a "native" xgboost version of ngboost, implemented in the xgboost scikit-learn API.
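
For anyone wanting to try it, here is a minimal usage sketch based on the xgboost-distribution README at the time; the XGBDistribution class name, the distribution argument, and the loc/scale fields of the prediction are assumptions that may differ across versions:

from xgboost_distribution import XGBDistribution  # assumed package/class name
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

x, y = load_boston(return_X_y=True)
x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

# fit a normal distribution per sample with native xgboost boosting
model = XGBDistribution(distribution="normal", n_estimators=300)
model.fit(x_train, y_train)

# predict returns the fitted distribution parameters (mean and standard deviation here)
preds = model.predict(x_validation)
mean, std = preds.loc, preds.scale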

CDonnerer avatar Aug 14 '21 09:08 CDonnerer

Exciting! Looking forward to checking it out!

thomasaarholt avatar Aug 14 '21 10:08 thomasaarholt

In case it's useful, I've written a "native" xgboost version of ngboost, implemented in the xgboost scikit-learn API.

This is fantastic @CDonnerer. If you're willing, I'd love to have features like these ported into the core NGBoost library. We've had previous discussions on how to make ngboost faster and easier to develop that you would be more than welcome to contribute to.

alejandroschuler avatar Aug 14 '21 21:08 alejandroschuler

In case it's useful, I've written a "native" xgboost version of ngboost, implemented in the xgboost scikit-learn API.

Really cool library! Related question: does xgboost-distribution offer a GPU implementation like xgboost, or nah? I'm assuming the relative performance numbers are for runs on the CPU, right?

astrogilda avatar Aug 15 '21 05:08 astrogilda

@alejandroschuler Thanks! Sure, I'll have a look at those discussions, there might be options to port those features across in a generic way.

@astrogilda No GPU support for xgboost-distribution yet, indeed, the performance numbers refer to CPU runs.

CDonnerer avatar Aug 15 '21 15:08 CDonnerer

@CDonnerer - just want to say that's a fantastic library you've written. I don't know how practical it would be to port the features over to NGBoost as @alejandroschuler suggested, and the coding is way over my head, but if that's at all possible, as a user, that would be a great solution (rather than having development forked across two different probabilistic libraries). This would be especially helpful for adding additional distribution support in a consistent way.

kmedved avatar Aug 16 '21 22:08 kmedved

@CDonnerer It seems like there is quite some overlap with XGBoostLSS, an approach I developed in 2019:

https://github.com/StatMixedML/XGBoostLSS

StatMixedML avatar Dec 18 '21 09:12 StatMixedML

@StatMixedML thanks for sharing the link of your approach!

ivan-marroquin avatar Dec 19 '21 17:12 ivan-marroquin

@ivan-marroquin I think this should work. It looks like xgboost's learning rate has an effect even when there is just one tree, and the way this interacts with NGBoost's own learning rate might cause unexpected behavior, which is why learning_rate=1 is set on the base learner below.

learner = xgb.XGBRegressor(max_depth=3, n_estimators=1, learning_rate=1)
ngb_1 = ngboost.NGBRegressor(Base=learner)
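
Assuming the same Boston-housing split as in the earlier snippets (x_train, y_train, x_validation, y_validation), fitting and scoring this configuration would then look something like:

# fit NGBoost with the single-tree, unshrunk xgboost base learner defined above
ngb_1.fit(x_train, y_train)
y_preds_1 = ngb_1.predict(x_validation)

from sklearn.metrics import median_absolute_error
print(median_absolute_error(y_validation, y_preds_1))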

tkzeng avatar Jun 11 '22 22:06 tkzeng