
Multiple output regression

Open miguelmartin75 opened this issue 8 years ago • 26 comments

How do I perform multiple output regression? Or is it simply not possible?

My current assumption is that I would have to modify the code-base such that XGMatrix supports a matrix as labels and that I would have to create a custom objective function.

My end goal is to perform regression that outputs two variables (a point) and optimises Euclidean loss. Would I be better off making two separate models (one for x coordinates and one for y coordinates)?

Or... would I be better off using a random forest regressor within sklearn or some other alternative algorithm?

miguelmartin75 avatar Mar 08 '17 04:03 miguelmartin75

Multivariate/multilabel regression is not currently implemented (see #574, #680). Tianqi added some relevant placeholder data structures to the gbtree learner, but no one has had time, I guess, to work out the machinery.

khotilov avatar Mar 11 '17 06:03 khotilov

A pity, since many competitions involve multiple outputs.

jindongwang avatar Mar 13 '17 00:03 jindongwang

This would be a really nice feature to have.

MarkusBonsch avatar May 10 '17 17:05 MarkusBonsch

Do we have any updates on this?

joel-thomas-wilson avatar Sep 07 '18 04:09 joel-thomas-wilson

I'm adding this feature to the feature request tracker: #3439. Hopefully, we can get to it at some point.

hcho3 avatar Sep 07 '18 18:09 hcho3

I agree - this feature would be extremely valuable (exactly what I need right now...)

JacobKempster avatar Nov 06 '18 17:11 JacobKempster

I also agree: while this is quite trivial to do in neural nets, it would be nice to be able to do it in xgboost as well.

lenselinkbart avatar Jan 31 '19 09:01 lenselinkbart

Would like to see this feature arrive.

cp9612 avatar Mar 26 '19 18:03 cp9612

Any reason why this was closed?

veonua avatar Apr 15 '19 08:04 veonua

@veonua See #3439.

hcho3 avatar Apr 15 '19 08:04 hcho3

In the meantime, is there any alternative, such as an ensemble of single-output models like the following?

from sklearn import multioutput
from xgboost import XGBRegressor

# Fit a model and predict the lens values from the original features
model = XGBRegressor(n_estimators=2000, max_depth=20, learning_rate=0.01)
model = multioutput.MultiOutputRegressor(model)
model.fit(X_train, X_lens_train)
preds = model.predict(X_test)

from: https://gist.github.com/MLWave/4a3f8b0fee43d45646cf118bda4d202a

loretoparisi avatar Sep 24 '19 16:09 loretoparisi

In the meantime, is there any alternative, such as an ensemble of single-output models like the following?

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html

jimmywan avatar Sep 25 '19 03:09 jimmywan

I am also going to weigh in and say that having such a feature would be extremely handy. The MultiOutputRegressor mentioned above is a nice wrapper for building multiple models at once, and it works well for predicting target variables that are independent of one another. However, if the target variables are highly correlated, then you really want to build one model that predicts a vector.

cmottet avatar Jan 22 '20 15:01 cmottet

Almost a year has passed since the last comment :-). That is why I want to repeat the wish for such an interesting feature. I would be happy to see it. Thanks anyway for all your work.

MxNl avatar Jan 07 '21 14:01 MxNl

Reopening for visibility.

hcho3 avatar Jan 21 '21 12:01 hcho3

Multivariate/multilabel regression is not currently implemented (see #574, #680). Tianqi added some relevant placeholder data structures to the gbtree learner, but no one has had time, I guess, to work out the machinery.

Hello, I have used the Scikit-Learn estimator, passed it my script (.py) written for multioutput regression, and was able to create endpoints. I referred to the following repo: https://github.com/qlanners/ml_deploy/tree/master/aws/scikit-learn/sklearn_estimators_locally. The changes made are:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

Y = dataset.iloc[:, -3:]
X = dataset.iloc[:, :-3]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=100)

gbr = GradientBoostingRegressor()
modelMOR = MultiOutputRegressor(estimator=gbr)
modelMOR.fit(X_train, Y_train)

kk26269 avatar Feb 04 '21 14:02 kk26269

MultiOutputRegressor is a poor alternative because it doesn't update the eval_set dataset together with the main training (X, y) dataset.

mirik123 avatar Jul 22 '21 21:07 mirik123
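For illustration, a minimal sketch of a workaround, not part of xgboost or scikit-learn themselves: fit one XGBRegressor per target column by hand so that each sub-model receives its own slice of the validation labels for early stopping. The helper names are hypothetical, and the sketch assumes numpy arrays and xgboost >= 1.6, where early_stopping_rounds is a constructor argument.

import numpy as np
from xgboost import XGBRegressor

def fit_one_model_per_target(X_train, Y_train, X_valid, Y_valid):
    # Hypothetical helper: one independently early-stopped model per target.
    models = []
    for i in range(Y_train.shape[1]):
        model = XGBRegressor(n_estimators=1000, early_stopping_rounds=10)
        # Each sub-model sees only its own target column in eval_set.
        model.fit(X_train, Y_train[:, i],
                  eval_set=[(X_valid, Y_valid[:, i])],
                  verbose=False)
        models.append(model)
    return models

def predict_all_targets(models, X):
    # Stack per-target predictions back into an (n_samples, n_targets) array.
    return np.column_stack([model.predict(X) for model in models])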

I would love to spend some time on this ...

trivialfis avatar Jul 23 '21 08:07 trivialfis

I would love to spend some time on this ...

I have used this approach and it seems to work fine

https://github.com/dmlc/xgboost/issues/2087#issuecomment-534640535

loretoparisi avatar Jul 23 '21 09:07 loretoparisi

Is there any update on this? Can we make it a joint effort to get multioutput regression available? Irrespective of whether the several responses/y-variables are modelled independently, it would be great to have xgb.DMatrix accept a list or an np.array with more than one column.

StatMixedML avatar Sep 14 '21 09:09 StatMixedML

Created a PR for one-model-per-target implementation. https://github.com/dmlc/xgboost/pull/7309

It doesn't handle correlated targets, which requires vector leaves in the tree model. That is on the radar but needs more planning and refactoring.

trivialfis avatar Oct 11 '21 15:10 trivialfis

Note to self: We should consider including the possibility of having independent early stopping for each target.

trivialfis avatar Oct 11 '21 19:10 trivialfis

The initial support is merged in https://github.com/dmlc/xgboost/pull/7514 . The feature is still quite primitive at the moment and is considered to be experimental. Thank you to everyone who participated in the thread.

trivialfis avatar Dec 18 '21 01:12 trivialfis
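As a quick illustration, a minimal usage sketch of the experimental support merged in https://github.com/dmlc/xgboost/pull/7514 (the data and hyperparameters here are made up; the sketch assumes xgboost >= 1.6, where a 2-D label array can be passed directly to fit):

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Two related targets stacked into an (n_samples, 2) array.
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]])

reg = XGBRegressor(tree_method="hist", n_estimators=32)
reg.fit(X, y)
print(reg.predict(X).shape)  # (100, 2)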

The initial support is merged in #7514 . The feature is still quite primitive at the moment and is considered to be experimental. Thank you to everyone who participated in the thread.

I am trying to use a custom objective with the multiple output regressor. Could you comment on the input and output shapes of the custom objective? The following seems to work in demo/guide-python/multioutput_regression.py:

import numpy as np

def pseudo_huber_error(y_true, y_pred):
    # y_true arrives flattened; reshape it to match the (n_samples, n_targets) predictions.
    y_true = y_true.reshape(y_pred.shape)
    z = y_pred - y_true
    scale = 1 + z**2
    scale_sqrt = np.sqrt(scale)
    # Gradient and Hessian of the pseudo-Huber loss with delta = 1.
    grad = z / scale_sqrt
    hess = 1 / (scale * scale_sqrt)
    # Return per-element gradients/hessians, flattened again.
    return grad.flatten(), hess.flatten()

Therefore y_true.shape == (200,), y_pred.shape == (100, 2), and grad.flatten().shape == (200,). Is that correct?

giorizzi avatar Jan 14 '22 12:01 giorizzi
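For context, a hypothetical end-to-end sketch of plugging the objective above into the sklearn wrapper (assuming xgboost >= 1.6 and that the wrapper hands the objective flattened labels and (n_samples, n_targets) predictions, matching the shapes reported above; the data here is made up):

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.column_stack([X[:, 0], X[:, 1]])  # shape (100, 2)

# The sklearn wrapper accepts a callable (y_true, y_pred) -> (grad, hess) objective.
reg = XGBRegressor(objective=pseudo_huber_error, tree_method="hist", n_estimators=16)
reg.fit(X, y)
print(reg.predict(X).shape)  # (100, 2)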

Let me add the parameter to pseudo-Huber instead.

trivialfis avatar Jan 14 '22 14:01 trivialfis

Related https://github.com/dmlc/xgboost/issues/4840

trivialfis avatar Mar 31 '22 23:03 trivialfis

Based on the above discussion, I have extended the univariate XGBoostLSS to a multivariate framework called Multi-Target XGBoostLSS Regression that models multiple targets and their dependencies in a probabilistic regression setting. Code will follow soon.

StatMixedML avatar Oct 14 '22 13:10 StatMixedML

Thank you for sharing! Would love to read the paper this weekend.

trivialfis avatar Oct 15 '22 04:10 trivialfis

@trivialfis Thanks for making the multi-output feature available in the first place!

I would be interested in your feedback, especially on how to improve the runtime for high-dimensional responses. The problem is the known scaling issue of XGBoost for multi-class and multi-output responses, since a separate tree is grown for each target. Can we change the way xgboost is trained?

StatMixedML avatar Oct 15 '22 10:10 StatMixedML

Can we change the way xgboost is trained?

Yes. I made it work with the exact and approx tree methods (hist is very similar) in my prototype branches. I will focus on approx and hist in the future. One problem with approx (and hist) is that the histogram we build needs to account for all targets. Consider a histogram with 256 bins and a 64x64 image as both input and output (encoder-decoder-like): it will have 64^4 * 256 bins. For a small number of targets this is perfectly fine, but once the number of targets goes up we will have challenges training the model efficiently.
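As a back-of-the-envelope check of that bin count (assuming one histogram cell per feature x bin x target):

n_features = 64 * 64  # 64x64 image as input
n_targets = 64 * 64   # 64x64 image as output
n_bins = 256
print(n_features * n_bins * n_targets)  # 4294967296 == 64**4 * 256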

I think there's a paper addressing the issue by defining a different type of gradient to approximate the gain function further. The new type of gradient is a form of weighted sum of the gradients from all targets. I can't recall the name of the paper off the top of my head; @jameslamb might have better insight, since there was a WIP PR implementing it for lightgbm by the authors. It's not a perfect solution (correct me if I'm wrong, I haven't really dived into it yet, apologies), since in the end we still need to calculate the full gradient for the leaf values and the approximation doesn't seem to have any error bound, but it's a very good start for investigating the issue.

I started reading your paper today; the review of others' work is very extensive, and I'm sure there is a lot of catching up I need to do to follow the latest developments. I will come back to this once I can get 1.7 out. Multi-target and probabilistic forecasting are exciting topics I would love to learn more about.

trivialfis avatar Oct 16 '22 06:10 trivialfis