
Multiple output regression

Open miguelmartin75 opened this issue 8 years ago • 26 comments

How do I perform multiple output regression? Or is it simply not possible?

My current assumption is that I would have to modify the code-base such that XGMatrix supports a matrix as labels and that I would have to create a custom objective function.

My end goal is to perform regression that outputs two variables (a point) and optimises Euclidean loss. Would I be better off making two separate models (one for x coordinates and one for y coordinates)?

Or... would I be better off using a random forest regressor within sklearn or some other alternative algorithm?

miguelmartin75 avatar Mar 08 '17 04:03 miguelmartin75

Multivariate/multilabel regression is not currently implemented (see #574, #680). Tianqi added some relevant placeholder data structures to the gbtree learner, but no one has had time, I guess, to work out the machinery.

khotilov avatar Mar 11 '17 06:03 khotilov

A pity, since many competitions involve multiple outputs.

jindongwang avatar Mar 13 '17 00:03 jindongwang

This would be a really nice feature to have.

MarkusBonsch avatar May 10 '17 17:05 MarkusBonsch

Do we have any updates on this?

joel-thomas-wilson avatar Sep 07 '18 04:09 joel-thomas-wilson

I'm adding this feature to the feature request tracker: #3439. Hopefully, we can get to it at some point.

hcho3 avatar Sep 07 '18 18:09 hcho3

I agree - this feature would be extremely valuable (exactly what I need right now...)

JacobKempster avatar Nov 06 '18 17:11 JacobKempster

I also agree: while this is quite trivial to do in neural nets, it would be nice to be able to do it in xgboost as well.

lenselinkbart avatar Jan 31 '19 09:01 lenselinkbart

Would like to see this feature arrive.

cp9612 avatar Mar 26 '19 18:03 cp9612

Any reason why this was closed?

veonua avatar Apr 15 '19 08:04 veonua

@veonua See #3439.

hcho3 avatar Apr 15 '19 08:04 hcho3

In the meantime, is there any alternative, such as an ensemble of single-output models like the following?

from sklearn import multioutput
from xgboost import XGBRegressor

# Fit a model and predict the lens values from the original features
model = XGBRegressor(n_estimators=2000, max_depth=20, learning_rate=0.01)
model = multioutput.MultiOutputRegressor(model)
model.fit(X_train, X_lens_train)
preds = model.predict(X_test)

from: https://gist.github.com/MLWave/4a3f8b0fee43d45646cf118bda4d202a

loretoparisi avatar Sep 24 '19 16:09 loretoparisi

In the meantime, is there any alternative, such as an ensemble of single-output models like the following?

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html

jimmywan avatar Sep 25 '19 03:09 jimmywan

I am also going to weigh in and say that having such a feature would be extremely handy. The MultiOutputRegressor mentioned above is a nice wrapper for building multiple models at once, and it works well for predicting target variables that are independent of one another. However, if the target variables are highly correlated, then you really want to build one model that predicts a vector.

cmottet avatar Jan 22 '20 15:01 cmottet

Almost a year has passed since the last comment :-). That is why I want to repeat the wish for such an interesting feature. I would be happy to see it. Thanks anyway for all your work.

MxNl avatar Jan 07 '21 14:01 MxNl

Reopening for visibility.

hcho3 avatar Jan 21 '21 12:01 hcho3

Multivariate/multilabel regression is not currently implemented (see #574, #680). Tianqi added some relevant placeholder data structures to the gbtree learner, but no one has had time, I guess, to work out the machinery.

Hello, I have used the Scikit-Learn estimator, passed it my script (.py) written for multioutput regression, and was able to create endpoints. I referred to the following repo: https://github.com/qlanners/ml_deploy/tree/master/aws/scikit-learn/sklearn_estimators_locally. The changes made are:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

Y = dataset.iloc[:, -3:]
X = dataset.iloc[:, :-3]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=100)

gbr = GradientBoostingRegressor()
modelMOR = MultiOutputRegressor(estimator=gbr)
modelMOR.fit(X_train, Y_train)

kk26269 avatar Feb 04 '21 14:02 kk26269

MultiOutputRegressor is a poor alternative because it doesn't update the eval_set dataset together with the main training (X, y) dataset.

mirik123 avatar Jul 22 '21 21:07 mirik123
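For illustration, a minimal sketch of a workaround, not part of xgboost or scikit-learn themselves: fit one XGBRegressor per target column by hand so that each sub-model receives its own slice of the validation labels for early stopping. The helper names are hypothetical, and the sketch assumes numpy arrays and xgboost >= 1.6, where early_stopping_rounds is a constructor argument.

import numpy as np
from xgboost import XGBRegressor

def fit_one_model_per_target(X_train, Y_train, X_valid, Y_valid):
    # Hypothetical helper: one independently early-stopped model per target.
    models = []
    for i in range(Y_train.shape[1]):
        model = XGBRegressor(n_estimators=1000, early_stopping_rounds=10)
        # Each sub-model sees only its own target column in eval_set.
        model.fit(X_train, Y_train[:, i],
                  eval_set=[(X_valid, Y_valid[:, i])],
                  verbose=False)
        models.append(model)
    return models

def predict_all_targets(models, X):
    # Stack per-target predictions back into an (n_samples, n_targets) array.
    return np.column_stack([model.predict(X) for model in models])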

I would love to spend some time on this ...

trivialfis avatar Jul 23 '21 08:07 trivialfis

I would love to spend some time on this ...

I have used this approach and it seems to work fine

https://github.com/dmlc/xgboost/issues/2087#issuecomment-534640535

loretoparisi avatar Jul 23 '21 09:07 loretoparisi

Is there any update on this? Can we make it a joint effort to get multioutput regression available? Irrespective of whether the several responses/y-variables are modelled independently, it would be great to have xgb.DMatrix accept a list or an np.array with more than one column.

StatMixedML avatar Sep 14 '21 09:09 StatMixedML

Created a PR for one-model-per-target implementation. https://github.com/dmlc/xgboost/pull/7309

It doesn't handle correlated targets, which requires vector leaves in the tree model. That is on the radar but needs more planning and refactoring.

trivialfis avatar Oct 11 '21 15:10 trivialfis

Note to self: We should consider including the possibility of having independent early stopping for each target.

trivialfis avatar Oct 11 '21 19:10 trivialfis

The initial support is merged in https://github.com/dmlc/xgboost/pull/7514 . The feature is still quite primitive at the moment and is considered to be experimental. Thank you to everyone who participated in the thread.

trivialfis avatar Dec 18 '21 01:12 trivialfis
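As a quick illustration, a minimal usage sketch of the experimental support merged in https://github.com/dmlc/xgboost/pull/7514 (the data and hyperparameters here are made up; the sketch assumes xgboost >= 1.6, where a 2-D label array can be passed directly to fit):

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Two related targets stacked into an (n_samples, 2) array.
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]])

reg = XGBRegressor(tree_method="hist", n_estimators=32)
reg.fit(X, y)
print(reg.predict(X).shape)  # (100, 2)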

The initial support is merged in #7514 . The feature is still quite primitive at the moment and is considered to be experimental. Thank you to everyone who participated in the thread.

I am trying to use a custom objective with the multiple output regressor. Could you comment on the input and output shapes of the custom objective? The following seems to work in demo/guide-python/multioutput_regression.py:

import numpy as np

def pseudo_huber_error(y_true, y_pred):
    # y_true arrives flattened; reshape it to match the (n_samples, n_targets) predictions.
    y_true = y_true.reshape(y_pred.shape)
    z = y_pred - y_true
    scale = 1 + z**2
    scale_sqrt = np.sqrt(scale)
    # Gradient and Hessian of the pseudo-Huber loss with delta = 1.
    grad = z / scale_sqrt
    hess = 1 / (scale * scale_sqrt)
    # Return per-element gradients/hessians, flattened again.
    return grad.flatten(), hess.flatten()

Therefore y_true.shape == (200,), y_pred.shape == (100, 2), and grad.flatten().shape == (200,). Is that correct?

giorizzi avatar Jan 14 '22 12:01 giorizzi
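For context, a hypothetical end-to-end sketch of plugging the objective above into the sklearn wrapper (assuming xgboost >= 1.6 and that the wrapper hands the objective flattened labels and (n_samples, n_targets) predictions, matching the shapes reported above; the data here is made up):

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.column_stack([X[:, 0], X[:, 1]])  # shape (100, 2)

# The sklearn wrapper accepts a callable (y_true, y_pred) -> (grad, hess) objective.
reg = XGBRegressor(objective=pseudo_huber_error, tree_method="hist", n_estimators=16)
reg.fit(X, y)
print(reg.predict(X).shape)  # (100, 2)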

Let me add the parameter to pseudo-Huber instead.

trivialfis avatar Jan 14 '22 14:01 trivialfis

Related https://github.com/dmlc/xgboost/issues/4840

trivialfis avatar Mar 31 '22 23:03 trivialfis

Based on the above discussion, I have extended the univariate XGBoostLSS to a multivariate framework called Multi-Target XGBoostLSS Regression that models multiple targets and their dependencies in a probabilistic regression setting. Code will follow soon.

StatMixedML avatar Oct 14 '22 13:10 StatMixedML

Thank you for sharing! Would love to read the paper this weekend.

trivialfis avatar Oct 15 '22 04:10 trivialfis

@trivialfis Thanks for making the multi-output feature available in the first place!

I would be interested in your feedback, especially on how to improve the runtime for high-dimensional responses. The problem is the known scaling issue of XGBoost for multi-class and multi-output responses, since a separate tree is grown for each target. Can we change the way xgboost is trained?

StatMixedML avatar Oct 15 '22 10:10 StatMixedML

Can we change the way xgboost is trained?

Yes. I made it work with the exact and approx tree methods (hist is very similar) in my prototype branches. I will focus on approx and hist in the future. One problem with approx (and hist) is that the histogram we build needs to account for all targets. Consider a histogram with 256 bins and a 64x64 image as both input and output (encoder-decoder-like): it will have 64^4 * 256 bins. For a small number of targets this is perfectly fine, but once the number of targets goes up we will have challenges training the model efficiently.
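As a back-of-the-envelope check of that bin count (assuming one histogram cell per feature x bin x target):

n_features = 64 * 64  # 64x64 image as input
n_targets = 64 * 64   # 64x64 image as output
n_bins = 256
print(n_features * n_bins * n_targets)  # 4294967296 == 64**4 * 256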

I think there's a paper addressing the issue by defining a different type of gradient to approximate the gain function further. The new type of gradient is a form of weighted sum of the gradients from all targets. I can't recall the name of the paper off the top of my head; @jameslamb might have better insight, since there was a WIP PR implementing it for lightgbm by the authors. It's not a perfect solution (correct me if I'm wrong, I haven't really dived into it yet, apologies), since in the end we still need to calculate the full gradient for the leaf values and the approximation doesn't seem to have any error bound, but it's a very good start for investigating the issue.

I started reading your paper today; the review of others' work is very extensive, and I'm sure there is a lot of catching up I need to do to follow the latest developments. I will come back to this once I can get 1.7 out. Multi-target and probabilistic forecasting are exciting topics I would love to learn more about.

trivialfis avatar Oct 16 '22 06:10 trivialfis