Multiple output regression
How do I perform multiple output regression? Or is it simply not possible?
My current assumption is that I would have to modify the code-base such that XGMatrix supports a matrix as labels and that I would have to create a custom objective function.
My end goal would be to perform regression to output two variables (a point) and to optimise Euclidean loss. Would I be better off making two separate models (one for the x coordinates and one for the y coordinates)?
Or... would I be better off using a random forest regressor within sklearn or some other alternative algorithm?
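For reference, a minimal sketch of the "two separate models" approach mentioned above; X_train, X_test, x_coord and y_coord are hypothetical arrays, and "reg:squarederror" assumes a reasonably recent XGBoost:

import numpy as np
import xgboost as xgb

# one independent regressor per output coordinate
model_x = xgb.XGBRegressor(objective="reg:squarederror")
model_y = xgb.XGBRegressor(objective="reg:squarederror")
model_x.fit(X_train, x_coord)
model_y.fit(X_train, y_coord)

# stack the two predictions back into (x, y) points
points = np.column_stack([model_x.predict(X_test), model_y.predict(X_test)])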
Multivariate/multilabel regression is not currently implemented; see #574 and #680. Tianqi had added some relevant placeholder data structures to the gbtree learner, but no one has had time, I guess, to work out the machinery.
Pity, since many competitions involve multiple outputs.
This would be a really nice feature to have.
Do we have any updates on this?
I'm adding this feature to the feature request tracker: #3439. Hopefully we can get to it at some point.
I agree - this feature would be extremely valuable (exactly what I need right now...)
I also agree, while this is quite trivial to do in neural nets, it would be nice to also be able to do this in xgboost.
Would like to see this feature coming
any reason why it is closed?
@veonua See #3439.
In the meantime, is there any alternative, such as an ensemble of single-output models like this:
from sklearn import multioutput
from xgboost import XGBRegressor

# Fit a model and predict the lens values from the original features
model = XGBRegressor(n_estimators=2000, max_depth=20, learning_rate=0.01)
model = multioutput.MultiOutputRegressor(model)
model.fit(X_train, X_lens_train)
preds = model.predict(X_test)
from: https://gist.github.com/MLWave/4a3f8b0fee43d45646cf118bda4d202a
In the meantime, is there any alternative, such as an ensemble of single-output models?
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
I am also going to weigh in and say that having such a feature would be extremely handy. The MultiOutputRegressor mentioned above is a nice wrapper for building multiple models at once, and it works well for predicting target variables that are independent of one another. However, if the target variables are highly correlated, then you really want to build one model that predicts a vector.
Almost a year has passed since the last comment :-). That is why I want to repeat the wish for such an interesting feature. I would be happy to see it. Thanks anyway for all your work.
Reopening for visibility.
Hello, I have used the scikit-learn estimator and passed it my script (.py) written for multi-output regression, and I could create endpoints. I referred to the following repo: https://github.com/qlanners/ml_deploy/tree/master/aws/scikit-learn/sklearn_estimators_locally. The changes made are:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

Y = dataset.iloc[:, -3:]
X = dataset.iloc[:, :-3]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=100)

gbr = GradientBoostingRegressor()
modelMOR = MultiOutputRegressor(estimator=gbr)
modelMOR.fit(X_train, Y_train)
MultiOutputRegressor is a poor alternative because it doesn't update the eval_set dataset together with the main training (X, y) dataset.
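A hedged sketch of a manual per-target loop that does pass each underlying model its own slice of the validation labels (the function and variable names here are made up for illustration):

import numpy as np
import xgboost as xgb

def fit_per_target(X_train, Y_train, X_valid, Y_valid, **params):
    models = []
    for i in range(Y_train.shape[1]):
        model = xgb.XGBRegressor(**params)
        # each target gets its own label column and its own eval_set
        model.fit(
            X_train, Y_train[:, i],
            eval_set=[(X_valid, Y_valid[:, i])],
            verbose=False,
        )
        models.append(model)
    return models

def predict_per_target(models, X):
    # stack per-target predictions back into an (n_samples, n_targets) array
    return np.column_stack([model.predict(X) for model in models])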
I would love to spend some time on this ...
I have used this approach and it seems to work fine
https://github.com/dmlc/xgboost/issues/2087#issuecomment-534640535
Is there any update on this? Can we make it a joint effort to get multi-output regression available? Irrespective of whether the several responses/y-variables are modelled as independent, it would be great to have xgb.DMatrix accept a list or an np.array with more than one output column.
Created a PR for one-model-per-target implementation. https://github.com/dmlc/xgboost/pull/7309
It doesn't handle correlated targets, which requires vector leaves in the tree model. It's on the radar but needs more planning and refactoring.
Note to myself: We should consider including the possibility of having independent early stopping for each target.
The initial support is merged in https://github.com/dmlc/xgboost/pull/7514 . The feature is still quite primitive at the moment and is considered to be experimental. Thank you to everyone who participated in the thread.
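For readers landing here later, a minimal sketch of what the experimental support looks like from the Python side, assuming an XGBoost build that includes the merged PR (the data here is synthetic, just for illustration):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1994)
X = rng.normal(size=(256, 16))
# two correlated targets derived from the features, shape (256, 2)
y = np.stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]], axis=1)

# a 2-D label array is passed directly; one tree is grown per target per boosting round
reg = xgb.XGBRegressor(tree_method="hist", n_estimators=64)
reg.fit(X, y)
print(reg.predict(X).shape)  # (256, 2)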
I am trying to use a custom objective with the multiple output regressor.
Could you comment on the input and output shapes for the custom objective?
The following seems to work in demo/guide-python/multioutput_regression.py:
import numpy as np

def pseudo_huber_error(y_true, y_pred):
    # labels arrive flattened; reshape to match the (n_samples, n_targets) predictions
    y_true = y_true.reshape(y_pred.shape)
    z = y_pred - y_true
    scale = 1 + z**2
    scale_sqrt = np.sqrt(scale)
    grad = z / scale_sqrt
    hess = 1 / (scale * scale_sqrt)
    # return the gradient and hessian flattened, one entry per sample per target
    return grad.flatten(), hess.flatten()
Therefore, with 100 samples and 2 targets: y_true.shape == (200,), y_pred.shape == (100, 2), and grad.flatten().shape == (200,).
Is that correct?
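A hedged usage sketch wiring the objective above into the scikit-learn wrapper, assuming a recent XGBoost where `objective` accepts a callable with this (y_true, y_pred) signature together with the experimental multi-output support (synthetic data, for illustration only):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.normal(size=(100, 2))  # two targets per sample

reg = xgb.XGBRegressor(
    tree_method="hist",
    n_estimators=32,
    objective=pseudo_huber_error,  # the function defined above
)
reg.fit(X, y)
print(reg.predict(X).shape)  # (100, 2)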
Let me add the parameter to the pseudo-Huber objective instead.
Related https://github.com/dmlc/xgboost/issues/4840
Based on the above discussion, I have extended the univariate XGBoostLSS to a multivariate framework called Multi-Target XGBoostLSS Regression that models multiple targets and their dependencies in a probabilistic regression setting. Code follows soon.
Thank you for sharing! Would love to read the paper this weekend.
@trivialfis Thanks for making the multi-output feature available in the first place!
Would be interested in your feedback, especially on how to improve the runtime for high-dimensional responses. The problem is the known scaling issue of XGBoost for multi-class and multi-output responses, since a separate tree is grown for each target. Can we change the way xgboost is trained?
Can we change the way xgboost is trained?
Yes. I made it work with the exact and approx tree methods (hist is very similar) in my prototype branches. I will focus on approx and hist in the future. One problem with approx (and hist) is that the histogram we build needs to account for all targets. Consider a histogram with 256 bins and a 64x64 image as both input and output (encoder-decoder-like): it will have 64^4 * 256 bins. For a small number of targets this is perfectly fine. However, once the number of targets goes up, we will have challenges in training the model efficiently.
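To put a rough number on that example, here is the arithmetic spelled out (nothing XGBoost-specific, just the bin count implied above):

# histogram entries for the 64x64 encoder-decoder example
n_features = 64 * 64      # 4096 input pixels
n_targets = 64 * 64       # 4096 output pixels, each contributing gradient statistics
max_bins = 256
total_entries = n_features * n_targets * max_bins
print(total_entries)      # 4294967296, i.e. 64**4 * 256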
I think there's a paper addressing the issue by defining a different type of gradient to approximate the gain function further. The new type of gradient is a form of weighted sum of the gradients from all targets. I can't recall the name of that paper off the top of my head; @jameslamb might have better insight, since there was a WIP PR for lightgbm for an implementation by the authors. It's not a perfect solution (correct me if I'm wrong, I haven't really dived into it yet, apologies), since in the end we still need to calculate the full gradient for the leaf values and the approximation doesn't seem to have any error bound, but it's a very good start for investigating the issue.
I started reading your paper today, the review of others' work is very extensive and I'm sure there is a lot of catching up I need to do to follow up on the latest development. Will come back to this once I can get the 1.7 out. Multi-target and probabilistic forecasting are exciting topics I would love to learn more about.