Performance of your model on regression tasks
Description
@kingfengji Thanks for making the code available. I believe that multi-layered decision trees are a very elegant and powerful approach! I was applying your model to the Boston housing dataset but wasn't able to outperform a baseline xgboost model.
Details
To compare your approach to several alternatives, I ran a small benchmark study using the following approaches, where all models share the same hyper-parameters (a minimal setup sketch follows this list):
- baseline xgboost model (xgboost)
- mGBDT with xgboost for hidden and output layer (mGBDT_XGBoost)
- mGBDT with xgboost for hidden but with linear model for output layer (mGBDT_Linear)
- linear model as implemented here (Linear)
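For reference, here is a minimal sketch of how I set up the mGBDT_XGBoost variant. The `MGBDT` / `MultiXGBModel` calls follow my reading of this repo's README; the exact signatures, the hyper-parameter values, and whether the loss can be passed as `"L1Loss"` are assumptions on my side, so please correct me if I am misusing the API.

```python
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from mgbdt import MGBDT, MultiXGBModel  # API as I understand it from the README

# Boston housing: 13 input features, 1 regression target
X, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# mGBDT with an xgboost hidden layer and an xgboost output layer,
# trained with an L1 objective (mirroring PyTorch's L1Loss)
net = MGBDT(loss="L1Loss", target_lr=1.0, epsilon=0.1)
net.add_layer(
    "tp_layer",
    F=MultiXGBModel(input_size=13, output_size=16, learning_rate=0.1,
                    max_depth=5, num_boost_round=5),
    G=None,  # the first layer needs no inverse mapping
)
net.add_layer(
    "tp_layer",
    F=MultiXGBModel(input_size=16, output_size=1, learning_rate=0.1,
                    max_depth=5, num_boost_round=5),
    G=MultiXGBModel(input_size=1, output_size=16, learning_rate=0.1,
                    max_depth=5, num_boost_round=5),
)
net.init(x_train, n_rounds=5)           # initialize the forward mappings
net.fit(x_train, y_train, n_epochs=50)  # joint training via target propagation

y_pred = net.forward(x_test)
print("MAE:", np.mean(np.abs(np.ravel(y_pred) - y_test)))
```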
I am using PyTorch's `L1Loss` for model training and the MAE for evaluation; all models are trained in serial mode. Results are as follows:
In particular, I observe the following:
- irrespective of the hyper-parameters and the number of epochs, a baseline xgboost model tends to outperform your approach
- with an increasing number of epochs, the runtime per epoch increases considerably. Any idea why this happens?
- using mGBDT_Linear,
  - I wasn't able to use PyTorch's `MSELoss`, since the loss exploded after some iterations, even after normalizing `X`. Should we, similar to neural networks, also scale `y` to avoid exploding gradients? (A sketch of the target scaling I have in mind follows this list.)
  - the training loss starts at exceptionally high values, then decreases before it starts to increase again
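To make the scaling question concrete, this is the kind of target scaling I had in mind; it is plain scikit-learn and nothing mGBDT-specific (`net`, `x_train`, `y_train`, `x_test`, `y_test` as in the setup sketch above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# standardize the regression target, as one would for a neural network,
# and map predictions back to the original scale for evaluation
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))

net.fit(x_train, y_train_scaled, n_epochs=50)

y_pred_scaled = net.forward(x_test)
y_pred = y_scaler.inverse_transform(np.asarray(y_pred_scaled).reshape(-1, 1)).ravel()
print("MAE:", np.mean(np.abs(y_pred - y_test)))
```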
Additional Questions
- Given that you have mostly been using your approach for classification tasks, is there anything we need to change before using it for regression tasks, apart from the PyTorch loss?
- Besides the loss of `F`, can we also track how well the target propagation is working by evaluating the reconstruction loss of `G`? (A sketch of what I mean follows these questions.)
- When using mGBDT with a linear output layer, would we expect to generally see better results compared to using xgboost for the output layer?
- What is the benefit of using a linear output layer compared to an xgboost layer?
- For training `F` and `G`, you are currently using the `MSELoss` for the xgboost models. Do you have any experience with modifying this loss?
- What is the effect of the number of iterations used for initializing the model before training?
- What is the relationship between the number of boosting iterations (for xgboost training) and the number of epochs (for MGBDT training)?
- In Section 4 of your paper you state: "The experiments for this section is mainly designed to empirically examine if it is feasible to jointly train the multi-layered structure proposed by this work. That is, we make no claims that the current structure can outperform CNNs in computer vision tasks." Does that mean your intention is not to outperform existing deep-learning models, say CNNs, or existing GBM models, like XGBoost, but rather to show that a decision-tree-based model can also be used to learn meaningful representations that can then be used for downstream tasks?
- Connected to the previous question: gradient boosting models are already very strong learners that obtain very good results in many applications. What would be your motivation for using multiple layers of such a model? Could it even happen that, because of the implicit error-correction mechanism of GBMs, training several of them leads to a drop in accuracy?
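Regarding the reconstruction-loss question above, this is roughly what I mean by tracking how well `G` inverts `F`. The attribute names `net.layers`, `layer.F`, and `layer.G` are hypothetical placeholders for however the layers are exposed internally, not the actual API of this repo:

```python
import numpy as np

def reconstruction_losses(net, X):
    """Per-layer reconstruction error of the inverse mappings.

    For each layer i that has an inverse mapping G_i, measure how well
    G_i(F_i(h)) recovers the layer input h. `net.layers`, `layer.F` and
    `layer.G` are assumed attribute names used for illustration only.
    """
    losses = []
    h = X
    for layer in net.layers:
        h_next = layer.F.predict(h)
        if layer.G is not None:
            h_rec = layer.G.predict(h_next)
            losses.append(float(np.mean((h_rec - h) ** 2)))
        h = h_next
    return losses
```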
Code
To reproduce the results, you can use the attached notebook.
@kingfengji I would highly appreciate your feedback. Many thanks.