random_forest_run

Predictions in Log-Cost-Space

Open mlindauer opened this issue 8 years ago • 3 comments

I hope that I correctly remember our discussion yesterday about the predictions in log-cost space since I forgot my notes in my office. @frank-hutter if anything is wrong, please correct me.

@sfalkner Frank explained yesterday how he implemented the prediction in log(cost) space in SMAC, and I don't know whether this is currently possible with the new RF. I hope you can help us here.

  • Train the RF using log(cost) values.
  • To get a marginalized prediction over instances:
    1. For each tree, compute a marginalized prediction using exp(log(cost)) of all values in the leaves -> one prediction pred_t per tree t in the original cost space.
    2. Compute the mean and variance of log(pred_t) across all trees t (so, again in log-space).
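The two steps above can be sketched in plain Python. This is only a rough illustration, not SMAC's or pyrfr's actual code: the function name `marginalized_log_prediction` and the nested-list layout `leaf_values[t][i]` (the log(cost) values stored in the leaf of tree t that instance i falls into) are assumptions made for the example. It also assumes that "marginalizing" within a tree means averaging the per-instance leaf means, which the description above does not spell out.

```python
import math
from statistics import fmean, pvariance


def marginalized_log_prediction(leaf_values):
    """Return (mean, variance) in log-space of the per-tree marginal predictions.

    leaf_values[t][i] is assumed (hypothetical layout) to hold the log(cost)
    values stored in the leaf of tree t reached by instance i.
    """
    log_tree_preds = []
    for per_instance in leaf_values:
        # Step 1: back-transform to cost space, average within each leaf,
        # then average over instances -> one cost-space prediction per tree.
        instance_means = [
            fmean([math.exp(v) for v in leaf]) for leaf in per_instance
        ]
        pred_t = fmean(instance_means)
        # Step 2 (first half): move the per-tree prediction back to log-space.
        log_tree_preds.append(math.log(pred_t))
    # Step 2 (second half): mean and population variance across trees.
    return fmean(log_tree_preds), pvariance(log_tree_preds)


# Toy input: 2 trees, 2 instances each.
toy = [
    [[math.log(1.0), math.log(3.0)], [math.log(2.0)]],
    [[math.log(4.0)], [math.log(4.0)]],
]
mean_log, var_log = marginalized_log_prediction(toy)
```

Whether the variance across trees should be the population or the sample variance is also left open above; the sketch uses the population variance.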

How can we compute this with the RF? Is it possible using the python interface? Would it be inefficient to do it in Python? Can it be done within C++?

mlindauer · Apr 04 '17 06:04

IIRC, you marginalize over instances yourself right now anyway. You can use the all_leaf_values method to get the actual values stored in the corresponding leaf of each tree and iterate over those to compute anything you like. I don't quite understand why you want the final mean and variance prediction in log-space again, though. I don't think you would gain a lot by doing this in C++. If that is a major use case and you want to do a lot of predictions this way, it would be more efficient to handle the transforms during fitting so that the marginalization is fast. That would require some C++ coding, but could be done if it turns out to improve your model quality and to be too slow in Python. My concern is still the constant change between log and non-log space and how that affects the RF predictions. You still train it on log data, so you assume a log-normal distribution, but you want to marginalize the 'normal' values, of which you will take the log again. I don't know if that's what you actually want...
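To make the concern about switching between log and non-log space concrete, here is a small sketch with made-up numbers (not taken from any real run). Averaging in log-space and taking the log of a cost-space average generally give different answers; by Jensen's inequality, the second is never smaller than the first.

```python
import math
from statistics import fmean

# Two hypothetical log(cost) values sitting in one leaf.
log_costs = [math.log(1.0), math.log(100.0)]

# Averaging directly in log-space: the log of the geometric mean (log 10).
mean_of_logs = fmean(log_costs)

# Back-transforming, averaging, then taking the log again:
# the log of the arithmetic mean (log 50.5).
log_of_mean = math.log(fmean([math.exp(v) for v in log_costs]))

gap = log_of_mean - mean_of_logs  # non-negative by Jensen's inequality
```

The wider the spread of costs within a leaf, the larger this gap becomes, so the two ways of aggregating can rank configurations differently.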

sfalkner · Apr 04 '17 07:04

I forgot to mention, I updated the tests/pyrfr_example.py file to show how the all_leaf_values method is used.

sfalkner · Apr 04 '17 07:04

Thank you for the thorough explanation. Right now, my goal is to have a reimplementation of the old SMAC. At some point, we should discuss with Frank whether this is the best way to do it and how we can evaluate alternatives.

mlindauer · Apr 04 '17 07:04