Refactor surrogates in blackbox repository
Currently, surrogates may return inconsistent metric curves (e.g., elapsed_time not monotonic w.r.t. fidelity). It is also unclear how seed is treated in a surrogate.
Will use multivariate regression natively supported in scikit-learn. We already use that w.r.t. num_objectives. The input of the model will be the HP config only. The old way can still be used, but it won't be the default.
Will also sort out the situation with seed.
Could you explain why multivariate regression would solve the monotonicity issue? This part is not clear to me.
Regarding the seed, this information is not used in the sense that all evaluations are used to estimate the surrogate (which is a point predictor at the moment). I am also not sure what you mean by sorting out the situation with the seed.
Hi David, multivariate regression as built into sklearn (NOT one regressor per output) maps x to vectors y, using e.g. a forest of trees. In the leaves of the trees, you have a number of y_i's, and the prediction is the average of those. The tree is built by splitting the data w.r.t. attributes of x, but using a distance between y vectors (likely the squared norm).
If a property holds for all y_i's and is retained under convex combinations, it also holds for all predictions. Monotonicity is such a property, and so is positivity.
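To make this concrete, here is a minimal sketch (not the Syne Tune code; data and names are made up) of multi-output regression with a scikit-learn forest, where each training target is a full monotone curve over fidelities:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
num_configs, num_dims, num_fidelities = 200, 3, 10

X = rng.rand(num_configs, num_dims)
# Synthetic elapsed_time curves: cumulative sums of positive increments,
# hence monotone w.r.t. fidelity
Y = np.cumsum(rng.rand(num_configs, num_fidelities) + 0.1, axis=1)

# Multi-output regression: y in fit() is a matrix, one column per fidelity
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, Y)

# Each prediction is a convex combination (average) of training curves,
# so monotonicity w.r.t. fidelity carries over to the predicted curves
Y_pred = model.predict(rng.rand(5, num_dims))
assert np.all(np.diff(Y_pred, axis=1) > 0)
```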
This is also cheaper, because the number of datapoints is no longer multiplied by the number of fidelities, so subsampling is not needed.
It may, of course, also work less well, this is why I am leaving all the current code in there, so folks can choose.
As for seed: yes, I see, you merge data across seeds in order to fit the surrogate to all of it. I am just putting in an option to keep seeds separate. But the default will be what it is right now.
Retaining seeds in the surrogate is useful in order to replicate the variations coming in through different seeds (as each trial typically picks a different seed).
The current code already uses multivariate regression w.r.t. num_objectives, so the y in fit is already a matrix with >1 column. So this should all work, also with XGBoost.
This is actually pretty elegant code in BlackboxSurrogate.
Thanks for the explanation. I guess you meant to use a specific regressor such as a tree method; then I agree you would have some guarantee (it does not hold true if we used an MLP, which is what I had not understood).
Regarding the seed, I am not sure I understand what you mean. Currently, all data points are put together in the supervised dataset, so if you have two seeds, you would have two training examples. Do you mean to change the estimation problem so that a map from num_hyperparameter_dim to num_objectives x num_seeds is learned?
It seems to me that if we were to include seeds in the surrogate, then we should have probabilistic models that sample from a distribution when queried.
No, the alternative is to fit one model per seed, only on the data for that seed. If you have 4 seeds, you get 4 models, each trained on 1/4 of the complete data. But merging the data across seeds will still be the default.
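A minimal sketch of the two options, with made-up helper names (X_by_seed / Y_by_seed map each seed to its arrays):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_merged(X_by_seed, Y_by_seed):
    # Default: merge data across seeds into one supervised dataset
    X = np.concatenate(list(X_by_seed.values()))
    Y = np.concatenate(list(Y_by_seed.values()))
    return RandomForestRegressor().fit(X, Y)

def fit_per_seed(X_by_seed, Y_by_seed):
    # Alternative: one model per seed, each trained only on that seed's data
    return {
        seed: RandomForestRegressor().fit(X_by_seed[seed], Y_by_seed[seed])
        for seed in X_by_seed
    }
```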
Of course, makes sense, thanks. I also think that merging seeds should be the default, for efficiency reasons.
You are right, an MLP does not have that property.
This is a bit stuck. I discovered that if benchmark_dehb experiments with lcbench are repeated with RandomForestRegressor instead of 1-NN, results are very poor.
If [old], [pc=True], [pc=False] denote the old code and the new code with predict_curves=X, then:
- 1-NN gives the same results in all 3 cases, but [pc=True] runs faster
- RandomForestRegressor gives the same results for [old] and [pc=False], but is even worse for [pc=True] (though faster)
TODO: Need to first understand and fix issues with RandomForestRegressor.
One simple thing to try is to map elapsed_time -> time_per_resource before fitting a model, and reverse the mapping after prediction. This curve should be easier to fit for methods that rely on targets in order to split up the input space.
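A sketch of that transform, assuming elapsed_time curves are rows of a matrix with one column per fidelity (function names are made up):

```python
import numpy as np

def to_time_per_resource(elapsed_time):
    # Differences along the fidelity axis: time spent per unit of resource
    return np.diff(elapsed_time, axis=1, prepend=0.0)

def to_elapsed_time(time_per_resource):
    # Inverse mapping, applied after prediction: the cumulative sum
    # restores an elapsed_time curve that is monotone by construction
    # (as long as predicted per-resource times are nonnegative)
    return np.cumsum(time_per_resource, axis=1)
```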
OK, I implemented the mapping elapsed_time -> time_per_resource. Results for RandomForestRegressor are quite a bit better than without it, but results with 1-NN are still quite a bit better.
I leave this for now, but this clearly needs further investigation. It may even be that the task becomes too simple with 1-NN?
Relabeling this one, as it is not a bug, but seems to be a general issue with surrogates. I am leaving it open, because there are still some things I'd like to do here.
OK, PR #405 fixes the most obvious problems, while in general we need to be careful with the accuracy of surrogates.