BlendSearch: How do config constraints affect search performance?
As I can read in the docs, when passing a valid argument for the config_constraints parameter of flaml.tune.run, a hyperparameter configuration that violates any of the imposed constraints is not evaluated, so no processing time is wasted, which is great.
However, looking at the BlendSearch code, I can see that it instead assigns an infinite loss to such a configuration.
So, from a theoretical standpoint, given that the global search is based on a regressor with a smoothness prior (I think you mentioned BlendSearch using a Tree-structured Parzen Estimator, which, as per my limited understanding of the paper, uses truncated Gaussian mixture models), wouldn't that cause configurations that are topologically adjacent to the edge of the feasible region to also be estimated to have very low performance and, thus, decrease the probability of them being sampled?
I am aware that robust regressors should have some capacity to model kinks (discontinuities) in the surface they estimate, but I wanted to ask how much extra cost would be incurred for edge configurations to be taken into account, especially since for some problems the optima may well lie right on the edge of feasibility.
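For concreteness, here is a minimal pure-Python sketch (not FLAML's actual implementation) of the behavior I am describing: a config violating any constraint receives an infinite loss and is never actually evaluated.

```python
import math

# Minimal sketch (not FLAML's actual code) of penalizing constraint
# violations: a config violating any constraint gets an infinite loss
# instead of being evaluated.
def constrained_loss(config, config_constraints, objective):
    for func, op, threshold in config_constraints:
        value = func(config)
        satisfied = {
            "<": value < threshold,
            "<=": value <= threshold,
            ">": value > threshold,
            ">=": value >= threshold,
        }[op]
        if not satisfied:
            return math.inf  # penalize instead of evaluating
    return objective(config)

# Example: enforce a < b, expressed as a - b < 0
constraints = [(lambda c: c["a"] - c["b"], "<", 0)]
objective = lambda c: (c["b"] - c["a"]) ** 2
```

The regressor then sees math.inf (or some large surrogate) at infeasible points, which is exactly the "artificial data point" concern above.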
On a side note, I can see that Luigi Nardi's HyperMapper package trains a separate classification model to anticipate whether a particular parameter configuration will yield valid results (mentioned, e.g., here). I believe they apply it to save run time when metric_constraints are used (i.e., to avoid running evaluations that are sure to violate the constraints), but I can also see the advantage of a separate model in that it prevents contaminating the error-surface model with artificial data points that stem from feasibility considerations.
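To illustrate the separate-model idea with a toy sketch (this is purely illustrative and not HyperMapper's actual model): a 1-nearest-neighbour classifier predicts feasibility from past trials, so infeasible observations never enter the loss model at all.

```python
# Toy 1-nearest-neighbour feasibility classifier (illustrative only; not
# HyperMapper's actual implementation). Past trials predict whether a
# candidate config is likely feasible, keeping the loss model free of
# artificial penalty points.
def predict_feasible(history, candidate):
    # history: list of (config_tuple, feasible_bool); candidate: config_tuple
    if not history:
        return True  # no evidence yet: optimistically evaluate
    nearest = min(
        history,
        key=lambda item: sum((x - y) ** 2 for x, y in zip(item[0], candidate)),
    )
    return nearest[1]

history = [((1, 5), True), ((8, 2), False)]
```

The point of the design is separation of concerns: the feasibility model absorbs the discontinuity at the constraint boundary, while the loss model only ever fits genuinely evaluated points.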
Also, I noticed, when enabling such constraints in my own model with a very restricted run time, that infeasible hyperparameter configurations are only tried right at the beginning of the optimization session, so I believe the optimizer may have learned to avoid them altogether. The question in the main post above thus translates to whether an overly aggressive avoidance strategy for infeasible configs also has the side effect of discouraging sampling on the edge of feasibility.
This potential issue can be circumvented to a certain degree using conditional sampling, but that's not very handy either (#816, #988). E.g., I've been trying to enforce an order (inequality) constraint between two randints, and it's not easy without the potentially harmful config_constraints and without adding a categorical choice for each possible value (choice ignores topology, too, which is not helpful).
BlendSearch uses both model-based and model-free search. The model-free search won't be discouraged from sampling near the edge. Have you observed any undesirable behavior when using config_constraints?
Well, it's hard for me to measure on my problem, where I don't know whether or not good solutions lie on the feasibility edge; that's why I'm asking - this kind of issue should be more apparent in a controlled environment, where the optima of the function being optimized are known beforehand and function evaluation is perhaps not as expensive.
In all fairness, the point you're making is correct - the local search might somewhat alleviate a potential shortcoming of the global search in this respect. However, I'm asking whether this hypothesis (i.e., that the local search addresses the problem sufficiently) has been verified in practice, perhaps on a problem designed to highlight this capability (maybe within the research paper).
To provide some context:
I'm trying to sample from a space which looks, for our intents and purposes, something like "sample random integers a and b between 0 and 10 such that a < b". Now, I've tried several ways to do this, but to little avail:
- using config_constraints - I'm a bit concerned about performance (as explained in this issue), as I don't want to avoid trying, e.g., high values of a or low values of b independently altogether, but only in conjunction, and I'm not sure that the TPE is able to model this efficiently
- a conditional search space like 'b': flaml.tune.randint(1, 10), 'a': ray.tune.sample_from(lambda spec: flaml.tune.randint(0, b)) (this can alternatively be rewritten to sample a between 0 and 9 and b between a and 10); this crashes as explained in this comment of #988
- sampling b between 1 and 10, and then sampling a as a float between 0 and 1 that serves as a multiplier, i.e., the final value for a is obtained by multiplying the sampled float with b and rounding to the nearest integer. However, this has the drawback that multiple float values map to the same integer, so you end up with many duplicate experiments. One way to alleviate this would be to store results for rounded values in ray's object storage (i.e., using ray.put and ray.get), but that wouldn't remove the problem entirely, because nothing prevents starting an identical trial (in parallel) before the original trial finishes.
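For what it's worth, the multiplier reparameterization from the last bullet can be sketched in plain Python (illustrative, outside any tuner), using flooring instead of rounding so that a < b holds by construction:

```python
import random

def sample_pair(rng):
    # Reparameterization sketch: draw b first, then a as a fraction of b.
    # Flooring via int() keeps a in [0, b-1], so a < b by construction.
    b = rng.randint(1, 10)
    frac = rng.random()  # float in [0, 1)
    a = int(frac * b)
    return a, b

rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(1000)]
```

The many-floats-to-one-integer collapse described above is visible here too: all frac values in [k/b, (k+1)/b) yield the same a, which is what produces the duplicate trials.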
So, if you maybe have a different suggestion on how to do this properly, please give me an idea, as I'm all out :).
Perhaps I should try passing a custom pre-defined ray.OptunaSearcher instance, with the embedded space defined in terms of clean ray.tune APIs instead of using flaml.tune.
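The appeal of the define-by-run style (as in Optuna's trial.suggest_* API) is that later suggestions can depend on values already drawn in the same trial. A rough pure-Python mimic of the idea (the class and method names here are illustrative, not Optuna's actual objects):

```python
import random

# Rough mimic of define-by-run sampling (illustrative; not Optuna's actual
# Trial class): each suggestion can use previously drawn values as bounds.
class Trial:
    def __init__(self, rng):
        self.rng = rng
        self.params = {}

    def suggest_int(self, name, low, high):
        value = self.rng.randint(low, high)  # inclusive bounds
        self.params[name] = value
        return value

def sample_space(trial):
    b = trial.suggest_int("b", 1, 10)
    trial.suggest_int("a", 0, b - 1)  # conditional upper bound: a < b
    return trial.params
```

This enforces the order constraint at sampling time, so no configurations need to be rejected or penalized afterwards.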
config_constraints is indeed used in the experiments in this paper: FLAML: A Fast and Lightweight AutoML Library. CFO is used as the search method, not TPE.