BlendSearch: How do config constraints affect search performance?

Open · bbudescu opened this issue 1 year ago • 6 comments

As I read in the docs, when a valid argument is passed for the config_constraints parameter of flaml.tune.run, any hyperparameter configuration that violates one of the imposed constraints is not evaluated at all, so no processing time is wasted on it, which is great.
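For reference, here is a minimal sketch of how I understand such a constraint is passed (the evaluate objective and the order_gap helper are hypothetical, made up for illustration, and I'm assuming randint's upper bound is exclusive, as in ray.tune):

```python
import flaml.tune

def evaluate(config):
    # hypothetical toy objective standing in for a real, expensive evaluation
    return {"loss": (config["a"] - 3) ** 2 + (config["b"] - 7) ** 2}

def order_gap(config):
    # hypothetical constraint function: its return value is compared against
    # the bound below; a - b <= -1 enforces a < b for integers
    return config["a"] - config["b"]

analysis = flaml.tune.run(
    evaluate,
    config={
        "a": flaml.tune.randint(0, 11),  # a in [0, 10]
        "b": flaml.tune.randint(0, 11),  # b in [0, 10]
    },
    metric="loss",
    mode="min",
    num_samples=100,
    # configurations violating the constraint are not evaluated
    config_constraints=[(order_gap, "<=", -1)],
)
```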

However, when looking in the BlendSearch code I can see that it assigns an infinite loss to that configuration.

So, from a theoretical standpoint, given that the global search is based on a regressor with a smoothness prior (I think you mentioned BlendSearch using a Tree-structured Parzen Estimator which, as per my limited understanding of the paper, uses truncated Gaussian mixture models), wouldn't that cause configurations that are topologically adjacent to the edge of the feasible region to also be estimated to have very low performance and, thus, decrease the probability of them being sampled?

I mean, I am aware that robust regressors should have some capacity to model kinks (discontinuities) in the surface they are estimating, but I wanted to ask how much extra cost would be incurred for the edge configurations to be taken into account, especially since for some problems the optima have a high chance of lying right on the edge of feasibility.

bbudescu commented Apr 04 '23 10:04

On a side note, I can see that Luigi Nardi's HyperMapper package trains a separate classification model to anticipate whether a particular parameter configuration will yield valid results (mentioned, e.g., here). Now, I believe they apply it to optimize run time when metric_constraints are used (i.e., to avoid running evaluations that are sure to violate the constraints), but I can see the advantage of using a separate model: it prevents contaminating the error-surface model with artificial data points that stem from feasibility considerations.
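To illustrate the general idea of a separate feasibility model (just a sketch of the technique, not HyperMapper's actual implementation; the names and the toy history are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy history of evaluated configs and whether each turned out feasible
X_hist = np.array([[2, 8], [9, 1], [0, 5], [7, 3]])  # rows are (a, b) pairs
y_feasible = np.array([1, 0, 1, 0])                  # 1 = constraints satisfied

# the feasibility classifier is trained separately from the loss-surface
# model, so infeasible points never inject artificial losses into the latter
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_hist, y_feasible)

def probably_feasible(config, threshold=0.5):
    # skip candidate configs the classifier deems unlikely to be feasible
    p = clf.predict_proba([[config["a"], config["b"]]])[0, 1]
    return p >= threshold
```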

Also, I noticed when enabling such constraints on my own problem, with a very restricted run time, that infeasible hyperparameter configurations are tried only right at the beginning of the optimization session, and thus I believe it's possible that the optimizer has learned to avoid them altogether. The question in the main post above would, thus, translate to whether a too-aggressive avoidance strategy for infeasible configs also has the side effect of discouraging sampling on the edge of feasibility.

bbudescu commented Apr 04 '23 13:04

This potential issue can be circumvented to a certain degree using conditional sampling, but that's not very handy either (#816, #988). E.g., I've been trying to enforce an order (inequality) constraint between two randints, and it's not very easy without the potentially harmful config_constraints and without adding a categorical choice for each possible value (and categorical choices ignore topology, too, which is not helpful); a sketch of the latter workaround follows below.
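For concreteness, the enumeration workaround would look something like this (a sketch; it scales poorly with the range size, and the choice sampler sees the pairs as unordered categories):

```python
import flaml.tune

# enumerate every feasible (a, b) pair with 0 <= a < b <= 10
feasible_pairs = [(a, b) for b in range(1, 11) for a in range(0, b)]

# a single categorical over all feasible combinations; note that the
# sampler ignores the numeric ordering of the pairs entirely
space = {"ab": flaml.tune.choice(feasible_pairs)}
```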

bbudescu commented Apr 10 '23 11:04

BlendSearch uses both model-based and model-free search. The model-free search won't be discouraged from sampling near the edge. Have you observed any undesirable behavior when using config_constraints?

sonichi commented Apr 10 '23 14:04

Well, it's hard for me to measure this on my problem, where I don't know whether or not good solutions lie on the feasibility edge, and that's why I'm asking: this kind of issue should be more apparent in a controlled environment, where the optima of the function being optimized are known beforehand and, perhaps, function evaluation is not as expensive.

In all fairness, the point you're making is correct - the local search might somewhat alleviate a potential shortcoming of the global search in this respect. However, I'm asking whether this hypothesis (i.e., that the local search is able to address this problem sufficiently) has been verified in practice, perhaps on a problem designed to highlight this capability (maybe within the research paper).

To provide some context:

I'm trying to sample from a space which looks, for all intents and purposes, something like "sample random integers a and b between 0 and 10 such that a < b". Now, I've tried several ways to do this, but to little avail:

  • using config_constraints - I'm a bit concerned about performance (as explained in this issue), since I don't want to avoid trying, e.g., high values of a or low values of b altogether, only in conjunction, and I'm not sure that the TPE is able to model this efficiently
  • a conditional search space like 'b': flaml.tune.randint(1, 10), 'a': ray.tune.sample_from(lambda spec: np.random.randint(0, spec.config.b)) (this can alternatively be rewritten to sample a between 0 and 9 and b between a and 10); this crashes as explained in this comment of #988.
  • sample b between 1 and 10, and then sample a float between 0 and 1 to serve as a multiplier, i.e., the final value for a is obtained by multiplying the sampled float by b and rounding to the nearest integer. However, this has the drawback that multiple float values map to the same integer, so you end up with many duplicate experiments. One way to alleviate this would be to store results for the rounded values in Ray's object store (i.e., using ray.put and ray.get), but that wouldn't remove the problem entirely, because nothing prevents an identical trial from being started (in parallel) before the original trial finishes. A sketch of this decoding appears right after this list.
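Here is what I mean by the third approach (a sketch; the decode helper is made up, and it also clamps the rounded value so that a < b holds strictly):

```python
import flaml.tune

space = {
    "b": flaml.tune.randint(1, 11),          # b in [1, 10], assuming an exclusive upper bound
    "a_frac": flaml.tune.uniform(0.0, 1.0),  # float multiplier for b
}

def decode(config):
    # map (b, a_frac) back to integers; many a_frac values collapse to the
    # same a after rounding, which is where the duplicate trials come from
    b = config["b"]
    a = min(round(config["a_frac"] * b), b - 1)  # clamp so a < b strictly
    return a, b
```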

So, if you maybe have a different suggestion on how to do this properly, please give me an idea, as I'm all out :).

bbudescu commented Apr 11 '23 10:04

Perhaps I should try passing a custom pre-defined OptunaSearch instance, with the embedded space defined in terms of clean ray.tune APIs instead of using flaml.tune.
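Something along these lines might work (a sketch, assuming Ray Tune's OptunaSearch accepts an Optuna define-by-run space function, which can express the a < b dependency natively; the evaluate objective is a made-up toy):

```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch

def define_space(trial):
    # define-by-run: a's upper bound depends on the already-sampled b
    b = trial.suggest_int("b", 1, 10)
    trial.suggest_int("a", 0, b - 1)  # guarantees a < b by construction

def evaluate(config):
    # hypothetical toy objective standing in for the real evaluation
    return {"loss": (config["a"] - 3) ** 2 + (config["b"] - 7) ** 2}

searcher = OptunaSearch(space=define_space, metric="loss", mode="min")
analysis = tune.run(evaluate, search_alg=searcher, num_samples=50)
```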

bbudescu commented Apr 11 '23 11:04

config_constraints is indeed used in the experiments in the paper FLAML: A Fast and Lightweight AutoML Library. CFO is used as the search method there, not TPE.

sonichi commented Jun 04 '23 00:06