
Documentation: penalty for metric constraint violation

Open bbudescu opened this issue 11 months ago • 5 comments

FLAML offers metric_constraints, which is supposed to guide the optimizer toward narrowing the search space (as much as possible) to a region specified by the user. Here appears to be the main place in the code where metric constraint handling is implemented.

As per my understanding of the code, this is the basic idea of the algorithm:

  • the optimizer is driven away from infeasible regions by applying a very high penalty to the metric used as the optimization objective
  • it's desirable for the penalty to be proportional to the distance from the feasible region, so that even outside the bounding box there is still a gradient for the optimizer to follow, steering the optimization back inside the user-specified bounds
  • because the constrained metrics can have different magnitudes, which are not known beforehand, the penalty for each metric should be updated based on observed values until it provides enough of an incentive for the optimizer to avoid the infeasible region

To this end, I think the steps taken look something like this (as much as I could understand the code):

  • in the beginning, until the first config is found that satisfies the constraints, a huge penalty (D x 10^10) is applied to the objective value
  • as per the above formula, the penalty is also proportional to D, which represents by how much the user-specified thresholds were missed (D is the difference between the metric value reported in a trial's results and the constraint the user specified for that metric)
  • once a feasible config is found (i.e., one that doesn't violate any constraints), the penalty coefficient is reduced from 10^10 to 1. This means the score of the next infeasible config after some feasible ones have been found will only be penalized by D1 instead of D1 x 10^10
  • every time a constraint is violated, the penalty coefficient is increased by D, so the score of the second infeasible config will be penalized by (1 + D1) x D2, the third by (1 + D1 + D2) x D3, the fourth by (1 + D1 + D2 + D3) x D4, and so on. The coefficient grows until 10^10 is reached again
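The steps above can be sketched roughly as follows. This is only my simplified reading of the described behavior, not FLAML's actual implementation; the class and variable names are illustrative:

```python
# Hypothetical sketch of the penalty schedule described above.
PENALTY_CAP = 1e10  # the "huge" initial (and ceiling) coefficient


class ConstraintPenalizer:
    def __init__(self):
        self.coef = PENALTY_CAP  # start with the huge penalty
        self.found_feasible = False

    def penalized_objective(self, objective, metrics, constraints):
        # D = total violation: how far each metric overshoots its threshold
        # (constraints maps metric name -> upper bound, lower is better)
        violation = sum(
            max(0.0, metrics[name] - bound) for name, bound in constraints.items()
        )
        if violation == 0.0:
            if not self.found_feasible:
                # first feasible config: drop the coefficient from 10^10 to 1
                self.coef = 1.0
                self.found_feasible = True
            return objective
        penalized = objective + self.coef * violation
        if self.found_feasible:
            # each violation grows the coefficient by D, back up to the cap
            self.coef = min(self.coef + violation, PENALTY_CAP)
        return penalized
```

With this schedule, the first infeasible config after feasibility is penalized by 1 x D1, the second by (1 + D1) x D2, and so on, matching the progression described above.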

Now, the above procedure looks interesting and seems to follow sound desiderata, but has anyone actually verified empirically that it works? Is there a theoretical proof that it converges to the feasible regions? Is there a risk of the penalties exploding and getting too large? How does this affect the optimizer's affinity for exploring configurations on the edge of feasibility (which might well be the best-performing ones)?

In other words, is there a scientific paper in which this procedure is described in further detail? Or at least some place in the docs where it is addressed?

bbudescu avatar Aug 01 '23 14:08 bbudescu

I'm glad this question was asked. I implemented this some time ago. I put some effort into the design to enable blackbox constraints in our new BlendSearch method. Some of the ideas are inspired by ADMM, such as the increase of the penalty. This is my best effort at integrating those ideas into our framework in a practical way. One empirical study is in https://arxiv.org/pdf/2208.02922.pdf, in which @YIWEI-CHEN also demonstrates that when combined with early stopping, the performance can be further improved. It'll be nice to do more evaluation or theoretical work.

sonichi avatar Aug 01 '23 15:08 sonichi

It's really cool that you were able to write a paper on this.

Have you, by any chance, also been able to compare your method against the one employed by HyperMapper (here's an example, and check params scalarization_method and weight sampling under this section of the docs)? They say that it's based on this paper, which, in turn, cites this one with respect to the issue of focusing search on a bounding box within metric space (not sure, but there's a chance the now defunct Dragonfly library also implemented the same methods. Dragonfly was coded by authors of the first paper above). Also, I think something similar is known as SAMO (this should provide a summary).

Now, that approach looks a bit different, as it is expressed within the framework of scalarization (combining multiple metrics into a single one), which is done using some combination (linear or not) of the individual metric values, each assigned a weight beforehand. The weights are chosen randomly, but the distributions they are sampled from are modulated by the user thresholds. I didn't understand their papers in detail, but I think the idea should still be similar, in that configs violating the constraints probably end up being assigned a value signifying low performance. Not sure which of them helps the optimizer converge to the optimum faster, though.
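As a rough illustration of the random-scalarization idea (a generic sketch under simplifying assumptions, not HyperMapper's actual algorithm; the uniform weight sampling and the violation handling here are my own placeholders):

```python
import random


def random_linear_scalarization(metrics, thresholds, rng=random):
    """Collapse several metrics (lower is better) into one scalar using
    randomly sampled weights. Configs violating the user thresholds get
    their violating terms inflated so they score poorly.
    Generic sketch only -- not HyperMapper's implementation."""
    weights = [rng.random() for _ in metrics]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to a convex combination
    score = 0.0
    for w, m, t in zip(weights, metrics, thresholds):
        if m > t:
            # violated threshold: penalize proportionally to the overshoot
            score += w * (m + (m - t) * 1e3)
        else:
            score += w * m
    return score
```

Repeating the optimization with freshly sampled weights is what lets such methods trace out different parts of the Pareto front.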

bbudescu avatar Aug 01 '23 16:08 bbudescu

> One empirical study is in https://arxiv.org/pdf/2208.02922.pdf, in which @YIWEI-CHEN also demonstrates that when combined with early stopping, the performance can be further improved.

Is this implemented in flaml? Perhaps it's the native flaml scheduler?

bbudescu avatar Aug 02 '23 11:08 bbudescu

> One empirical study is in https://arxiv.org/pdf/2208.02922.pdf, in which @YIWEI-CHEN also demonstrates that when combined with early stopping, the performance can be further improved.
>
> Is this implemented in flaml? Perhaps it's the native flaml scheduler?

It's not implemented in flaml yet. @YIWEI-CHEN might be able to help if you are interested in trying it.

sonichi avatar Aug 02 '23 15:08 sonichi

> One empirical study is in https://arxiv.org/pdf/2208.02922.pdf, in which @YIWEI-CHEN also demonstrates that when combined with early stopping, the performance can be further improved.
>
> Is this implemented in flaml? Perhaps it's the native flaml scheduler?

Hi @bbudescu,

I am the author of the paper. I am implementing the paper's ideas in FLAML's scheduler. To begin with, I will add the stratum early-stopping approach to FLAML and release an LGBM example with a fairness constraint (equalized odds difference). May I know what your use case for the constraint scheduler is? It would be great if you could illustrate it with a notebook, if possible. Considering your requirements during development would also benefit the community. Thank you.
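For readers unfamiliar with the metric mentioned above, equalized odds difference can be computed along these lines. This is a hypothetical helper roughly mirroring fairlearn's metric of the same name, not FLAML code:

```python
import numpy as np


def equalized_odds_difference(y_true, y_pred, group):
    """Largest gap between groups in true-positive rate or false-positive
    rate. Hypothetical helper, roughly mirroring fairlearn's metric of the
    same name; assumes every group has members in both classes."""
    gaps = []
    for positive_class in (1, 0):  # TPR gap (y_true == 1), FPR gap (y_true == 0)
        rates = [
            np.mean(y_pred[(group == g) & (y_true == positive_class)] == 1)
            for g in np.unique(group)
        ]
        gaps.append(max(rates) - min(rates))
    return max(gaps)
```

A value of 0 means both groups receive positive predictions at the same rates for both classes; a constraint like "equalized odds difference below some threshold" would then fit naturally into the metric_constraints mechanism discussed earlier.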

YIWEI-CHEN avatar Aug 23 '23 00:08 YIWEI-CHEN