Improve runtime measures for criterion plot and benchmarking plots
Current Situation / Problem you want to solve
The proposal in this issue concerns the functions criterion_plot, profile_plot and convergence_plot.
- The criterion_plot uses the number of function evaluations (n_evaluations) as its runtime measure.
- The profile_plot and convergence_plot have a runtime_measure argument that lets the user switch between n_evaluations, n_batches, and walltime.
Each runtime measure serves a purpose:
- walltime: Measures how long it actually takes to achieve a certain progress. This is what a user ultimately cares about in their optimization problem.
- n_evaluations: Measures how many evaluations of the objective function it takes to achieve a certain progress. This makes it possible to ignore optimizer overhead and to use fast benchmark functions to judge the performance of an optimizer that is designed for expensive objective functions. Moreover, it is deterministic and reproducible across machines.
- n_batches: Similar to n_evaluations. In addition, it makes it possible to simulate the performance of a parallel optimizer on small machines.
n_evaluations and n_batches measure important aspects but also have a big drawback: they count only evaluations of the objective function and ignore all time spent on evaluating derivatives. This is not a problem as long as only derivative-free or only derivative-based optimizers are compared. But as soon as one compares a derivative-free with a derivative-based optimizer, the comparison becomes misleading.
Describe the solution you'd like
Step 1: Introduce new runtime measures
All relevant functions will get a runtime_measure argument which can be:
- "function_time" (default): The time spent in evaluations of the user provided functions fun, jac, and fun_and_jac. Similar to n_evaluations, this ignores the overhead of calculations done in the optimizer.
- "batch_function_time": The time that would have been spent in evaluations of user provided functions if all evaluations of the same batch were done in parallel (without parallelization overhead).
- "walltime": The actual time spent (reflecting actual optimizer overheads, parallelization overheads, ...).
We also keep the legacy measures "n_evaluations" and "n_batches".
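To make the difference between the timing measures concrete, here is a minimal sketch of how "function_time" and "batch_function_time" could be computed from a logged history. The `Evaluation` record, its fields (`task`, `batch`, `duration`), and both helper functions are hypothetical and only serve as illustration; they are not part of the existing API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical history entry for one call to a user provided function."""

    task: str  # "fun", "jac" or "fun_and_jac"
    batch: int  # evaluations with the same batch id could run in parallel
    duration: float  # measured time of this call in seconds


def function_time(history: list[Evaluation]) -> float:
    """Total time spent in user provided functions, ignoring optimizer overhead."""
    return sum(e.duration for e in history)


def batch_function_time(history: list[Evaluation]) -> float:
    """Time under perfect parallelization: each batch costs its slowest call."""
    slowest = defaultdict(float)
    for e in history:
        slowest[e.batch] = max(slowest[e.batch], e.duration)
    return sum(slowest.values())


history = [
    Evaluation("fun", batch=0, duration=0.5),
    Evaluation("fun", batch=0, duration=0.25),
    Evaluation("jac", batch=1, duration=2.0),
]
print(function_time(history))        # 2.75
print(batch_function_time(history))  # 2.5
```

"walltime" cannot be derived from such a history alone because it also contains optimizer and parallelization overhead.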
Step 2: Introduce an optional cost model
While "function_time" and "batch_function_time" allow us to ignore optimizer overhead, they are neither deterministic nor comparable across machines. To achieve this, we optionally allow a user to pass a CostModel as runtime_measure. Using a CostModel makes it possible to reproduce all existing measures except for walltime. Moreover, it yields reproducible and hardware-agnostic runtime measures for almost any situation.
A cost model looks as follows:
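from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    fun: float | None = None
    jac: float | None = None
    fun_and_jac: float | None = None
    label: str | None = None

    def aggregate_batch_times(self, times: list[float]) -> float:
        # Default: no parallelization, so the times of a batch add up.
        return sum(times)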
The attributes fun, jac, and fun_and_jac allow a user to provide runtimes of the user provided functions. Those could be actual times in seconds or normalized values (e.g. 1 for fun). None means that the actual measured runtime is used.
The attribute label is used as x-axis label in plots.
The method aggregate_batch_times takes a list of times (which might be measured runtimes or replaced times based on the other attributes) and returns a scalar value. The default implementation assumes that no parallelization is used.
To see the cost model in action, let's reproduce a few existing measures:
# Counts only evaluations of the objective function.
n_evaluations_cost_model = CostModel(fun=1, jac=0, fun_and_jac=0, label="evaluations of the objective function")

# Uses measured runtimes for all user provided functions.
function_time_cost_model = CostModel(label="seconds")


@dataclass(frozen=True)
class PerfectParallelizationCostModel(CostModel):
    def aggregate_batch_times(self, times: list[float]) -> float:
        # A batch takes as long as its slowest evaluation.
        return max(times)


n_batches_cost_model = PerfectParallelizationCostModel(fun=1, jac=0, fun_and_jac=0, label="batch evaluations of the objective function")
The zero values for jac and fun_and_jac make the problems of n_evaluations and n_batches very apparent.
Potential variations
- aggregate_batch_times could be a callable attribute so users don't have to subclass CostModel to change it.
- Instead of an enum for the runtime measures, we could implement subclasses that capture the special cases and let users pass a CostModel subclass or instance (similar to how algorithms are passed).
- The legacy cases n_batches and n_evaluations could be deprecated and only be available by using the CostModel.
Questions
- Do we need multiple cost models for the plots that do multiple optimizations (e.g. profile_plot and convergence_plot)?
Very nice proposal! :tada:
This definitely fills a small but relevant gap. Some comments:
- I'd prefer aggregate_batch_times to be a callable attribute so that users don't have to define custom classes. I think in many scenarios, lambda functions will suffice.
- I like the idea of defining special cases using CostModel instances (given the above point) instead of using enums. The UX should be roughly the same. I am unsure whether I would go as far as allowing CostModel classes to be passed instead of instances. I don't see the need for that. I have something like this in mind:

      ...
      runtime_measure = om.runtime_measure.FUNCTION_TIME,
      ...

  or

      ...
      runtime_measure = om.runtime_measure.CostModel(
          aggregate_batch_times=lambda times: max(times)
      )
      ...

- I also believe that the legacy cases could be deprecated, especially since they could still be reconstructed via CostModel. We could add a remark on how to do that in the docs.
Regarding your question, I am unsure whether I understand it correctly. If, for example, I have a benchmark with two functions that have different runtimes of their derivative, I could use the "function_time" runtime measure, or not? And for a profile_plot we would have one function and different optimizers; I would've suspected that here again, "function_time" should work for a fair comparison?
Yes, function time would be a fair comparison but it is hardware specific and not fully reproducible. In benchmarking you often want to get reproducible results and potentially even compare benchmark results generated on different computers. So we need the CostModel solution to work for benchmarks as well and unfortunately there could be cases where each problem has a different cost model.
Hi, is this still relevant? Can I work on this?
Hi @spline2hg, we are already working on this in #553, and it is too hard for a first-time contributor. We'll upload more issues with the "good first issue" tag in the next few days. We started with #556.
Thanks for the update! We'll try to work on #561 and keep an eye out for more good first issues. Looking forward to contributing!