
Why is the Optuna CMA-ES sampler better than my custom cmaes code?

FlorinAndrei opened this issue on Jan 06 '24 · 12 comments

Summary of the Question

I have a dataset with features in the X dataframe and the target in the y series. I am trying to select the features in X such that, when fitting a model, I reach the optimum of some objective function. Specifically, the model is linear regression and the objective is BIC (Bayesian Information Criterion): I'm trying to select the features in X so as to minimize the model's BIC.

X has a very large number of features, so an exhaustive search of all feature combinations is not feasible (with just 50 features there are already 2^50 ≈ 10^15 subsets). But if you create a list of binary flags [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, ...], one flag for each feature (0 means the feature is not selected, 1 means it is selected), then the problem becomes one of hyperparameter optimization: find the binary flag values that minimize BIC by selecting the best features.

I've tried to use Optuna with CmaEsSampler(). I've also tried the cmaes library directly.

For some reason, Optuna with CmaEsSampler() finds a better solution (lower BIC) and does not get stuck, but it is slow. The cmaes library, at least in my implementation, finds a slightly worse solution (slightly higher BIC) and appears to get stuck in a local minimum, but it iterates much faster.

I would like to use cmaes directly because it's so much faster, but I can't make it overcome the local minimum. What am I missing?

Detailed Explanation

Optuna code:

import copy
import random

import optuna
import statsmodels.api as sm


def fs_objective(trial, X, y, features):
    features = copy.deepcopy(features)
    random.shuffle(features)  # randomize the order in which the flags are suggested
    features_use = ['const'] + [f for f in features if trial.suggest_int(f, 0, 1) == 1]
    lin_mod = sm.OLS(y, X[features_use], hasconst=True).fit()
    return lin_mod.bic


features_select = [f for f in X.columns if f != 'const']
fs_sampler = optuna.samplers.CmaEsSampler(n_startup_trials=1, seed=0, with_margin=True)
study = optuna.create_study(sampler=fs_sampler, study_name=study_name, direction='minimize')  # study_name is defined elsewhere in the notebook
study.optimize(lambda trial: fs_objective(trial, X, y, features_select))

My cmaes code, inspired by this example: https://github.com/CyberAgentAILab/cmaes/blob/main/examples/cmaes_with_margin_binary.py

import numpy as np
import statsmodels.api as sm
from cmaes import CMAwM
from tqdm import tqdm


def cma_objective(fs):
    features_use = ['const'] + [f for i, f in enumerate(features_select) if fs[i] == 1]
    lin_mod = sm.OLS(y, X[features_use], hasconst=True).fit()
    return lin_mod.bic


features_select = [f for f in X.columns if f != 'const']
cma_bounds = np.tile([0, 1], (len(features_select), 1))  # each flag is bounded to [0, 1]
cma_steps = np.ones(len(features_select))                # step of 1 makes every variable integer (binary)
optimizer = CMAwM(mean=np.zeros(len(features_select)), sigma=2.0, bounds=cma_bounds, steps=cma_steps, seed=0)
pop_size = optimizer.population_size

gen_max = 10000
best_value = np.inf
best_gen = 0
best_sol_raw = None
history_values = np.full((gen_max,), np.nan)
history_values_best = np.full((gen_max,), np.nan)

for generation in tqdm(range(gen_max)):
    best_value_gen = np.inf
    sol = []
    solutions = []
    vals = np.full((pop_size,), np.nan)

    for i in range(optimizer.population_size):
        fs_for_eval, fs_for_tell = optimizer.ask()
        solutions.append(fs_for_eval)
        value = cma_objective(fs_for_eval)
        vals[i] = value
        sol.append((fs_for_tell, value))
    optimizer.tell(sol)

    best_value_gen = vals.min()
    if best_value_gen < best_value:
        best_value = best_value_gen
        best_gen = generation
        best_sol_raw = solutions[np.argmin(vals)]
        print(f'gen: {best_gen:5n}, new best objective: {best_value:.4f}')
    history_values[generation] = best_value_gen
    history_values_best[generation] = best_value

    if optimizer.should_stop():
        break
gen_completed = generation

Full code: this is the notebook with all the code, both Optuna and cmaes, along with other things I've attempted and all required context (data loading, etc.):

https://github.com/FlorinAndrei/feature_selection/blob/main/feature_selection.ipynb

Context and Environment

  • Python 3.11.7
  • cmaes 0.10.0
  • optuna 3.5.0
  • Ubuntu 22.04

Additional Information

Optuna history:

[Optuna optimization history plot]

cmaes history:

[cmaes optimization history plot]

Why is there so much more variance in the Optuna trials? Why is it able to maintain that variance across many trials? It seems like the Optuna code would continue to find even better combinations if I let it run longer.

FlorinAndrei · Jan 06 '24

@FlorinAndrei Thank you for using the cmaes library!

Even when you have specified CmaEsSampler in Optuna, it may fall back to the TPESampler when given a search space that is unsuitable for CMA-ES, such as a dynamic search space. In such cases, Optuna outputs a warning message to the console. Please check the console for that warning message.
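
To illustrate what a dynamic search space looks like, here is a hypothetical toy objective (not from this thread): the set of suggested parameters changes between trials, so CMA-ES cannot cover the extra parameters and Optuna samples them with a fallback sampler instead, emitting the warning mentioned above. The search space in your fs_objective appears to be static, since every feature flag is suggested in every trial even though the order is shuffled:

def dynamic_objective(trial):
    # hypothetical example of a *dynamic* search space: the parameter "b"
    # exists only in trials where a == 1, so the set of parameters differs
    # from trial to trial
    a = trial.suggest_int("a", 0, 1)
    if a == 1:
        b = trial.suggest_int("b", 0, 1)
        return a + b
    return a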

c-bata · Jan 08 '24

In addition to @c-bata's comment, using CMA-ES with Margin for problems with only binary variables is not appropriate. In fact, Fig. 4 in this paper shows that other methods such as (1+1)-CMA-ES with Margin (code) and (1+1)-EA perform better than CMA-ES with Margin on binary-only problems.
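
For reference, a (1+1)-EA on this kind of binary feature-selection problem can be sketched roughly as follows. This is only an illustration (it reuses cma_objective and features_select from the question's code above, and the bit-flip mutation rate of 1/n is just the textbook default), not code from the paper:

import numpy as np

rng = np.random.default_rng(0)
n = len(features_select)
mask = rng.integers(0, 2, size=n)        # random initial feature subset (0/1 flags)
best = cma_objective(mask)

for _ in range(20000):
    # standard bit-flip mutation: flip each flag independently with probability 1/n
    flips = rng.random(n) < 1.0 / n
    child = np.where(flips, 1 - mask, mask)
    value = cma_objective(child)
    if value <= best:                    # (1+1) elitist acceptance: keep the child if not worse
        mask, best = child, value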

nomuramasahir0 · Jan 08 '24

@c-bata If I specify the TPESampler explicitly in Optuna, it gets stuck very quickly in a local optimum on this problem and does not make much progress after that. Also, I have not seen any message from Optuna indicating that it switches the sampler. In fact, I've tried all the samplers available in Optuna, and the best performance I get is with CMA-ES, which is what prompted me to use your library directly.

@nomuramasahir0 With Optuna, I've tried the CMA-ES sampler with and without margin. Without margin, it makes essentially no progress. When I enable the margin, it performs as shown in the Optuna plot, which is the best performance I've seen yet for this problem.

So I am still unable to explain or replicate the good performance I get from Optuna with the CMA-ES sampler.

I will try the code example you've indicated.

FlorinAndrei · Jan 08 '24

@c-bata @nomuramasahir0 I looked at the CMA-ES sampler in the Optuna code and compared it with my code. I've debugged the CmaEsSampler() class while invoking it from Optuna, and peeked at the internal variables.

  • it appears to use CMAwM
  • the mean is initialized to 0.5, which makes sense for a binary problem
  • sigma is 1/6
  • n_max_resampling is 10 × the number of features
  • but most importantly, it re-seeds the optimizer's random generator at every step

With these changes applied, my code beats everything I've seen so far for this binary selection problem, including Optuna with CMA-ES, and a genetic algorithm I've tried, in terms of the best objective function value found. And it's very fast. My new code is included at the end.

I will try (1+1)-CMA-ES with Margin later today and I'll compare it with what I have so far.

New code:

import numpy as np
import statsmodels.api as sm
from cmaes import CMAwM
from tqdm import tqdm


def cma_objective(fs):
    features_use = ['const'] + [f for i, f in enumerate(features_select) if fs[i] == 1]
    lin_mod = sm.OLS(y, X[features_use], hasconst=True).fit()
    return lin_mod.bic


features_select = [f for f in X.columns if f != 'const']
cma_bounds = np.tile([0, 1], (len(features_select), 1))
cma_steps = np.ones(len(features_select))
n_max_resampling = 10 * len(features_select)
optimizer = CMAwM(
    mean=np.full(len(features_select), 0.5),  # center of the binary hypercube
    sigma=1 / 6,
    bounds=cma_bounds,
    steps=cma_steps,
    n_max_resampling=n_max_resampling,
)
pop_size = optimizer.population_size

gen_max = 1000
best_value = np.inf
best_gen = 0
best_sol_raw = None
history_values = np.full((gen_max,), np.nan)
history_values_best = np.full((gen_max,), np.nan)
rs = np.random.RandomState(seed=0)

for generation in tqdm(range(gen_max)):
    best_value_gen = np.inf
    sol = []
    solutions = []
    vals = np.full((pop_size,), np.nan)

    for i in range(optimizer.population_size):
        seed = rs.randint(1, 2**16) + generation
        optimizer._rng.seed(seed)  # re-seed the optimizer's internal RNG (a private attribute), as the Optuna sampler appears to do
        fs_for_eval, fs_for_tell = optimizer.ask()
        solutions.append(fs_for_eval)
        value = cma_objective(fs_for_eval)
        vals[i] = value
        sol.append((fs_for_tell, value))
    optimizer.tell(sol)

    best_value_gen = vals.min()
    if best_value_gen < best_value:
        best_value = best_value_gen
        best_gen = generation
        best_sol_raw = solutions[np.argmin(vals)]
        print(f'gen: {best_gen:5n}, new best objective: {best_value:.4f}')
    history_values[generation] = best_value_gen
    history_values_best[generation] = best_value

    if optimizer.should_stop():
        break
gen_completed = generation

FlorinAndrei · Jan 08 '24

Thank you for the detailed explanation. I think I understand the situation now.

> With Optuna, I've tried the CMA-ES sampler with and without margin. Without margin, it makes essentially no progress.

Yes, CMA-ES with Margin is clearly better than vanilla CMA-ES (i.e., without the margin), but CMA-ES with Margin was not originally developed for binary-only problems, and its effectiveness on such problems has not been verified so far.

I'm not sure, but performance may improve if you re-initialize the optimizer itself (i.e., initialize the mean vector randomly and reset the step-size), as in BIPOP-CMA-ES (example code), whenever a restart occurs (i.e., .should_stop() returns True).
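
A rough sketch of that restart idea, reusing cma_objective, cma_bounds, cma_steps and gen_max from the code above (it only illustrates re-initializing the mean and step-size whenever .should_stop() returns True, not the full BIPOP restart schedule with growing population sizes):

import numpy as np
from cmaes import CMAwM

rng = np.random.default_rng(0)
n = len(features_select)

def new_optimizer():
    # fresh optimizer with a random mean vector and the initial step-size
    return CMAwM(
        mean=rng.uniform(0.0, 1.0, size=n),
        sigma=1 / 6,
        bounds=cma_bounds,
        steps=cma_steps,
    )

optimizer = new_optimizer()
best_value, best_sol = np.inf, None
for generation in range(gen_max):
    sols = []
    for _ in range(optimizer.population_size):
        x_eval, x_tell = optimizer.ask()
        value = cma_objective(x_eval)
        if value < best_value:
            best_value, best_sol = value, x_eval
        sols.append((x_tell, value))
    optimizer.tell(sols)
    if optimizer.should_stop():
        optimizer = new_optimizer()      # restart instead of stopping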

> I will try (1+1)-CMA-ES with Margin later today and I'll compare it with what I have so far.

Great. I look forward to seeing the new result.

nomuramasahir0 · Jan 08 '24

I've tried (1+1)-CMA-ES. It's tricky to do a direct comparison with CMA-ES with Margin, because the (1+1) code is structured quite differently (the generation count is not exposed, etc.) and I didn't have time to dig too deeply into it. But I've set it up so that both codes run in about the same time, approximately 1 min 30 sec on my Ryzen 7 5800X3D machine.

Bottom line is: (1+1) is not able to find the same best value. It's actually worse than sequential search. It's quite possible I made a mistake, since the (1+1) library is not written for general consumption.

This is a minimization problem - lower objective is better. Here are the performance results from various algorithms:

  • sequential feature selection (forward direction): 33708.98602877906
  • genetic algorithm: 33705.569572544795
  • Optuna with CmaEsSampler(): 33703.616274004984
  • CMA-ES with margin (my code, see previous comment): 33703.0705
  • (1+1)-CMA-ES (new code, see below): 33710.03390792892

Comparison between CMA-ES with margin and (1+1)-CMA-ES. I've adjusted the X axis so that it represents actual wall-clock time (as opposed to generations, etc.):

[plot: best objective value over wall-clock time, CMA-ES with margin vs. (1+1)-CMA-ES]
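
One way to get such a time axis is to record a wall-clock timestamp per generation and plot the best objective against elapsed seconds instead of the generation index; a minimal sketch around the CMAwM loop shown earlier (same variables as above):

import time

start = time.perf_counter()
history_times = np.full((gen_max,), np.nan)
for generation in range(gen_max):
    # ... ask / evaluate / tell, exactly as in the CMAwM loop above ...
    history_times[generation] = time.perf_counter() - start
# then plot history_times on the X axis and history_values_best on the Y axis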

(1+1)-CMA-ES code:

import numpy as np
import statsmodels.api as sm

import cma.objective_function.binary as f_bin
import cma.optimizer.cmaeswm_elitist as cma_e
import cma.util.sampler as sampler
import cma.util.log as logger
from cma.objective_function.base import *

features_select = [f for f in X.columns if f != 'const']
dimensions = len(features_select)
# max_eval = dimensions ** 3
max_eval = 20000
discrete_space = np.tile(np.arange(0, 2, 1), (dimensions, 1))  # each variable takes values in {0, 1}


class Cma1p1Obj(ObjectiveFunction):
    minimization_problem = True

    def __init__(self, d, target_eval=None, max_eval=1e4):
        self.d = d
        if target_eval is None:
            target_eval = self.d
        super(Cma1p1Obj, self).__init__(target_eval, max_eval)

    def __call__(self, fs):
        self.eval_count += len(fs)
        features_use = ['const'] + [f for i, f in enumerate(features_select) if fs[0, i] == 1]
        lin_mod = sm.OLS(y, X[features_use], hasconst=True).fit()
        evals = np.full((1), lin_mod.bic)
        self._update_best_eval(evals)
        return evals


objective = Cma1p1Obj(dimensions, max_eval=max_eval)
init_m, init_sigma = f_bin.initial_setting_for_gaussian(objective)
s = sampler.Sampler(objective, 1)
cma1p1_log_file = 'cmaes-bin-1p1.csv'
l = logger.DataLogger(cma1p1_log_file)

optimizer = cma_e.CMAESwM_elitist(
    d=dimensions,
    discrete_space=discrete_space,
    sampler=s,
    m=init_m,
    sigma=init_sigma,
    margin=None,
    min_problem=objective.minimization_problem,
    postprocess=True,
)

_, best_result, _ = optimizer.run(s, logger=l, verbose=False)
print(f'best result: {best_result}')

If you want to try it yourself, this is the repo, run the main notebook, it only takes a few minutes, I've disabled everything not related to CMA:

https://github.com/FlorinAndrei/feature_selection

FlorinAndrei · Jan 09 '24

Thank you for the interesting result. The reason may be that the problem is multimodal. Basically, (1+1)-CMA-ES is a more local method, so it is fast on unimodal (i.e., easy) problems, but it may struggle on multimodal (difficult) problems. On the other hand, CMA-ES can optimize multimodal problems with global structure if its hyperparameters (e.g., population size or learning rate) are set to appropriate values.
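
If it helps, a minimal sketch of trying a larger population with CMAwM, assuming it accepts a population_size argument like the plain CMA class (and reusing cma_bounds and cma_steps from the earlier snippets):

import numpy as np
from cmaes import CMAwM

n = len(features_select)
default_popsize = 4 + int(3 * np.log(n))  # the usual CMA-ES default population size
optimizer = CMAwM(
    mean=np.full(n, 0.5),
    sigma=1 / 6,
    bounds=cma_bounds,
    steps=cma_steps,
    population_size=2 * default_popsize,  # assumed keyword; a larger population for a multimodal landscape
)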

By the way, feature selection is an interesting application of CMA-ES. I'll check it when I have some time.

nomuramasahir0 · Jan 09 '24

@nomuramasahir0 Thank you for your explanation. That indeed makes sense of what I'm seeing here.

I will add the plain CMA code to my stash of code snippets for feature selection.

I am planning to write an article (or maybe a series of articles) for Towards Data Science, on the topic of feature selection, and cmaes will be one of the tools I will present in the articles.

Thank you for your help!

FlorinAndrei · Jan 09 '24

Awesome. Please let me know when the article is published.

nomuramasahir0 · Jan 10 '24

@nomuramasahir0 @c-bata My CMA-ES article went live 30 minutes ago on TDS.

Article: https://towardsdatascience.com/efficient-feature-selection-via-cma-es-covariance-matrix-adaptation-evolution-strategy-ee312bc7b173

Repository: https://github.com/FlorinAndrei/fast_feature_selection

FlorinAndrei · Jan 12 '24

Hi @FlorinAndrei, our paper, now available on arXiv, cites your article as a noteworthy application of CMA-ES with Margin (Ref. [6] in the paper). Thank you.

Masahiro Nomura, Masashi Shibata. cmaes: A Simple yet Practical Python Library for CMA-ES. https://arxiv.org/abs/2402.01373

nomuramasahir0 · Feb 05 '24

Awesome, thanks!

FlorinAndrei · Feb 05 '24