
tuner with max_model_size not skipping oversized models

Open DavidBSauer opened this issue 5 years ago • 13 comments

When using max_model_size, I find the tuner repeatedly tries the same oversized model and then errors out. This does not match the expected behavior, since the warning message says the oversized model will be skipped.

Some dummy code to recreate the issue:

from tensorflow import keras
from kerastuner.tuners import RandomSearch

def modelbuilder(hp):
    model = keras.Sequential()
    model.add(keras.layers.Dense(hp.Int('width_1', 2, 20, step=1, sampling='linear'),
                                 input_shape=[100], activation='linear', name='Dense_1'))
    model.add(keras.layers.Dense(1, activation='linear', name='output'))
    optimizer = keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mse'])
    return model

...
tuner = RandomSearch(modelbuilder, objective='val_loss', max_trials=100,
                     max_model_size=1800, overwrite=True)
tuner.search(x=train_data, y=train_target, epochs=3,
             validation_data=(valid_data, valid_target), verbose=0)

Giving an output of:

[Trial complete] [Trial summary] Hp values: |-width_1: 11 |-Score: 0.08742512226104736 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 12 |-Score: 0.09093187510967254 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 8 |-Score: 0.09126158863306046 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 14 |-Score: 0.10759384512901306 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 6 |-Score: 0.08583792209625245 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 10 |-Score: 0.09473302498459817 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 3 |-Score: 0.13883694425225257 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 13 |-Score: 0.08331702768802643 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 7 |-Score: 0.10881250619888305 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 17 |-Score: 0.09528512597084045 |-Best step: 0
[Warning] Oversized model: 2041 parameters -- skipping
[Warning] Oversized model: 2041 parameters -- skipping
[Warning] Oversized model: 2041 parameters -- skipping
[Warning] Oversized model: 2041 parameters -- skipping
[Warning] Oversized model: 2041 parameters -- skipping
[Warning] Oversized model: 2041 parameters -- skipping

DavidBSauer avatar Dec 03 '19 21:12 DavidBSauer

@DavidBSauer Thanks for the issue!

Agreed, it's weird that it seems to be trying the same HP combination multiple times; will take a look at this.

omalleyt12 avatar Jan 06 '20 18:01 omalleyt12

I notice this quits after 6 oversized models. Is the maximum number of oversized models hard-coded? Related to my previous issue (https://github.com/keras-team/keras-tuner/issues/173), I expect the number of right-sized models to be few relative to the total parameter space, so it would be handy to continue despite multiple oversized models.

DavidBSauer avatar Jan 09 '20 21:01 DavidBSauer

@DavidBSauer it's hardcoded here right now, but we could change that

I'm not sure what to do about max_model_size in general. IMO the implementation needs to be rethought a bit. There are two problems:

  1. The correct implementation would have to report back to the Oracle class that the trial was deemed invalid, and then request new hyperparameters. The current implementation, as surfaced by your issue, is just trying the same Model a few times and then failing

  2. We can't compute the size when subclassing tf.keras.Model until the model has been built with some data (for Functional Models and Sequential Models we know the size ahead of time because they're less dynamic); see the small example after this list
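
To illustrate (2), here is a minimal example (editor's illustration): a subclassed model has no parameter count until it is first called on data:

import tensorflow as tf

class SubclassedModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(10)

    def call(self, inputs):
        return self.dense(inputs)

model = SubclassedModel()
print(model.built)           # False -- no weights exist yet
model(tf.zeros([1, 100]))    # weights are created on the first call
print(model.count_params())  # 1010 -- only knowable after building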

(1) and (2) seem to argue for moving the max_model_size functionality into the Tuner.on_batch_end method, and calling tuner.oracle.end_trial(...) with a status of infeasible if the Model is oversized

To your point, I don't see any downside of letting the search continue while just skipping however many models happen to be oversized. As long as we display appropriate error messages, and skip the models quickly, it shouldn't run the risk of wasting a lot of user resources. I'd probably prefer that over adding another argument to the Tuner about how many oversized models to allow

WDYT, would that match with your use case and expectations?

omalleyt12 avatar Jan 09 '20 21:01 omalleyt12

Actually, we'd have to put it in Tuner.on_epoch_end, since Keras models only check whether they should stop training at the end of each epoch, and we also wouldn't want to slow down every batch. That would significantly slow down searches with a large number of oversized Models, though. It might make sense to just solve (1) and not worry about (2) for now.
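
A rough sketch of how ending an oversized trial as infeasible at the epoch boundary could look (editor's illustration, not keras-tuner's actual implementation; it assumes the Tuner stores max_model_size and uses the 1.0-era trial_module.TrialStatus.INVALID internals referenced later in this thread):

import numpy as np
import tensorflow as tf
from kerastuner.engine import trial as trial_module
from kerastuner.tuners import RandomSearch

class SizeCheckingTuner(RandomSearch):
    """Editor's sketch: mark oversized trials INVALID at the epoch boundary."""

    def on_epoch_end(self, trial, model, epoch, logs=None):
        # The parameter count is only available once the model is built.
        size = int(np.sum([tf.keras.backend.count_params(w)
                           for w in model.trainable_weights]))
        if self.max_model_size is not None and size > self.max_model_size:
            # Report the trial as infeasible so the Oracle hands out new
            # hyperparameters instead of retrying the same combination.
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.INVALID)
            model.stop_training = True  # Keras checks this flag once per epoch
        else:
            # Normal path: report metrics to the Oracle as usual.
            super().on_epoch_end(trial, model, epoch, logs=logs)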

omalleyt12 avatar Jan 09 '20 22:01 omalleyt12

Yes, just removing the hardcoded limit and letting it continue until max_trials is hit would work for me. That way, as long as a single valid model exists, the search will finish (even if it skips many models).

DavidBSauer avatar Jan 09 '20 23:01 DavidBSauer

I am assuming max_trials is comparing against the number of evaluated models, not including models skipped due to being oversized. Is this correct?

DavidBSauer avatar Jan 10 '20 16:01 DavidBSauer

Looks like this may be the culprit behind the AutoKeras OOM crashes I've been facing. It is a huge problem for me at the moment. Any further updates on this issue? https://github.com/keras-team/autokeras/issues/1078

ghost avatar Apr 08 '20 06:04 ghost

Any updates on this, especially (1) (quoted below)? I'm having a related issue: some of the models end up being too large and I get an OOM error, which halts tuning. When I try running the search again, it tries the same combination, so I keep hitting the error message and can't move past it to a new combination.

  1. The correct implementation would have to report back to the Oracle class that the trial was deemed invalid, and then request new hyperparameters. The current implementation, as surfaced by your issue, is just trying the same Model a few times and then failing

JenineJ avatar Jul 09 '20 02:07 JenineJ

I am stuck with the same issue. I am unable to include max_model_size in the tuning process. Is anyone aware of a solution? Below is the runtime error. Thanks.

Oversized model: 1364002 parameters -- skipping
Oversized model: 1364002 parameters -- skipping
Oversized model: 1364002 parameters -- skipping
Oversized model: 1364002 parameters -- skipping
Oversized model: 1364002 parameters -- skipping
Oversized model: 1364002 parameters -- skipping

RuntimeError: Too many consecutive oversized models.

satvik-venkatesh avatar Dec 17 '20 08:12 satvik-venkatesh

Same problem here. As far as I can tell, max_model_size just trades one kill switch for another: instead of AutoKeras running out of memory, it now just quits if it doesn't pick a model that fits within the maximum model size.

I have four 2080 Ti GPUs; if I can't use this with 256x256 images, is anyone having any success?

leedrake5 avatar Apr 01 '21 18:04 leedrake5

Hey everyone,

I have created a dirty fix that can be used with models created with either the Keras Sequential or Functional API (i.e. you should at least be able to use the Keras backend method count_params on your model). This dirty fix can be used while we wait for a general solution to be envisioned and developed.

The basic idea is, once you have instantiated a high-level tuner class (e.g. BayesianOptimization), to override a few methods inherited from the parent classes: the Tuner class's _build_and_fit_model method and the BaseTuner class's on_trial_end method.

import numpy as np
import tensorflow as tf
from kerastuner.engine import trial as trial_module
from kerastuner.tuners import bayesian

# CUSTOM_MAX_MODEL_SIZE is not defined in the original snippet (see the
# TO-DO below); a module-level constant is assumed here, e.g.:
CUSTOM_MAX_MODEL_SIZE = 1_000_000

class BayesianSearchEdit(bayesian.BayesianOptimization):
    """
    TO-DO: add custom max_model_size input param to class
    def __init__(self):
        pass
    """

    def on_trial_end(self, trial):
        """A hook called after each trial is run.
        # Arguments:
            trial: A `Trial` instance.
        """
        # Send status to Logger
        if self.logger:
            self.logger.report_trial_state(trial.trial_id, trial.get_state())

        if not trial.get_state().get("status") == trial_module.TrialStatus.INVALID:
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.COMPLETED)

        self.oracle.update_space(trial.hyperparameters)
        # Display needs the updated trial scored by the Oracle.
        self._display.on_trial_end(self.oracle.get_trial(trial.trial_id))
        self.save()

    def _build_and_fit_model(self, trial, fit_args, fit_kwargs):
        model = self.hypermodel.build(trial.hyperparameters)
        model_size = self.maybe_compute_model_size(model)
        print("Considering model with size: {}".format(model_size))

        if model_size > CUSTOM_MAX_MODEL_SIZE:
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.INVALID)

            dummy_history_obj = tf.keras.callbacks.History()
            dummy_history_obj.on_train_begin()
            dummy_history_obj.history.setdefault('val_loss', []).append(2.5)
            return dummy_history_obj

        return model.fit(*fit_args, **fit_kwargs)

    def maybe_compute_model_size(self, model):
        """Compute the size of a given model, if it has been built."""
        if model.built:
            params = [tf.keras.backend.count_params(p) for p in model.trainable_weights]
            return int(np.sum(params))
        return 0
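
A hypothetical usage sketch of the class above (editor's addition; the hypermodel and data names are placeholders reusing those from the top of this thread):

tuner = BayesianSearchEdit(
    modelbuilder,       # any hypermodel function, e.g. the one at the top
    objective="val_loss",
    max_trials=100,     # consider raising this, since INVALID trials are skipped
    overwrite=True,
)
tuner.search(x=train_data, y=train_target, epochs=3,
             validation_data=(valid_data, valid_target))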

This dirty fix essentially just skips over trials, so I advise increasing the number of trials to account for the lost ones.

Regards, Bram

Updates 04/05/2021 and 08/05/2021: To prevent Keras Tuner from crashing due to GPU Out Of Memory (OOM) exceptions, you can add exception handling around the model.fit call (only tested with TensorFlow 2.3.0 so far):

        # Inside _build_and_fit_model, replace the final `return model.fit(...)` with:
        try:
            return model.fit(*fit_args, **fit_kwargs)
        except (tf.errors.ResourceExhaustedError, tf.errors.InternalError):
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.INVALID)

            dummy_history_obj = tf.keras.callbacks.History()
            dummy_history_obj.on_train_begin()
            dummy_history_obj.history.setdefault('val_loss', []).append(2.5)
            return dummy_history_obj

These crashes can still happen (though less frequently when checking model size) because of the error margin in manual model-size calculations. In addition to the ResourceExhaustedError, the InternalError also has to be handled when a tf.distribute strategy is used during training, because TensorFlow may raise either of the two errors when the GPU is OOM.

After running the HyperBand tuner for a large number of trials, I discovered that the line model = self.hypermodel.build(trial.hyperparameters) was raising the RuntimeError as a result of consecutive GPU OOM errors. This was fixed by removing global Keras callbacks from the search method and including them locally in the fit_kwargs argument in the _build_and_fit_model method, e.g.:

        fit_kwargs["callbacks"].extend([
            tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0, patience=5, restore_best_weights=True)
        ])

bberlo avatar Apr 29 '21 17:04 bberlo

Any updates on this? The hack @bberlo proposed only works with Bayesian. It breaks with other Tuners such as RandomSearch:

Traceback (most recent call last):
...
295, in run_trial
    obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)
TypeError: TunerWithFixedMaxModelSize._build_and_fit_model() got an unexpected keyword argument 'x'

darrenrahnemoon avatar Mar 22 '23 23:03 darrenrahnemoon

  Any updates on this? The hack @bberlo proposed only works with Bayesian. It breaks with other Tuners such as RandomSearch:

This solution was tested with the Hyperband and Bayesian optimization tuners, on keras-tuner version 1.0.2. Since it's a dirty fix, keras-tuner provides no official support to keep it updated across breaking changes.
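
Editor's note: the traceback above suggests that newer keras-tuner versions forward the fit arguments as *args/**kwargs, so the override's signature would likely need adapting along these lines (untested sketch inferred from the traceback, not an officially supported fix):

    def _build_and_fit_model(self, trial, *args, **kwargs):
        # Newer run_trial implementations call
        # self._build_and_fit_model(trial, *args, **copied_kwargs),
        # so accept and forward them instead of the old fit_args/fit_kwargs pair.
        model = self.hypermodel.build(trial.hyperparameters)
        if self.maybe_compute_model_size(model) > CUSTOM_MAX_MODEL_SIZE:
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.INVALID)
            dummy_history_obj = tf.keras.callbacks.History()
            dummy_history_obj.on_train_begin()
            dummy_history_obj.history.setdefault('val_loss', []).append(2.5)
            return dummy_history_obj
        return model.fit(*args, **kwargs)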

bberlo avatar Mar 23 '23 06:03 bberlo