Bug: Ray Tune get_best_result('metric', mode='max', scope='all') returns the min
What happened + What you expected to happen
When I call `result.get_best_result('eval_accuracy', mode='max', scope='all')` I get the lowest metric score, not the max metric score.
Versions / Dependencies
Ubuntu 22, Python 3.11.8, Ray 2.12.0
Reproduction script
```python
# Note: `trainer` (the Ray Train trainer being tuned) and `perturbation_interval`
# are defined earlier in the full script; only the tuning setup is shown here.
from ray import train, tune
from ray.train import CheckpointConfig
from ray.tune import Tuner
from ray.tune.schedulers.pb2 import PB2

scheduler = PB2(
    time_attr="training_iteration",
    perturbation_interval=perturbation_interval,
    metric="eval_accuracy",
    mode="max",
    hyperparam_bounds={
        "train_loop_config": {
            "weight_decay": [0.0, 0.03],
            "learning_rate": [5e-5, 2e-3],
        }
    },
)

# Set up the Tuner
# api docs -> https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Tuner.html#ray.tune.Tuner
# ========================
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": 9e-4,
            "weight_decay": 0.03,
        }
    },
    run_config=train.RunConfig(
        name="pbt_mamba_finacial_phrasebank_sentiment",
        stop={"eval_accuracy": 0.96, "training_iteration": 25},
        checkpoint_config=CheckpointConfig(
            checkpoint_score_attribute="eval_accuracy",
            num_to_keep=7,
            checkpoint_score_order="max",
        ),
        storage_path="/files/tmp/ray_results",
    ),
    # see https://docs.ray.io/en/latest/tune/api/doc/ray.tune.TuneConfig.html for details
    tune_config=tune.TuneConfig(
        scheduler=scheduler,
        num_samples=4,
        reuse_actors=True,
    ),
)

result = tuner.fit()

# Finally we load the best checkpoint and save it as safetensors.
# ========================
best_result = result.get_best_result("eval_accuracy", mode="max", scope="all")
print('Best Result:\n==================')
print(f'{best_result}')
```
Issue Severity
High: It blocks me from completing my task.
Moving the metric and mode params from the PB2 scheduler to TuneConfig causes `result.get_best_result("eval_accuracy", mode="max", scope="all")` to return the best checkpoint only from the "last" checkpoints/scope, not from all of them. Calling `result.get_best_result("eval_accuracy", mode="max", scope="all")` still gets the best checkpoint from the "last" checkpoints, not the min, and not from all.
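For reference, this is roughly what that variant looks like (a sketch; the rest of the Tuner setup is unchanged from the reproduction script above):

```python
# Variant: metric/mode removed from the PB2 scheduler...
scheduler = PB2(
    time_attr="training_iteration",
    perturbation_interval=perturbation_interval,
    hyperparam_bounds={
        "train_loop_config": {
            "weight_decay": [0.0, 0.03],
            "learning_rate": [5e-5, 2e-3],
        }
    },
)

# ...and passed to TuneConfig instead
tune_config = tune.TuneConfig(
    metric="eval_accuracy",
    mode="max",
    scheduler=scheduler,
    num_samples=4,
    reuse_actors=True,
)
```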
The root cause seems to be in `_trial_to_result` (https://github.com/ray-project/ray/blob/54bdf9b7c55925877ceb9d2cce23eb7af52cc12f/python/ray/tune/result_grid.py#L258), which only ever looks at the latest checkpoint of the best trial rather than the best checkpoint of the best trial.
The best trial is properly returned from `result._experiment_analysis.get_best_trial(metric='eval_accuracy', mode='max', scope='all')`, which is called before `_trial_to_result` returns the object with the wrong checkpoint and accuracy.
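A quick way to see the mismatch from the driver script (a sketch; it assumes the `result` ResultGrid from the reproduction script and that every trial reported `eval_accuracy` each iteration):

```python
# What get_best_result(scope="all") currently reports
best_result = result.get_best_result("eval_accuracy", mode="max", scope="all")
print("reported best:", best_result.metrics["eval_accuracy"])

# The true maximum over every reported iteration of every trial
true_best = max(r.metrics_dataframe["eval_accuracy"].max() for r in result)
print("best across all iterations of all trials:", true_best)
```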
If someone can point me toward any relevant test code or contribution guidelines, that would be appreciated, and I can work on a pull request.
I suspect there may be more bugs, though, because this doesn't explain the min score returned when metric and mode are set on PB2 vs. TuneConfig, but this is the most obvious one.
@getorca You may want to call `result_grid.get_best_result(...).get_best_checkpoint(...)` to get the best checkpoint of the best trial. Otherwise, the default `checkpoint`/`metrics` attributes are just the latest, as you mentioned.
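Something along these lines (a sketch of the suggested call chain):

```python
# Best trial according to get_best_result, then the best checkpoint *within* that trial
best_result = result.get_best_result("eval_accuracy", mode="max", scope="all")
best_ckpt = best_result.get_best_checkpoint("eval_accuracy", mode="max")
print(best_ckpt)
```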
Not sure that will work; it seems like `get_best_result` is only returning the best trial based on the latest checkpoint, not the best checkpoint through history.
The following is a workaround:

```python
# Scan each trial's best checkpoint (per its checkpoint manager) and keep the overall best.
# Note: this reaches into private Ray Tune internals.
best_result = None
for idx, trial in enumerate(result._experiment_analysis.trials):
    tmp_result = trial.run_metadata.checkpoint_manager.best_checkpoint_result
    if best_result is None or tmp_result['eval_accuracy'] > best_result['eval_accuracy']:
        best_result = tmp_result
```
Hmm, `result_grid.get_best_result` just calls `experiment_analysis.get_best_trial(metric='eval_accuracy', mode='max', scope='all')` under the hood.
Look at the bottom of the `result_grid.get_best_result` function and what it returns: `self._trial_to_result`, which as best I can tell can only ever return the best checkpoint from the last iteration of each trial.
`result.best_checkpoints` should return a list of all the top checkpoints for a trial across all iterations. Maybe I'm misunderstanding something?
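For what it's worth, scanning those public attributes across trials would look something like this (a sketch; it assumes `eval_accuracy` is present in each kept checkpoint's metrics):

```python
# Global best over every trial's retained top-k checkpoints (ranked per CheckpointConfig)
best_metrics, best_ckpt = None, None
for trial_result in result:  # iterate the ResultGrid's per-trial Result objects
    for ckpt, metrics in trial_result.best_checkpoints:
        if best_metrics is None or metrics["eval_accuracy"] > best_metrics["eval_accuracy"]:
            best_metrics, best_ckpt = metrics, ckpt

print(best_metrics["eval_accuracy"], best_ckpt)
```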
Maybe (I can't remember all the funcs, especially with similar names), but the bug is that `result.get_best_result('eval_accuracy', mode='max', scope='all')` returns the best checkpoint from the last checkpoints for each trial. The expected behaviour when scope='all' is passed is that it gets the best metric/checkpoint from all iterations of all trials. https://docs.ray.io/en/latest/tune/api/doc/ray.tune.ResultGrid.get_best_result.html