Bug: Ray Tune get_best_result('metric', mode='max', scope='all') returns the min
What happened + What you expected to happen
When I call `result.get_best_result('eval_accuracy', mode='max', scope='all')` I get the lowest metric score, not the max metric score.
Versions / Dependencies
Ubuntu 22, Python 3.11.8, Ray 2.12.0
Reproduction script
```python
# Note: `trainer` (the Ray Train trainer being tuned) and `perturbation_interval`
# are defined earlier in the full script; only the tuning setup is shown here.
from ray import train, tune
from ray.train import CheckpointConfig
from ray.tune import Tuner
from ray.tune.schedulers.pb2 import PB2

scheduler = PB2(
    time_attr="training_iteration",
    perturbation_interval=perturbation_interval,
    metric="eval_accuracy",
    mode="max",
    hyperparam_bounds={
        "train_loop_config": {
            "weight_decay": [0.0, 0.03],
            "learning_rate": [5e-5, 2e-3],
        }
    },
)

# Set up the Tuner
# api docs -> https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Tuner.html#ray.tune.Tuner
# ========================
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": 9e-4,
            "weight_decay": 0.03,
        }
    },
    run_config=train.RunConfig(
        name="pbt_mamba_finacial_phrasebank_sentiment",
        stop={"eval_accuracy": 0.96, "training_iteration": 25},
        checkpoint_config=CheckpointConfig(
            checkpoint_score_attribute="eval_accuracy",
            num_to_keep=7,
            checkpoint_score_order="max",
        ),
        storage_path="/files/tmp/ray_results",
    ),
    # see https://docs.ray.io/en/latest/tune/api/doc/ray.tune.TuneConfig.html for details
    tune_config=tune.TuneConfig(
        scheduler=scheduler,
        num_samples=4,
        reuse_actors=True,
    ),
)

result = tuner.fit()

# Finally we load the best checkpoint and save it as safetensors.
# ========================
best_result = result.get_best_result("eval_accuracy", mode="max", scope="all")
print('Best Result:\n==================')
print(f'{best_result}')
```
Issue Severity
High: It blocks me from completing my task.
Moving the metric and mode params from the PB2 scheduler to TuneConfig causes `result.get_best_result("eval_accuracy", mode="max", scope="all")` to return the best checkpoint only from the "last" checkpoints/scope, not from all of them. Calling `result.get_best_result("eval_accuracy", mode="max", scope="all")` still gets the best checkpoint from the "last" checkpoints, not the min, and not from all.
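For reference, this is roughly what that variant looks like (a sketch; the rest of the Tuner setup is unchanged from the reproduction script above):

```python
# Variant: metric/mode removed from the PB2 scheduler...
scheduler = PB2(
    time_attr="training_iteration",
    perturbation_interval=perturbation_interval,
    hyperparam_bounds={
        "train_loop_config": {
            "weight_decay": [0.0, 0.03],
            "learning_rate": [5e-5, 2e-3],
        }
    },
)

# ...and passed to TuneConfig instead
tune_config = tune.TuneConfig(
    metric="eval_accuracy",
    mode="max",
    scheduler=scheduler,
    num_samples=4,
    reuse_actors=True,
)
```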
The root cause seems to be in `_trial_to_result` (https://github.com/ray-project/ray/blob/54bdf9b7c55925877ceb9d2cce23eb7af52cc12f/python/ray/tune/result_grid.py#L258), which only ever looks at the latest checkpoint of the best trial rather than the best checkpoint of the best trial.
The best trial is properly returned from `result._experiment_analysis.get_best_trial(metric='eval_accuracy', mode='max', scope='all')`, which is called before `_trial_to_result` returns the object with the wrong checkpoint and accuracy.
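A quick way to see the mismatch from the driver script (a sketch; it assumes the `result` ResultGrid from the reproduction script and that every trial reported `eval_accuracy` each iteration):

```python
# What get_best_result(scope="all") currently reports
best_result = result.get_best_result("eval_accuracy", mode="max", scope="all")
print("reported best:", best_result.metrics["eval_accuracy"])

# The true maximum over every reported iteration of every trial
true_best = max(r.metrics_dataframe["eval_accuracy"].max() for r in result)
print("best across all iterations of all trials:", true_best)
```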
If someone can point me toward any relevant test code or contribution guidelines, that would be appreciated, and I can work on a pull request.
I suspect there may be more bugs, though, because this doesn't explain the min score returned when metric and mode are set on PB2 vs. TuneConfig, but this is the most obvious one.
@getorca You may want to call `result_grid.get_best_result(...).get_best_checkpoint(...)` to get the best checkpoint of the best trial. Otherwise, the default `checkpoint`/`metrics` attributes are just the latest, as you mentioned.
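Something along these lines (a sketch of the suggested call chain):

```python
# Best trial according to get_best_result, then the best checkpoint *within* that trial
best_result = result.get_best_result("eval_accuracy", mode="max", scope="all")
best_ckpt = best_result.get_best_checkpoint("eval_accuracy", mode="max")
print(best_ckpt)
```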
Not sure that will work; it seems like `get_best_result` is only returning the best trial based on the latest checkpoint, not the best checkpoint through history.
The following is a workaround:

```python
# Scan each trial's best checkpoint (per its checkpoint manager) and keep the overall best.
# Note: this reaches into private Ray Tune internals.
best_result = None
for idx, trial in enumerate(result._experiment_analysis.trials):
    tmp_result = trial.run_metadata.checkpoint_manager.best_checkpoint_result
    if best_result is None or tmp_result['eval_accuracy'] > best_result['eval_accuracy']:
        best_result = tmp_result
```
Hmm, `result_grid.get_best_result` just calls `experiment_analysis.get_best_trial(metric='eval_accuracy', mode='max', scope='all')` under the hood.
Look at the bottom of the `result_grid.get_best_result` function and what it returns: `self._trial_to_result`, which as best I can tell can only ever return the best checkpoint from the last iteration of each trial.
`result.best_checkpoints` should return a list of all the top checkpoints for a trial across all iterations. Maybe I'm misunderstanding something?
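For what it's worth, scanning those public attributes across trials would look something like this (a sketch; it assumes `eval_accuracy` is present in each kept checkpoint's metrics):

```python
# Global best over every trial's retained top-k checkpoints (ranked per CheckpointConfig)
best_metrics, best_ckpt = None, None
for trial_result in result:  # iterate the ResultGrid's per-trial Result objects
    for ckpt, metrics in trial_result.best_checkpoints:
        if best_metrics is None or metrics["eval_accuracy"] > best_metrics["eval_accuracy"]:
            best_metrics, best_ckpt = metrics, ckpt

print(best_metrics["eval_accuracy"], best_ckpt)
```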
Maybe (I can't remember all the funcs, especially with similar names), but the bug is that `result.get_best_result('eval_accuracy', mode='max', scope='all')` returns the best checkpoint from the last checkpoints for each trial. The expected behaviour when scope='all' is passed is that it gets the best metric/checkpoint from all iterations of all trials. https://docs.ray.io/en/latest/tune/api/doc/ray.tune.ResultGrid.get_best_result.html