[tune] Error saving checkpoint based on nested metric score
What happened + What you expected to happen
I tried running a simple RL training with RLlib and set checkpoint_score_attr="evaluation/episode_reward_mean" in tune.run(). The training ran properly except for checkpoint saving, which produced this error message:
2022-08-09 07:20:15,138 ERROR checkpoint_manager.py:320 -- Result dict has no key: evaluation/episode_reward_mean. checkpoint_score_attr must be set to a key in the result dict. Valid keys are: ['evaluation', 'custom_metrics', 'episode_media', 'num_recreated_workers', 'info', 'sampler_results', 'episode_reward_max', 'episode_reward_min', 'episode_reward_mean', 'episode_len_mean', 'episodes_this_iter', 'policy_reward_min', 'policy_reward_max', 'policy_reward_mean', 'hist_stats', 'sampler_perf', 'num_faulty_episodes', 'num_healthy_workers', 'num_agent_steps_sampled', 'num_agent_steps_trained', 'num_env_steps_sampled', 'num_env_steps_trained', 'num_env_steps_sampled_this_iter', 'num_env_steps_trained_this_iter', 'timesteps_total', 'num_steps_trained_this_iter', 'agent_timesteps_total', 'timers', 'counters', 'done', 'episodes_total', 'training_iteration', 'trial_id', 'experiment_id', 'date', 'timestamp', 'time_this_iter_s', 'time_total_s', 'pid', 'hostname', 'node_ip', 'config', 'time_since_restore', 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time', 'perf', 'experiment_tag']
I saw that this behavior was previously reported (~~#14374~~ #14377) and resolved (~~#14375~~ #14379), but it has recurred. Apparently, this line somehow doesn't reflect the mentioned pull request.
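For context, Tune ships a helper that flattens nested result dicts with a "/" delimiter, which is presumably how a nested checkpoint_score_attr is meant to resolve. A minimal sketch, assuming ray.tune.utils.flatten_dict with its default "/" delimiter (the example values are made up):

from ray.tune.utils import flatten_dict

# A nested result dict shaped like the one RLlib reports.
result = {"evaluation": {"episode_reward_mean": 123.4}}

# flatten_dict joins nested keys with "/", so the nested metric
# becomes addressable under the flat key used in the repro below.
flat = flatten_dict(result, delimiter="/")
assert flat["evaluation/episode_reward_mean"] == 123.4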
Versions / Dependencies
ray[rllib]
ray==2.0.0rc0
Python 3.8.10
Tested on a headless server with a virtual display (probably irrelevant)
Reproduction script
I think the script in the old issue is still valid, but I tested with this similar script:
import ray
from ray import tune
from pyvirtualdisplay import Display

if __name__ == "__main__":
    ray.init()
    config = {
        "env": "CartPole-v1",
        "framework": "torch",
        "timesteps_per_iteration": 10,
        "evaluation_interval": 1,
        "evaluation_num_episodes": 1,
    }
    with Display(visible=False, size=(1400, 900)) as disp:
        analysis = tune.run(
            "DQN",
            stop={"num_env_steps_trained": 2000},
            config=config,
            num_samples=1,
            checkpoint_freq=1,
            keep_checkpoints_num=1,
            checkpoint_score_attr="evaluation/episode_reward_mean",
        )
    ray.shutdown()
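As a stopgap until a fix lands, one possible workaround is to mirror the nested metric at a top-level key via an RLlib callback and score on that key instead. This is only a sketch: the class name MirrorEvalReward and the key eval_episode_reward_mean are my own, and it assumes the Ray 2.0 DefaultCallbacks.on_train_result signature:

from ray.rllib.algorithms.callbacks import DefaultCallbacks

class MirrorEvalReward(DefaultCallbacks):
    # Copy the nested evaluation metric into a top-level key so that
    # checkpoint_score_attr can find it without a nested lookup.
    def on_train_result(self, *, algorithm, result, **kwargs):
        evaluation = result.get("evaluation") or {}
        if "episode_reward_mean" in evaluation:
            result["eval_episode_reward_mean"] = evaluation["episode_reward_mean"]

With this, set config["callbacks"] = MirrorEvalReward and pass checkpoint_score_attr="eval_episode_reward_mean" to tune.run().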
Issue Severity
High: It blocks me from completing my task.
Thanks for reporting this, @Juno-T! This is indeed a regression introduced by a PR. Putting up a fix and a test now.
Thanks for your help on this, @xwjiang2010! :)