[Feature Request] Discounted Return value in policy evaluation
🚀 Feature
Enable the user to save the discounted return instead of the accumulated reward in policy evaluation.
Motivation
Currently, EvalCallback and policy evaluation provide the user with the accumulated reward per episode. However, it is sometimes more sensible to report the discounted return: $$\sum_{i} \gamma^{i} r_{i}$$ The discounted return cannot be inferred from the accumulated reward alone, so additional information must be provided.
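As a toy illustration (plain Python, not SB3 code; the numbers are arbitrary), two episodes can have the same accumulated reward but different discounted returns:

```python
# Rewards r_0, r_1, r_2 collected in one episode, with discount factor gamma.
rewards = [1.0, 0.0, 2.0]
gamma = 0.9

accumulated_reward = sum(rewards)                                     # 3.0
discounted_return = sum(gamma**i * r for i, r in enumerate(rewards))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```

An episode with the single reward [3.0] has the same accumulated reward (3.0) but a different discounted return (3.0 vs. 2.62), which is why the accumulated reward alone is not enough.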
Pitch
Add an argument calc_discounted_return (or whatever you want to call it) to EvalCallback that, when True, changes the output to the discounted return instead of the accumulated reward. The $\gamma$ value is available in the model to which the callback has access. Even though there is no gamma field in the BaseAlgorithm class, it is available in both OnPolicyAlgorithm and OffPolicyAlgorithm, which are all the algorithm types in SB3.
Alternatives
Change EvalCallback and policy evaluation to return all individual rewards in the episode. Specifically, the evaluate_policy function should return episode_rewards, episode_lengths such that episode_lengths is the same, but episode_rewards is a 2d matrix where episode_rewards[i][j] is the reward received in the i-th episode after performing an action at step j. In turn, EvalCallback will save a 3d matrix that contains the evaluation's 2d matrices for each evaluation call during learning. This will give the user the power to perform any calculations they desire for evaluation purposes.
Hello,
> However, it is sometimes more sensible to report the discounted return:
Could you elaborate where/when you would like to do that and why?
> This will give the user the power to perform any calculations they desire for evaluation purposes.
I think you should look into the callback argument of evaluate_policy and write a custom EvalCallback:
https://github.com/DLR-RM/stable-baselines3/blob/52c29dc497fa2eb235d0476b067bed8ac488fe64/stable_baselines3/common/evaluation.py#L17
It has access to all variables from the evaluation: https://github.com/DLR-RM/stable-baselines3/blob/52c29dc497fa2eb235d0476b067bed8ac488fe64/stable_baselines3/common/evaluation.py#L94-L100
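For example, a minimal sketch of such a callback (not an official SB3 feature; it assumes the evaluation loop exposes the per-env index i, the scalar reward and the done flag in its locals, as in the snippet linked above, and that the model defines gamma — variable names may differ between SB3 versions):

```python
import numpy as np

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


class DiscountedReturnCallback:
    """Accumulates gamma-discounted returns while evaluate_policy runs."""

    def __init__(self, gamma: float, n_envs: int = 1):
        self.gamma = gamma
        self.current_returns = np.zeros(n_envs)
        self.current_steps = np.zeros(n_envs, dtype=int)
        self.episode_returns = []

    def __call__(self, locals_: dict, globals_: dict) -> None:
        # "i", "reward" and "done" are taken from the evaluation loop's locals (see link above).
        i = locals_["i"]
        self.current_returns[i] += self.gamma ** self.current_steps[i] * locals_["reward"]
        self.current_steps[i] += 1
        if locals_["done"]:
            self.episode_returns.append(self.current_returns[i])
            self.current_returns[i] = 0.0
            self.current_steps[i] = 0


model = PPO("MlpPolicy", "CartPole-v1")
callback = DiscountedReturnCallback(gamma=model.gamma, n_envs=model.get_env().num_envs)
evaluate_policy(model, model.get_env(), n_eval_episodes=5, callback=callback)
print(callback.episode_returns)
```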
> Even though there is no gamma field in the BaseAlgorithm class, it is available in both OnPolicyAlgorithm and OffPolicyAlgorithm, which are all the algorithm types in SB3.
Actually no... we have ARS/CEM in SB3 contrib, where gamma doesn't make sense.
PS: the callback argument is used here in EvalCallback: https://github.com/DLR-RM/stable-baselines3/blob/d64bcb401ad7d45799af1feee5c1058943be23f0/stable_baselines3/common/callbacks.py#L401
> Could you elaborate where/when you would like to do that and why?
In most cases, this is what the algorithm is optimizing. It is useful to see the progress of training relative to the actual objective. This is also seen in basic classification problems where we plot the training/validation curve of the optimized loss over time, even though we are really trying to optimize accuracy.
> I think you should look into the callback argument of evaluate_policy and write a custom EvalCallback
I have already done this for my current experiment. I just figured this should be implemented in SB3 itself, since this is such an important metric (and the code repetition burns my eyes). I would be surprised to learn that I'm the only one who needs this feature.
> Actually no... we have ARS/CEM in SB3 contrib, where gamma doesn't make sense.
I see. Then perhaps gamma should be an argument that defines the accumulation discount. Set it to 1 by default, which is equivalent to the accumulated reward.
> In most cases, this is what the algorithm is optimizing. It is useful to see the progress of training relative to the actual objective.
Then maybe the right place for that would be the Monitor wrapper rather than EvalCallback?
EDIT: by using a custom Monitor wrapper, you can log the discounted sum of rewards without any change to EvalCallback
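A minimal sketch of that idea (the wrapper class and the discounted_return key are illustrative, not part of SB3, and it assumes the Gymnasium step API used by recent SB3 versions):

```python
import gymnasium as gym

from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor


class DiscountedReturnWrapper(gym.Wrapper):
    """Adds the gamma-discounted return to `info` at the end of every episode."""

    def __init__(self, env: gym.Env, gamma: float):
        super().__init__(env)
        self.gamma = gamma
        self._discounted_return = 0.0
        self._step_idx = 0

    def reset(self, **kwargs):
        self._discounted_return = 0.0
        self._step_idx = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._discounted_return += self.gamma**self._step_idx * float(reward)
        self._step_idx += 1
        if terminated or truncated:
            info["discounted_return"] = self._discounted_return
        return obs, reward, terminated, truncated, info


gamma = 0.99
# Monitor copies the listed info keys into its per-episode record.
env = Monitor(DiscountedReturnWrapper(gym.make("CartPole-v1"), gamma), info_keywords=("discounted_return",))
model = PPO("MlpPolicy", env, gamma=gamma).learn(total_timesteps=10_000)
```

The per-episode discounted return then ends up in info["episode"] (and in the Monitor CSV if a filename is passed), without any change to EvalCallback.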