stable-baselines3-contrib
Decrease in reward during training with MaskablePPO
❓ Question
Hi,
While training MaskablePPO on a custom environment, the episode reward decreased over time and then converged at the lower value. Is there a specific reason for this? Does it mean the algorithm found a better policy but ended up outputting a worse one?
My environment produces two normalized rewards that are combined by a weighted sum into the final reward. Each episode has 19 timesteps, and gamma was set to 0.0001 (see the code below).
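In case it helps, the reward is shaped roughly as follows (a minimal sketch; `w1`, `w2`, `r1`, and `r2` are placeholder names, not my actual implementation):

```python
# Hypothetical illustration of the reward described above:
# two normalized sub-rewards combined into one scalar by a weighted sum.
def compute_reward(r1: float, r2: float, w1: float = 0.5, w2: float = 0.5) -> float:
    # r1 and r2 are each normalized; the weights are fixed constants.
    return w1 * r1 + w2 * r2
```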
Minimal code to reproduce:

```python
import gymnasium as gym

from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker

class customenv(gym.Env):
    ...  # custom environment definition

env = customenv()
env = ActionMasker(env, mask_fn)  # mask_fn returns the valid-action mask
model = MaskablePPO(MaskableActorCriticPolicy, env, gamma=0.0001, verbose=0)
model.learn(4_000_000)
```
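For context, with this gamma the discounted return over the 19 timesteps is dominated almost entirely by the immediate reward (a quick check, assuming a hypothetical constant per-step reward of 1.0):

```python
# With gamma = 0.0001, rewards beyond the current step contribute
# almost nothing to the discounted return.
gamma = 0.0001
rewards = [1.0] * 19  # hypothetical constant reward at each of the 19 timesteps
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # ~1.0001, i.e. essentially just the first reward
```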
Thank you!
Checklist
- [X] I have checked that there is no similar issue in the repo
- [X] I have read the documentation
- [X] If code there is, it is minimal and working
- [X] If code there is, it is formatted using the markdown code blocks for both code and stack traces.