
[Question] Integrating Behavior Cloning With Maskable PPO

Open kaihansen8 opened this issue 8 months ago • 2 comments

❓ Question

So I created a custom environment for chess. I want to train a MaskablePPO model using imitation's behavior cloning for the initial weights, then run .learn() for continued training on the environment so that it starts to understand board states that behavior cloning doesn't see during training. It works with behavior cloning training the policy, but then when I get to .learn(), it works until it sees a board state that it has never seen before, then the action mask makes all valid actions and invalid actions "but found invalid values: tensor([[7.9758e-07, 7.9733e-07, 7.9821e-07, ..., 7.9767e-07, 7.9793e-07, 7.9746e-07]])". I was wondering if there is a way to overcome this issue on my end? Potentially, if it gets to something it has never seen before, make all valid actions uniform in value, since it doesn't know yet? Or is this a deeper issue that hasn't been solved/thought of yet?
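The "uniform over valid actions" fallback suggested above can be sketched outside SB3. This is a standalone NumPy illustration of masked-softmax behavior, not sb3-contrib's actual implementation: invalid-action logits are dropped before the softmax, and if the masked distribution degenerates (NaN or an all-zero numerator, e.g. from an empty mask), we substitute a uniform distribution over the valid actions:

```python
import numpy as np

def masked_softmax(logits, mask, uniform_fallback=False):
    """Softmax restricted to valid actions (invalid actions get probability 0).

    With uniform_fallback=True, a degenerate result is replaced by a
    uniform distribution over the valid actions, which is the workaround
    the question proposes for never-seen states.
    """
    mask = np.asarray(mask, dtype=bool)
    masked = np.where(mask, logits, -np.inf)   # drop invalid actions
    with np.errstate(invalid="ignore"):
        shifted = masked - masked.max()        # shift for numerical stability
        exp = np.exp(shifted)
        total = exp.sum()
    if uniform_fallback and (not np.isfinite(total) or total == 0.0):
        valid = mask.astype(float)
        # uniform over valid actions; all zeros if nothing is valid
        return valid / valid.sum() if mask.any() else valid
    return exp / total

logits = np.array([2.0, -1.0, 0.5, 3.0])
mask = np.array([True, False, True, False])
probs = masked_softmax(logits, mask)  # zero on masked actions, sums to 1
```

Note that as long as at least one action is valid, the masked softmax always sums to 1 by construction, so the fallback only ever triggers when the mask (or the logits) are already broken for that state.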


kaihansen8 avatar Apr 19 '25 21:04 kaihansen8

until it see's a board state that it has never seen before, then action mask makes all valid actions and invalid actions

Is a word missing there? I'm not sure I understand what you mean. Also, double-check that the action mask is correct and is not masking everything.

araffin avatar Apr 24 '25 09:04 araffin

So when I attempt online learning, it works just fine. It's when I initialize the weights of the MaskablePPO model via behavior cloning, load them into the model, and then run the training: once it gets to a state it has never seen before, it returns an error that the logits do not sum to 1, that they are all approximately zero. Instead of attempting one of the valid actions, it has all actions equal to 0.
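One concrete way this error arises, and the first thing to rule out per the reply above, is an empty action mask for the unseen state. Masking implementations typically replace invalid-action logits with a very large negative number, so if the mask rejects every action the exponentials underflow to zero and the normalization produces NaN, which is exactly what Categorical-style probability validation rejects. A minimal NumPy reproduction of the symptom (the constant is a stand-in, not sb3-contrib's actual code):

```python
import numpy as np

VERY_NEG = -1e8  # stand-in for the large negative constant used when masking

logits = np.array([0.3, -0.2, 1.1])

# Healthy case: at least one valid action, probabilities sum to 1
ok_mask = np.array([True, False, True])
ok = np.exp(np.where(ok_mask, logits, VERY_NEG))
ok_probs = ok / ok.sum()

# Degenerate case: everything masked, exp() underflows to 0 and 0/0 gives NaN,
# so the "probabilities" no longer form a valid distribution
empty_mask = np.array([False, False, False])
bad = np.exp(np.where(empty_mask, logits, VERY_NEG))
with np.errstate(invalid="ignore"):
    bad_probs = bad / bad.sum()
```

If the mask turns out to be fine for the failing state, the next thing to inspect would be whether the behavior-cloned weights themselves emit NaN/inf logits on that observation.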

kaihansen8 avatar Apr 24 '25 13:04 kaihansen8