[Question] Integrating Behavior Cloning With Maskable PPO
❓ Question
So I created a custom environment for chess. I want to train a MaskablePPO model by using `imitation`'s behavior cloning to set the initial weights, then run `.learn()` for continued training on the environment so that it starts to understand board states that behavior cloning doesn't see during training. Behavior cloning trains the policy without issues, but when I get to `.learn()` it works until it reaches a board state it has never seen before; at that point the action mask leaves both valid and invalid actions with near-zero probability and training fails with `but found invalid values: tensor([[7.9758e-07, 7.9733e-07, 7.9821e-07, ..., 7.9767e-07, 7.9793e-07, 7.9746e-07]])`. Is there a way to overcome this issue on my end? Potentially, if it gets to a state it has never seen before, could I make all valid actions uniform in probability, since it doesn't know anything about that state yet? Or is this a deeper issue that hasn't been solved/thought of yet?
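Roughly, the pipeline I have in mind looks like the sketch below (simplified: `ChessEnv`, `expert_obs`, and `expert_actions` are placeholder names, and the behavior-cloning step is written directly against the MaskablePPO policy instead of through the `imitation` library, just to keep the example self-contained):

```python
import torch
from sb3_contrib import MaskablePPO

# Placeholders: ChessEnv is the custom chess env (it exposes an action_masks()
# method), expert_obs / expert_actions hold the demonstration data as arrays.
env = ChessEnv()
model = MaskablePPO("MlpPolicy", env, verbose=1)
policy = model.policy
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Behavior cloning: maximize the log-probability of the expert moves.
obs_batch = torch.as_tensor(expert_obs, dtype=torch.float32, device=model.device)
act_batch = torch.as_tensor(expert_actions, dtype=torch.long, device=model.device)
for epoch in range(10):
    _, log_prob, _ = policy.evaluate_actions(obs_batch, act_batch)
    loss = -log_prob.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Continue training online; MaskablePPO queries env.action_masks() each step.
model.learn(total_timesteps=100_000)
```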
Checklist
- [x] I have checked that there is no similar issue in the repo
- [x] I have read the documentation
- [x] If code there is, it is minimal and working
- [x] If code there is, it is formatted using the markdown code blocks for both code and stack traces.
> at that point the action mask leaves both valid and invalid actions with near-zero probability

I'm not sure I understand what you mean by that. Also double check that the action mask is correct and not masking everything.
So when I attempt online learning on its own, it works just fine. It's when I initialize the weights of the MaskablePPO model via behavior cloning, load them into the model, and then run training that, once it gets to a state it has never seen before, it raises an error that the action probabilities do not sum to 1 and are all approximately zero. Instead of putting probability on one of the valid actions, it assigns roughly zero to every action.
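For reference, a rough way to inspect the mask and the masked action distribution at the failing state could look like the sketch below (hedged: `obs` and `env` are placeholders for the observation and environment at the point where `.learn()` fails, `model` is the pretrained MaskablePPO instance, and a single Discrete action space with the default masked categorical distribution is assumed):

```python
import torch
from sb3_contrib.common.maskable.utils import get_action_masks

# Placeholders: `model` is the pretrained MaskablePPO, `env` the environment used
# for .learn(), and `obs` the observation at which the error is raised.
mask = get_action_masks(env)
print("legal actions in mask:", mask.sum(), "out of", mask.size)
assert mask.any(), "mask is all False: every action would be masked out"

obs_tensor, _ = model.policy.obs_to_tensor(obs)
with torch.no_grad():
    dist = model.policy.get_distribution(obs_tensor, action_masks=mask)
    probs = dist.distribution.probs    # probabilities after masking
    logits = dist.distribution.logits  # logits after masking

print("total probability mass:", probs.sum().item())   # should be ~1.0
print("largest logit magnitude:", logits.abs().max().item())
```

If the mask itself looks correct, the printed logit magnitude at least shows whether the pretrained network is emitting extreme or non-finite values for that particular state.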