
[Question] Manually Controlling Actions During PPO Training

Open wayne-weiwei opened this issue 1 year ago • 2 comments

❓ Question

Thank you very much for creating such an excellent tool. I am currently using the PPO algorithm in Stable-Baselines3 (SB3) for training in a custom environment. During this process, I encountered an issue that I would appreciate your guidance on.

When I call model.learn(total_timesteps=10e6), PPO blocks the calling thread and focuses entirely on the learning process, which stops the communication inside my environment while training runs. I would like to manually control the actions during training, similar to the following loop:

action, _states = model.predict(obs)
obs, reward, terminated, truncated, info = env.step(action)

Is there a way to keep training the PPO model while manually controlling the action selection and keeping the environment's communication running? Do you have any recommended solutions? I greatly appreciate your time and any insights you can provide; your work has been incredibly valuable, and I look forward to any suggestions you might have.
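
For completeness, the full loop I have in mind looks roughly like this (a sketch assuming a Gymnasium-style environment; as far as I understand, this only runs inference and does not update the policy):

# Manual control loop (inference only, no PPO updates).
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()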

wayne-weiwei avatar Sep 25 '24 13:09 wayne-weiwei

Hello, this is hard to answer without a minimal example that reproduces the behavior. .learn() does two things (see the docs): it collects data and it trains the model. While the model is being updated, no data is collected, so that might be what you are seeing.
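
As an illustration (a minimal sketch, not from the original reply; the environment id and step counts here are arbitrary), this alternation can be observed with a callback:

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class PhaseLogger(BaseCallback):
    """Print when .learn() switches between collecting data and training."""

    def _on_rollout_start(self) -> None:
        print(f"collecting data (num_timesteps={self.num_timesteps})")

    def _on_rollout_end(self) -> None:
        print(f"updating model (num_timesteps={self.num_timesteps})")

    def _on_step(self) -> bool:
        return True  # called at every environment step during collection

model = PPO("MlpPolicy", "CartPole-v1", n_steps=64, verbose=0)
model.learn(total_timesteps=256, callback=PhaseLogger())

Between _on_rollout_end and the next _on_rollout_start, no env.step() calls are made: that is the pause in environment communication described above.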

araffin avatar Oct 04 '24 06:10 araffin

Thank you for the reply. When I set up a custom gym environment in Webots and used the following code for training:

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

env = Customer()  # custom Webots gym environment
check_env(env)

# Train
model = PPO('MlpPolicy', env, n_steps=2048, verbose=1)
model.learn(total_timesteps=10)

The algorithm did run, but it didn't behave correctly in the Webots environment: the actions remained the same and the reward never changed. After the training steps completed, though, everything appeared to finish normally. I'm wondering whether I need to modify the learning process or whether I missed something in the environment setup.
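
One sanity check here (a sketch, assuming the same env instance as above) would be to drive the environment with random actions, independent of PPO, to verify that observations and rewards actually respond:

obs, info = env.reset()
for _ in range(10):
    action = env.action_space.sample()  # random action, no policy involved
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"action={action}, reward={reward}")
    if terminated or truncated:
        obs, info = env.reset()

If the reward stays constant even here, the issue would lie in the Webots environment's step/reward logic rather than in .learn(). (Also, with n_steps=2048 and total_timesteps=10, PPO collects one full rollout and performs at most a single update, so little visible policy change would be expected.)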

wayne-weiwei avatar Oct 05 '24 10:10 wayne-weiwei