
[Question] Manually Controlling Actions During PPO Training

Open wayne-weiwei opened this issue 1 year ago • 2 comments

❓ Question

Thank you very much for creating such an excellent tool. I am currently using the PPO algorithm in Stable-Baselines3 (SB3) for training in a custom environment. During this process, I encountered an issue that I would appreciate your guidance on.

When I call model.learn(total_timesteps=10e6), PPO blocks the calling thread and focuses entirely on the learning process, which stops the communication inside my environment while training runs. I would like to manually control the actions during training, similar to the following loop:

action, _states = model.predict(obs)
obs, reward, terminated, truncated, info = env.step(action)

Is there a way to keep training the PPO model while manually controlling the action selection and keeping the environment's communication running? Do you have any recommended solutions? I greatly appreciate your time and any insights you can provide; your work has been incredibly valuable, and I look forward to any suggestions you might have.
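
For completeness, the full loop I have in mind looks roughly like this (a sketch assuming a Gymnasium-style environment; as far as I understand, this only runs inference and does not update the policy):

# Manual control loop (inference only, no PPO updates).
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()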

wayne-weiwei avatar Sep 25 '24 13:09 wayne-weiwei

Hello, this is hard to answer without a minimal example that reproduces the behavior. .learn() does two things (see the docs): it collects data and it trains the model. While the model is being updated, no data is collected, so that might be what you are seeing.
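
As an illustration (a minimal sketch, not from the original reply; the environment id and step counts here are arbitrary), this alternation can be observed with a callback:

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class PhaseLogger(BaseCallback):
    """Print when .learn() switches between collecting data and training."""

    def _on_rollout_start(self) -> None:
        print(f"collecting data (num_timesteps={self.num_timesteps})")

    def _on_rollout_end(self) -> None:
        print(f"updating model (num_timesteps={self.num_timesteps})")

    def _on_step(self) -> bool:
        return True  # called at every environment step during collection

model = PPO("MlpPolicy", "CartPole-v1", n_steps=64, verbose=0)
model.learn(total_timesteps=256, callback=PhaseLogger())

Between _on_rollout_end and the next _on_rollout_start, no env.step() calls are made: that is the pause in environment communication described above.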

araffin avatar Oct 04 '24 06:10 araffin

Thank you for the reply. When I set up a custom gym environment in Webots and used the following code for training:

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

env = Customer()  # custom Webots gym environment
check_env(env)

# Train
model = PPO('MlpPolicy', env, n_steps=2048, verbose=1)
model.learn(total_timesteps=10)

The algorithm did run, but it didn't behave correctly in the Webots environment: the actions remained the same and the reward never changed. After the training steps completed, though, everything appeared to finish normally. I'm wondering whether I need to modify the learning process or whether I missed something in the environment setup.
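
One sanity check here (a sketch, assuming the same env instance as above) would be to drive the environment with random actions, independent of PPO, to verify that observations and rewards actually respond:

obs, info = env.reset()
for _ in range(10):
    action = env.action_space.sample()  # random action, no policy involved
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"action={action}, reward={reward}")
    if terminated or truncated:
        obs, info = env.reset()

If the reward stays constant even here, the issue would lie in the Webots environment's step/reward logic rather than in .learn(). (Also, with n_steps=2048 and total_timesteps=10, PPO collects one full rollout and performs at most a single update, so little visible policy change would be expected.)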

wayne-weiwei avatar Oct 05 '24 10:10 wayne-weiwei