Problems with reinforcement learning with stable-baselines3
With reference to this documentation, I have some problems implementing reinforcement learning using stable-baselines3. I've trained the model for 1 million timesteps, and once I load the model for rendering, the drone moves a little around its initial position but doesn't reach any waypoint. Is it possible that there is a problem with how the reward function is defined?
Could I see the full list of parameters you used to instantiate the class?
I just copied the code from the documentation (the link in my previous comment), in particular the "Flattening the Environment" part:
import gymnasium
import PyFlyt.gym_envs
from PyFlyt.gym_envs import FlattenWaypointEnv
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3", render_mode="human")
env = FlattenWaypointEnv(env, context_length=2)
term, trunc = False, False
obs, _ = env.reset()
while not (term or trunc):
    obs, rew, term, trunc, _ = env.step(env.action_space.sample())
Are you using SAC or PPO?
PPO
With PPO you generally need about 10x the number of samples as SAC. In my experiments, 1M steps got me decent behaviour, while 3M steps got very high performance behaviour.
Your case may just need more training, try something like 10M or 20M steps.
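For reference, a minimal training sketch (default hyperparameters, purely illustrative, and assuming `env` is the flattened environment from your snippet above), either extending the PPO run or trying SAC instead:

from stable_baselines3 import PPO, SAC

# illustrative only: default hyperparameters on the flattened env
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000_000)

# or SAC, which tends to need far fewer samples
# model = SAC("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=1_000_000)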
I tried to implement a training run of 2M steps on Kaggle. It trained for 3 hours, but during rendering there are two problems:
- waypoints change continuously
- the drone stays still and the simulation ends shortly after it begins
import gymnasium
import PyFlyt.gym_envs
from stable_baselines3 import PPO, SAC, A2C, TD3
from PyFlyt.gym_envs import FlattenWaypointEnv
from stable_baselines3.common.vec_env import DummyVecEnv
from PyFlyt.gym_envs.quadx_envs import quadx_waypoints_env
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3",render_mode = None)
env = FlattenWaypointEnv(env, context_length=2)
model = PPO("MlpPolicy", env, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=2_000_000)
model.save(path)
On my PC:
model = PPO.load(path)
episodes = 10000
for _ in range(episodes):
    obs, _ = env.reset()
    done = trunc = False
    while not done and not trunc:
        action, _states = model.predict(obs)
        obs, reward, done, trunc, info = env.step(action)
Waypoints should change every episode; this is normal. If you're using PPO, I would expect that the agent has just learned to hover and not much else at 2M timesteps.
By default, the environment ends after 10 seconds and a new episode begins, that's why you're seeing it end so fast.
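If you want longer episodes while rendering, you should be able to pass the duration through gymnasium.make; I'm assuming here that make forwards the kwarg to the env constructor:

# assumption: gymnasium.make forwards extra kwargs to QuadXWaypointsEnv,
# so the default 10 s episode length can be extended like this
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3", render_mode="human", max_duration_seconds=30.0)
env = FlattenWaypointEnv(env, context_length=2)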
I'm not super familiar with SB3, but you should be able to use env parallelism with PPO, and scale it up to 10M steps easily while using about the same training time. See Parallel Environments in the Example section here.
edit: do you have access to the reward after the 1M timesteps? Generally, it should be at least above 0.0 at that point, if not in the 100s or above.
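If you don't have the training logs handy, a quick sanity check is SB3's evaluate_policy (a sketch, assuming model and env are the ones from your training snippet):

from stable_baselines3.common.evaluation import evaluate_policy

# rough estimate of the episodic reward of the current policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean episodic reward: {mean_reward:.2f} +/- {std_reward:.2f}")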
I can offer some assistance here, and hopefully this will help you out @Milly37.
When using SB3 you can use:
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env(
    reg_env_creator(env_config),
    n_envs=10,
    seed=1,
    vec_env_cls=SubprocVecEnv,
)
make_vec_env returns a vectorized environment that SB3 can use for parallelization, creating n_envs copies of the environment. Then you can feed the vectorized environment into the model like so:
model = PPO(
    ActorCriticPolicy,
    env,
    # ...other args
)
env_config is a dict with whichever PyFlyt environment configuration you are using, and here is an example of reg_env_creator:
def reg_env_creator(config):
    def create_env():
        # config holds the environment constructor kwargs; this is how env_config is unpacked
        env = FixedwingWaypointsEnv(**config)
        context_length = config.get("context_length", 2)
        env = FlattenWaypointEnv(env, context_length)
        return env
    return create_env
and for clarity env_config can look like this:
env_config = {
    'sparse_reward': False,
    'num_targets': 4,
    'goal_reach_distance': 4.0,
    # rest of args,
}
In terms of training itself, I have found that the drone environments all converge very quickly in general, but the fixed-wing environments do not. Let us know if this helps with parallelizing the env!
Thanks for the assistance @tlaurie99! 🙏🫰
@tlaurie99 thank you for your suggestion, but I failed to make it work on Kaggle. I've tried a slightly different version, but I think there is a problem with make_vec_env and the QuadXWaypointsEnv.
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3")
env = FlattenWaypointEnv(env, context_length=2)
gymnasium.make works fine. However, when I try env = make_vec_env("PyFlyt/QuadX-Waypoints-v3", n_envs=4), I get the following error:
I don't understand the meaning of this error; if someone can help me understand it, I'd appreciate it.
My suspicion is that SB3 doesn't play well with the Sequence space that the waypoints environment is using to represent the waypoints.
You'll probably need to use the solution by @tlaurie99 with reg_env_creator.
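A quick way to see what the wrapper does is to print the observation space before and after flattening (just a sketch for inspection):

import gymnasium
import PyFlyt.gym_envs
from PyFlyt.gym_envs import FlattenWaypointEnv

env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3")
print(env.observation_space)  # includes the Sequence space for the waypoints
env = FlattenWaypointEnv(env, context_length=2)
print(env.observation_space)  # a flat Box that SB3 policies can consume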
@Milly37 you can't pass the string to make_vec_env like you would normally do for gym.make() here; make_vec_env is expecting a callable, so it can call the function and get back an environment instance to vectorize. Instead, you have to do something like I have above, where you return an instance of the environment from within a function. Sure, it is a little convoluted, but that's what SB3 expects. I.e. like this:
env_config = {
    'sparse_reward': False,
    'num_targets': 4,
    'goal_reach_distance': 4.0,
    'flight_mode': 0,
    'flight_dome_size': 150.0,
    # increase time for env termination
    'max_duration_seconds': 120.0,
    'angle_representation': "quaternion",
    'agent_hz': 30,
    'render_mode': None,
    'render_resolution': (480, 480),
}

def reg_env_creator(config):
    # callable function that will return an environment when SB3 calls it
    def create_env():
        # creates an environment and returns an instance of the environment
        env = QuadXWaypointsEnv(**config)
        context_length = config.get('context_length', 2)
        env = FlattenWaypointEnv(env, context_length)
        return env
    return create_env

env = make_vec_env(reg_env_creator(env_config), n_envs=4)
Change your config settings to what you need and this should now work. These are my settings for some fixed-wing stuff, but I threw the quad waypoints env in here for demonstration. Hope it helps!
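From there the vectorized env plugs straight into PPO as usual (a sketch; the save path and step count are just placeholders):

from stable_baselines3 import PPO

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000_000)
model.save("ppo_quadx_waypoints")  # placeholder path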
I'm sorry to bother you again. I've tried it as you said, but I get this error: 'QuadXWaypointsEnv' is not defined
I think you need to import it?
At the top of the file:
from PyFlyt.gym_envs.quadx_envs.quadx_waypoints_env import QuadXWaypointsEnv :)
Yes, I've changed the imports so many times that I forgot to rewrite that one. I hope this will be the last time I bother you, but at the moment I have problems with the visualization on my PC:
episodes = 7
for _ in range(episodes):
    obs, _ = env.reset()
    done = trunc = False
    while not done and not trunc:
        action, _states = model.predict(obs)
        obs, reward, done, trunc, info = env.step(action)
I get two errors:
- obs, _ = env.reset() raises ValueError: too many values to unpack (expected 2)
- obs, reward, done, trunc, info = env.step(action) raises ValueError: not enough values to unpack (expected 5, got 4)
@Milly37 which version of gym are you using? You can run gym.__version__ in a cell and execute it to get the version. If it is < 0.26, then that gym will only return obs from reset, not (obs, info). If it is >= 0.26, then the issue is with how you are building the environment itself. Do you have that available?
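For reference, a sketch of the two return signatures (and note that SB3's VecEnvs, e.g. from make_vec_env, also use the older 4-value style):

# gym < 0.26 (old API) -- SB3 VecEnvs return these shapes as well
obs = env.reset()
obs, reward, done, info = env.step(action)

# gymnasium / gym >= 0.26 (new API)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(action)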
EDIT: I have corrected some errors that I was making on my PC, and the simulations start now.
I have one last problem: even though the reward is pretty good (I attach a screenshot from TensorBoard), when running the latest model from Kaggle in the simulation, the drone only rarely touches the first waypoint.
Do you have any idea how to resolve this?
@tlaurie99 Hello! I'm a beginner in reinforcement learning and I hope to get your help. I would be very grateful if you could provide an example of training and testing a fixed-wing UAV to follow waypoints using PyFlyt and stable_baselines3. I'm looking forward to your reply and thank you very much!
@ZF113120 something like this would work:
imports:
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
from PyFlyt.gym_envs import FlattenWaypointEnv
from PyFlyt.gym_envs.fixedwing_envs.fixedwing_waypoints_env import FixedwingWaypointsEnv
environment creator for SB3 with PyFlyt:
def reg_env_creator(config):
    # callable function that SB3 expects to call and return an env instance
    def create_env():
        env = FixedwingWaypointsEnv(**config)
        context_length = config.get('context_length', 2)
        # use_yaw_targets = config.get('use_yaw_targets', False)
        env = FlattenWaypointEnv(env, context_length)
        return env
    return create_env
with config:
env_config = {
    'sparse_reward': False,
    'num_targets': 4,
    'goal_reach_distance': 4.0,
    'flight_mode': 0,
    'flight_dome_size': 150.0,
    'max_duration_seconds': 30.0,
    'angle_representation': "quaternion",
    'agent_hz': 30,
}
params for SB3:
set_ctrl_freq = 40
num_envs = 10
n_steps = 402
learning_rate = 1e-3
batch_size = 4096
n_epochs = 10
total_timesteps = 10_000_000
vectorize the environment for SB3's parallelism:
env = make_vec_env(
    reg_env_creator(env_config),
    n_envs=num_envs,
    seed=0,
    vec_env_cls=SubprocVecEnv,
)
set up SB3 model (with PPO):
model = PPO(
    ActorCriticPolicy,
    env,  # this will be the vectorized env as above
    learning_rate=learning_rate,
    n_steps=n_steps,
    batch_size=batch_size,
    n_epochs=n_epochs,
    verbose=1,
)
run the model:
model.learn(total_timesteps=total_timesteps)
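and, for testing, a rough sketch of rendering the trained policy on a single (non-vectorized) env; this reuses the env_config from above and assumes FixedwingWaypointsEnv accepts render_mode as in the earlier config example:

# build one render-capable env with the same config used for training
eval_env = FlattenWaypointEnv(
    FixedwingWaypointsEnv(render_mode="human", **env_config),
    context_length=2,
)

obs, _ = eval_env.reset()
term = trunc = False
while not (term or trunc):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, term, trunc, info = eval_env.step(action)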
@tlaurie99 Thank you very much for your help!
Wish you happiness every day!
Thanks @tlaurie99 !
@ZF113120 If you're still experiencing issues, maybe try commenting out this line.
@tlaurie99 @jjshoots
I noticed a minor issue while running the following program: the terminal keeps flickering with "argv[0]=", yet it isn't coming from any print or output statement in my code. If you're interested, you could try running the code. I'd appreciate it if you could help fix this issue in your free time to enhance PyFlyt.
import gymnasium as gym
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
import PyFlyt
import PyFlyt.gym_envs
from PyFlyt.gym_envs import FlattenWaypointEnv
if __name__ == '__main__':
    set_ctrl_freq = 40
    num_envs = 10
    n_steps = 402
    learning_rate = 1e-3
    batch_size = 4096
    n_epochs = 10
    total_timesteps = 10_000_000

    env = gym.make("PyFlyt/Fixedwing-Waypoints-v4", render_mode=None)
    train_env = FlattenWaypointEnv(env, context_length=2)

    model = PPO(
        ActorCriticPolicy,
        train_env,
        learning_rate=learning_rate,
        n_steps=n_steps,
        batch_size=batch_size,
        n_epochs=n_epochs,
        verbose=1,
    )

    model.learn(total_timesteps=total_timesteps)
I'm looking forward to your reply!
Wish you happiness every day!
Hi @ZF113120 , that printout is a native printout from PyBullet as your code spins up parallel environments. It goes away because this line deletes that printout again.
@jjshoots Thanks for your prompt reply and for clearing up my doubts! I’m truly grateful!