Problems with reinforcement learning with stable-baselines3
With reference to this documentation, I have some problems implementing reinforcement learning using stable-baselines3. I've trained the model for 1 million timesteps, and once I load the model for rendering, the drone moves a little around its initial position but doesn't reach any waypoint. Is it possible that there is a problem with how the reward function is defined?
Could I see the full list of parameters you used to instantiate the class?
I just copied the code from the documentation (the link in my previous comment), in particular the "Flattening the Environment" part:
import gymnasium
import PyFlyt.gym_envs
from PyFlyt.gym_envs import FlattenWaypointEnv
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3", render_mode="human")
env = FlattenWaypointEnv(env, context_length=2)
term, trunc = False, False
obs, _ = env.reset()
while not (term or trunc):
    obs, rew, term, trunc, _ = env.step(env.action_space.sample())
Are you using SAC or PPO?
PPO
With PPO you generally need about 10x the number of samples as SAC. In my experiments, 1M steps got me decent behaviour, while 3M steps got very high performance behaviour.
Your case may just need more training, try something like 10M or 20M steps.
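For reference, a minimal training sketch (default hyperparameters, purely illustrative, and assuming `env` is the flattened environment from your snippet above), either extending the PPO run or trying SAC instead:

from stable_baselines3 import PPO, SAC

# illustrative only: default hyperparameters on the flattened env
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000_000)

# or SAC, which tends to need far fewer samples
# model = SAC("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=1_000_000)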
I tried to implement a training run of 2M steps on Kaggle. It trained for 3 hours, but during rendering there are two problems:
- waypoints change continuously
- the drone stays still and the simulation ends shortly after it begins
import gymnasium
import PyFlyt.gym_envs
from stable_baselines3 import PPO, SAC, A2C, TD3
from PyFlyt.gym_envs import FlattenWaypointEnv
from stable_baselines3.common.vec_env import DummyVecEnv
from PyFlyt.gym_envs.quadx_envs import quadx_waypoints_env
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3",render_mode = None)
env = FlattenWaypointEnv(env, context_length=2)
model = PPO("MlpPolicy", env, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=2_000_000)
model.save(path)
On my PC:
model = PPO.load(path)
episodes = 10000
for _ in range(episodes):
    obs, _ = env.reset()
    done = trunc = False
    while not done and not trunc:
        action, _states = model.predict(obs)
        obs, reward, done, trunc, info = env.step(action)
Waypoints should change every episode; this is normal. If you're using PPO, I would expect that the agent has just learned to hover and not much else at 2M timesteps.
By default, the environment ends after 10 seconds and a new episode begins, that's why you're seeing it end so fast.
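If you want longer episodes while rendering, you should be able to pass the duration through gymnasium.make; I'm assuming here that make forwards the kwarg to the env constructor:

# assumption: gymnasium.make forwards extra kwargs to QuadXWaypointsEnv,
# so the default 10 s episode length can be extended like this
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3", render_mode="human", max_duration_seconds=30.0)
env = FlattenWaypointEnv(env, context_length=2)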
I'm not super familiar with SB3, but you should be able to use env parallelism with PPO, and scale it up to 10M steps easily while using about the same training time. See Parallel Environments in the Example section here.
edit: do you have access to the reward after the 1M timesteps? Generally, it should be at least above 0.0 at that point, if not in the 100s or above.
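If you don't have the training logs handy, a quick sanity check is SB3's evaluate_policy (a sketch, assuming model and env are the ones from your training snippet):

from stable_baselines3.common.evaluation import evaluate_policy

# rough estimate of the episodic reward of the current policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean episodic reward: {mean_reward:.2f} +/- {std_reward:.2f}")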
I can offer some assistance here, and hopefully this will help you out @Milly37.
When using SB3 you can use:
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env(
    reg_env_creator(env_config),
    n_envs=10,
    seed=1,
    vec_env_cls=SubprocVecEnv,
)
make_vec_env returns a vectorized environment that SB3 can use for parallelization, creating n_envs copies of the environment. Then you can feed the vectorized environment into the model like so:
model = PPO(
    ActorCriticPolicy,
    env,
    # ...other args
)
env_config is a dict with whichever PyFlyt environment configuration you are using, and here is an example of reg_env_creator:
def reg_env_creator(config):
    def create_env():
        # config holds the environment constructor kwargs; this is how env_config is unpacked
        env = FixedwingWaypointsEnv(**config)
        context_length = config.get("context_length", 2)
        env = FlattenWaypointEnv(env, context_length)
        return env
    return create_env
and for clarity env_config can look like this:
env_config = {
    'sparse_reward': False,
    'num_targets': 4,
    'goal_reach_distance': 4.0,
    # rest of args,
}
In terms of training itself, I have found that the drone environments all converge very quickly in general, but the fixed-wing environments do not. Let us know if this helps with parallelizing the env!
Thanks for the assistance @tlaurie99! 🙏🫰
@tlaurie99 thank you for your suggestion, but I failed to make it work on Kaggle. I've tried a slightly different version, but I think there is a problem with make_vec_env and the QuadXWaypointsEnv.
env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3")
env = FlattenWaypointEnv(env, context_length=2)
gymnasium.make works fine. However, when I try env = make_vec_env("PyFlyt/QuadX-Waypoints-v3", n_envs=4), I get the following error:
I don't understand the meaning of this error; if someone can help me understand it, I'd appreciate it.
My suspicion is that SB3 doesn't play well with the Sequence space that the waypoints environment is using to represent the waypoints.
You'll probably need to use the solution by @tlaurie99 with reg_env_creator.
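A quick way to see what the wrapper does is to print the observation space before and after flattening (just a sketch for inspection):

import gymnasium
import PyFlyt.gym_envs
from PyFlyt.gym_envs import FlattenWaypointEnv

env = gymnasium.make("PyFlyt/QuadX-Waypoints-v3")
print(env.observation_space)  # includes the Sequence space for the waypoints
env = FlattenWaypointEnv(env, context_length=2)
print(env.observation_space)  # a flat Box that SB3 policies can consume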
@Milly37 you can't pass the string to make_vec_env like you would normally do for gym.make() here; make_vec_env is expecting a callable, so it can call the function and get back an environment instance to vectorize. Instead, you have to do something like I have above, where you return an instance of the environment from within a function. Sure, it is a little convoluted, but that's what SB3 expects. I.e. like this:
env_config = {
    'sparse_reward': False,
    'num_targets': 4,
    'goal_reach_distance': 4.0,
    'flight_mode': 0,
    'flight_dome_size': 150.0,
    # increase time for env termination
    'max_duration_seconds': 120.0,
    'angle_representation': "quaternion",
    'agent_hz': 30,
    'render_mode': None,
    'render_resolution': (480, 480),
}

def reg_env_creator(config):
    # callable function that will return an environment when SB3 calls it
    def create_env():
        # creates an environment and returns an instance of the environment
        env = QuadXWaypointsEnv(**config)
        context_length = config.get('context_length', 2)
        env = FlattenWaypointEnv(env, context_length)
        return env
    return create_env

env = make_vec_env(reg_env_creator(env_config), n_envs=4)
Change your config settings to what you need and this should now work. These are my settings for some fixed-wing stuff, but I threw the quad waypoints env in here for demonstration. Hope it helps!
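From there the vectorized env plugs straight into PPO as usual (a sketch; the save path and step count are just placeholders):

from stable_baselines3 import PPO

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000_000)
model.save("ppo_quadx_waypoints")  # placeholder path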
I'm sorry to bother you again. I've tried it as you said, but I get this error: 'QuadXWaypointsEnv' is not defined
I think you need to import it?
At the top of the file:
from PyFlyt.gym_envs.quadx_envs.quadx_waypoints_env import QuadXWaypointsEnv :)
Yes, I've changed the imports so many times that I forgot to rewrite that one. I hope this will be the last time I bother you, but at the moment I have problems with the visualization on my PC:
episodes = 7
for _ in range(episodes):
    obs, _ = env.reset()
    done = trunc = False
    while not done and not trunc:
        action, _states = model.predict(obs)
        obs, reward, done, trunc, info = env.step(action)
I get two errors:
- obs, _ = env.reset() raises ValueError: too many values to unpack (expected 2)
- obs, reward, done, trunc, info = env.step(action) raises ValueError: not enough values to unpack (expected 5, got 4)
@Milly37 which version of gym are you using? You can run gym.__version__ in a cell and execute it to get the version. If it is < 0.26, then that gym will only return obs from reset, not (obs, info). If it is >= 0.26, then the issue is with how you are building the environment itself. Do you have that available?
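For reference, a sketch of the two return signatures (and note that SB3's VecEnvs, e.g. from make_vec_env, also use the older 4-value style):

# gym < 0.26 (old API) -- SB3 VecEnvs return these shapes as well
obs = env.reset()
obs, reward, done, info = env.step(action)

# gymnasium / gym >= 0.26 (new API)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(action)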
EDIT: I have corrected some errors that I was making on my PC, and the simulations start now.
I have one last problem: even though the reward is pretty good (I attach a screenshot from TensorBoard), when running the latest model from Kaggle in the simulation, the drone only rarely touches the first waypoint.
Do you have any idea how to resolve this?
@tlaurie99 Hello! I'm a beginner in reinforcement learning and I hope to get your help. I would be very grateful if you could provide an example of training and testing a fixed-wing UAV to follow waypoints using PyFlyt and stable_baselines3. I'm looking forward to your reply and thank you very much!
@ZF113120 something like this would work:
imports:
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
from PyFlyt.gym_envs import FlattenWaypointEnv
from PyFlyt.gym_envs.fixedwing_envs.fixedwing_waypoints_env import FixedwingWaypointsEnv
environment creator for SB3 with PyFlyt:
def reg_env_creator(config):
    # callable function that SB3 expects to call and return an env instance
    def create_env():
        env = FixedwingWaypointsEnv(**config)
        context_length = config.get('context_length', 2)
        # use_yaw_targets = config.get('use_yaw_targets', False)
        env = FlattenWaypointEnv(env, context_length)
        return env
    return create_env
with config:
env_config = {
    'sparse_reward': False,
    'num_targets': 4,
    'goal_reach_distance': 4.0,
    'flight_mode': 0,
    'flight_dome_size': 150.0,
    'max_duration_seconds': 30.0,
    'angle_representation': "quaternion",
    'agent_hz': 30,
}
params for SB3:
set_ctrl_freq = 40
num_envs = 10
n_steps = 402
learning_rate = 1e-3
batch_size = 4096
n_epochs = 10
total_timesteps = 10_000_000
vectorize the environment for SB3's parallelism:
env = make_vec_env(
    reg_env_creator(env_config),
    n_envs=num_envs,
    seed=0,
    vec_env_cls=SubprocVecEnv,
)
set up SB3 model (with PPO):
model = PPO(
    ActorCriticPolicy,
    env,  # this will be the vectorized env as above
    learning_rate=learning_rate,
    n_steps=n_steps,
    batch_size=batch_size,
    n_epochs=n_epochs,
    verbose=1,
)
run the model:
model.learn(total_timesteps=total_timesteps)
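and, for testing, a rough sketch of rendering the trained policy on a single (non-vectorized) env; this reuses the env_config from above and assumes FixedwingWaypointsEnv accepts render_mode as in the earlier config example:

# build one render-capable env with the same config used for training
eval_env = FlattenWaypointEnv(
    FixedwingWaypointsEnv(render_mode="human", **env_config),
    context_length=2,
)

obs, _ = eval_env.reset()
term = trunc = False
while not (term or trunc):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, term, trunc, info = eval_env.step(action)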
@tlaurie99 Thank you very much for your help!
Wish you happiness every day!
Thanks @tlaurie99 !
@ZF113120 If you're still experiencing issues, maybe try commenting out this line.
@tlaurie99 @jjshoots
I noticed a minor issue while running the following program: the terminal keeps flickering with "argv[0]=", yet it isn't coming from any print or output statement in my code. If you're interested, you could try running the code. I'd appreciate it if you could help fix this issue in your free time to enhance PyFlyt.
import gymnasium as gym
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
import PyFlyt
import PyFlyt.gym_envs
from PyFlyt.gym_envs import FlattenWaypointEnv
if __name__ == '__main__':
    set_ctrl_freq = 40
    num_envs = 10
    n_steps = 402
    learning_rate = 1e-3
    batch_size = 4096
    n_epochs = 10
    total_timesteps = 10_000_000

    env = gym.make("PyFlyt/Fixedwing-Waypoints-v4", render_mode=None)
    train_env = FlattenWaypointEnv(env, context_length=2)

    model = PPO(
        ActorCriticPolicy,
        train_env,
        learning_rate=learning_rate,
        n_steps=n_steps,
        batch_size=batch_size,
        n_epochs=n_epochs,
        verbose=1,
    )

    model.learn(total_timesteps=total_timesteps)
I'm looking forward to your reply!
Wish you happiness every day!
Hi @ZF113120 , that printout is a native printout from PyBullet as your code spins up parallel environments. It goes away because this line deletes that printout again.
@jjshoots Thanks for your prompt reply and for clearing up my doubts! I’m truly grateful!