Beta Policy
Hi there,
With this PR I propose to add a "Beta Policy". This policy is naturally bounded, which provides nice guarantees when learning on constrained action spaces.
I had some issues with the automatic model instantiator. It works right now, but it expects that the user does not set the output: ACTIONS flag in the model definition. That's because the model needs two heads, one to output alpha and the other to output beta, whereas with the GaussianMixin we only need the mean (the std is a single parameter). See the sketch below.
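To make the two-head idea concrete, here is a minimal, hypothetical PyTorch sketch (names such as BetaPolicySketch are illustrative only, not this PR's actual classes or the skrl API):

    import torch
    import torch.nn as nn
    from torch.distributions import Beta

    class BetaPolicySketch(nn.Module):
        """Two-head network: one head for alpha, one for beta (illustrative only)."""
        def __init__(self, num_observations, num_actions, low, high):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_observations, 64), nn.ELU(),
                nn.Linear(64, 64), nn.ELU(),
            )
            self.alpha_head = nn.Linear(64, num_actions)  # concentration alpha
            self.beta_head = nn.Linear(64, num_actions)   # concentration beta
            self.register_buffer("low", torch.as_tensor(low, dtype=torch.float32))
            self.register_buffer("high", torch.as_tensor(high, dtype=torch.float32))

        def forward(self, states):
            features = self.net(states)
            # softplus + 1 keeps both concentrations above 1 (unimodal Beta)
            alpha = nn.functional.softplus(self.alpha_head(features)) + 1.0
            beta = nn.functional.softplus(self.beta_head(features)) + 1.0
            dist = Beta(alpha, beta)
            # Beta samples lie in (0, 1); rescale them to the action bounds
            actions = self.low + (self.high - self.low) * dist.rsample()
            return actions, dist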
In any case, I'd be more than happy to make any modifications you suggest. For now I only support PyTorch, since I don't have a JAX workflow to test things. On a side note, I'm also looking into adding a squashed Gaussian (SAC style) to the GaussianMixin to take bounded action spaces into account.
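For reference, the squashed Gaussian idea boils down to a tanh on the sampled action plus a change-of-variables correction on the log-probability; a small, hypothetical sketch (not part of this PR) would look like:

    import torch
    from torch.distributions import Normal

    def squashed_gaussian_sample(mean, log_std):
        """Sample a tanh-squashed Gaussian action and its corrected log-probability."""
        dist = Normal(mean, log_std.exp())
        pre_tanh = dist.rsample()
        actions = torch.tanh(pre_tanh)  # bounded to (-1, 1)
        # change-of-variables correction for the tanh squashing
        log_prob = dist.log_prob(pre_tanh) - torch.log(1.0 - actions.pow(2) + 1e-6)
        return actions, log_prob.sum(dim=-1)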
Let me know!
Cheers,
Antoine
Below is an example configuration for it from IsaacLab:
seed: 42

# Models are instantiated using skrl's model instantiator utility
# https://skrl.readthedocs.io/en/latest/api/utils/model_instantiators.html
models:
  separate: True
  policy:  # see gaussian_model parameters
    class: BetaMixin
    network:
      - name: net
        input: STATES
        layers: [64, 64]
        activations: elu
  value:  # see deterministic_model parameters
    class: DeterministicMixin
    clip_actions: False
    network:
      - name: net
        input: STATES
        layers: [64, 64]
        activations: elu
    output: ONE

# Rollout memory
# https://skrl.readthedocs.io/en/latest/api/memories/random.html
memory:
  class: RandomMemory
  memory_size: -1  # automatically determined (same as agent:rollouts)

# PPO agent configuration (field names are from PPO_DEFAULT_CONFIG)
# https://skrl.readthedocs.io/en/latest/api/agents/ppo.html
agent:
  class: PPO
  rollouts: 32
  learning_epochs: 8
  mini_batches: 8
  discount_factor: 0.99
  lambda: 0.95
  learning_rate: 5.0e-4
  learning_rate_scheduler: KLAdaptiveLR
  learning_rate_scheduler_kwargs:
    kl_threshold: 0.008
  state_preprocessor: RunningStandardScaler
  state_preprocessor_kwargs: null
  value_preprocessor: RunningStandardScaler
  value_preprocessor_kwargs: null
  random_timesteps: 0
  learning_starts: 0
  grad_norm_clip: 1.0
  ratio_clip: 0.2
  value_clip: 0.2
  clip_predicted_values: True
  entropy_loss_scale: 0.0
  value_loss_scale: 2.0
  kl_threshold: 0.0
  rewards_shaper_scale: 0.1
  time_limit_bootstrap: False
  # logging and checkpoint
  experiment:
    directory: "jetbot_direct"
    experiment_name: ""
    write_interval: auto
    checkpoint_interval: auto
    wandb: True              # whether to use Weights & Biases
    wandb_kwargs:            # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
      project: jetbot_direct
      entity: spacer-rl
      group: 'zeroG'
      notes: ''

# Sequential trainer
# https://skrl.readthedocs.io/en/latest/api/trainers/sequential.html
trainer:
  class: SequentialTrainer
  timesteps: 16000
  environment_info: log