Beta Policy
Hi there,
With this PR I propose to add a "Beta Policy". This policy is naturally bounded, which provides nice guarantees when learning on constrained action spaces.
I had some issues with the automatic model instantiator. It works right now, but it expects that the user does not set the output: ACTIONS flag in the model definition. That's because the model needs two heads, one to output alpha and the other to output beta, whereas with the GaussianMixin we only need the mean (the std is a single parameter). See the sketch below.
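To make the two-head idea concrete, here is a minimal, hypothetical PyTorch sketch (names such as BetaPolicySketch are illustrative only, not this PR's actual classes or the skrl API):

    import torch
    import torch.nn as nn
    from torch.distributions import Beta

    class BetaPolicySketch(nn.Module):
        """Two-head network: one head for alpha, one for beta (illustrative only)."""
        def __init__(self, num_observations, num_actions, low, high):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_observations, 64), nn.ELU(),
                nn.Linear(64, 64), nn.ELU(),
            )
            self.alpha_head = nn.Linear(64, num_actions)  # concentration alpha
            self.beta_head = nn.Linear(64, num_actions)   # concentration beta
            self.register_buffer("low", torch.as_tensor(low, dtype=torch.float32))
            self.register_buffer("high", torch.as_tensor(high, dtype=torch.float32))

        def forward(self, states):
            features = self.net(states)
            # softplus + 1 keeps both concentrations above 1 (unimodal Beta)
            alpha = nn.functional.softplus(self.alpha_head(features)) + 1.0
            beta = nn.functional.softplus(self.beta_head(features)) + 1.0
            dist = Beta(alpha, beta)
            # Beta samples lie in (0, 1); rescale them to the action bounds
            actions = self.low + (self.high - self.low) * dist.rsample()
            return actions, dist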
In any case, I'd be more than happy to make any modifications you suggest. For now I only support PyTorch, since I don't have a JAX workflow to test things. On a side note, I'm also looking into adding a squashed Gaussian (SAC style) to the GaussianMixin to take bounded action spaces into account.
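For reference, the squashed Gaussian idea boils down to a tanh on the sampled action plus a change-of-variables correction on the log-probability; a small, hypothetical sketch (not part of this PR) would look like:

    import torch
    from torch.distributions import Normal

    def squashed_gaussian_sample(mean, log_std):
        """Sample a tanh-squashed Gaussian action and its corrected log-probability."""
        dist = Normal(mean, log_std.exp())
        pre_tanh = dist.rsample()
        actions = torch.tanh(pre_tanh)  # bounded to (-1, 1)
        # change-of-variables correction for the tanh squashing
        log_prob = dist.log_prob(pre_tanh) - torch.log(1.0 - actions.pow(2) + 1e-6)
        return actions, log_prob.sum(dim=-1)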
Let me know!
Cheers,
Antoine
Below is an example configuration for it from IsaacLab:
seed: 42

# Models are instantiated using skrl's model instantiator utility
# https://skrl.readthedocs.io/en/latest/api/utils/model_instantiators.html
models:
  separate: True
  policy:  # see gaussian_model parameters
    class: BetaMixin
    network:
      - name: net
        input: STATES
        layers: [64, 64]
        activations: elu
  value:  # see deterministic_model parameters
    class: DeterministicMixin
    clip_actions: False
    network:
      - name: net
        input: STATES
        layers: [64, 64]
        activations: elu
    output: ONE

# Rollout memory
# https://skrl.readthedocs.io/en/latest/api/memories/random.html
memory:
  class: RandomMemory
  memory_size: -1  # automatically determined (same as agent:rollouts)

# PPO agent configuration (field names are from PPO_DEFAULT_CONFIG)
# https://skrl.readthedocs.io/en/latest/api/agents/ppo.html
agent:
  class: PPO
  rollouts: 32
  learning_epochs: 8
  mini_batches: 8
  discount_factor: 0.99
  lambda: 0.95
  learning_rate: 5.0e-4
  learning_rate_scheduler: KLAdaptiveLR
  learning_rate_scheduler_kwargs:
    kl_threshold: 0.008
  state_preprocessor: RunningStandardScaler
  state_preprocessor_kwargs: null
  value_preprocessor: RunningStandardScaler
  value_preprocessor_kwargs: null
  random_timesteps: 0
  learning_starts: 0
  grad_norm_clip: 1.0
  ratio_clip: 0.2
  value_clip: 0.2
  clip_predicted_values: True
  entropy_loss_scale: 0.0
  value_loss_scale: 2.0
  kl_threshold: 0.0
  rewards_shaper_scale: 0.1
  time_limit_bootstrap: False
  # logging and checkpoint
  experiment:
    directory: "jetbot_direct"
    experiment_name: ""
    write_interval: auto
    checkpoint_interval: auto
    wandb: True              # whether to use Weights & Biases
    wandb_kwargs:            # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
      project: jetbot_direct
      entity: spacer-rl
      group: 'zeroG'
      notes: ''

# Sequential trainer
# https://skrl.readthedocs.io/en/latest/api/trainers/sequential.html
trainer:
  class: SequentialTrainer
  timesteps: 16000
  environment_info: log