ReinforcementLearning.jl
SAC example experiment does not work
Hello,
I attempted to run SAC from the example experiment provided at https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/v0.10.1/src/ReinforcementLearningExperiments/deps/experiments/experiments/Policy%20Gradient/JuliaRL_SAC_Pendulum.jl (slightly modified for clarity). It runs without error but does not learn a viable policy. I am not familiar with the implementation details of SAC; I just wanted to try it out. It may be a hyperparameter tuning issue, or a bug. Here is the code:
using ReinforcementLearning
using StableRNGs
using Flux
using Flux.Losses
using IntervalSets
using CUDA

function RL.Experiment(
    ::Val{:JuliaRL},
    ::Val{:SAC},
    ::Val{:Pendulum},
    ::Nothing;
    save_dir=nothing,
    seed=123
)
    rng = StableRNG(seed)
    inner_env = PendulumEnv(T=Float32, rng=rng)
    action_dims = inner_env.n_actions
    A = action_space(inner_env)
    low = A.left
    high = A.right
    ns = length(state(inner_env))
    na = 1

    # map the policy's action in [-1, 1] to the environment's action range [low, high]
    env = ActionTransformedEnv(
        inner_env;
        action_mapping=x -> low + (x[1] + 1) * 0.5 * (high - low)
    )
    init = glorot_uniform(rng)

    create_policy_net() = NeuralNetworkApproximator(
        model=GaussianNetwork(
            pre=Chain(
                Dense(ns, 30, relu, init=init),
                Dense(30, 30, relu, init=init),
            ),
            μ=Chain(Dense(30, na, init=init)),
            logσ=Chain(Dense(30, na, x -> clamp(x, typeof(x)(-10), typeof(x)(2)), init=init)),
        ),
        optimizer=ADAM(0.003),
    ) |> gpu

    create_q_net() = NeuralNetworkApproximator(
        model=Chain(
            Dense(ns + na, 30, relu; init=init),
            Dense(30, 30, relu; init=init),
            Dense(30, 1; init=init),
        ),
        optimizer=ADAM(0.003),
    ) |> gpu

    agent = Agent(
        policy=SACPolicy(
            policy=create_policy_net(),
            qnetwork1=create_q_net(),
            qnetwork2=create_q_net(),
            target_qnetwork1=create_q_net(),
            target_qnetwork2=create_q_net(),
            γ=0.99f0,
            τ=0.005f0,
            α=0.2f0,
            batch_size=64,
            start_steps=1000,   # collect the first steps with the random start_policy
            start_policy=RandomPolicy(Space([-1.0 .. 1.0 for _ in 1:na]); rng=rng),
            update_after=1000,
            update_freq=1,
            automatic_entropy_tuning=true,
            action_dims=action_dims,
            rng=rng,
            device_rng=CUDA.functional() ? CUDA.CURAND.RNG() : rng
        ),
        # replay buffer
        trajectory=CircularArraySARTTrajectory(
            capacity=10000,
            state=Vector{Float32} => (ns,),
            action=Vector{Float32} => (na,),
        ),
    )

    stop_condition = StopAfterStep(10_000, is_show_progress=!haskey(ENV, "CI"))
    hook = TotalRewardPerEpisode()
    Experiment(agent, env, stop_condition, hook, "# Play Pendulum with SAC")
end

using Plots
ex = E`JuliaRL_SAC_Pendulum`
run(ex)
plot(ex.hook.rewards)

# render the environment while running the trained policy
showTheThing(t, agent, env) = plot(env.env)
run(ex.policy, ex.env, StopAfterEpisode(10), DoEveryNStep(showTheThing))
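To put a number on "not learning a viable policy", here is a minimal evaluation sketch, reusing the same run pattern as above rather than anything from the original experiment, that averages the episodic return instead of only plotting it:

using Statistics
eval_hook = TotalRewardPerEpisode()
# caveat: this follows the same calling pattern as the visualization run above,
# so the Agent may keep collecting data and updating while it runs
run(ex.policy, ex.env, StopAfterEpisode(10), eval_hook)
println("mean episodic return over 10 episodes: ", mean(eval_hook.rewards))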
Here are my versions:
Julia 1.7.2
[052768ef] CUDA v3.11.0
[587475ba] Flux v0.12.10
[158674fc] ReinforcementLearning v0.10.0
[e575027e] ReinforcementLearningBase v0.9.7
[de1b191a] ReinforcementLearningCore v0.8.11
[25e41dd2] ReinforcementLearningEnvironments v0.6.12
Any ideas on the cause?
I guess the parameters need tuning
@tyler-ingebrand, you can test that hypothesis by copying the hyperparameters from the experiment I found here: https://github.com/zhihanyang2022/pytorch-sac
If identical HPs do not solve this, then a problem in the implementation of SAC is the likely explanation.
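For concreteness, here is a hedged sketch of what such a hyperparameter swap could look like in the experiment above. The values are commonly used SAC defaults (wider 256-unit layers, a 3e-4 learning rate, a larger batch size and replay buffer), not necessarily the exact settings of the linked repository, so check that repo before relying on them:

create_policy_net() = NeuralNetworkApproximator(
    model=GaussianNetwork(
        pre=Chain(
            Dense(ns, 256, relu, init=init),
            Dense(256, 256, relu, init=init),
        ),
        μ=Chain(Dense(256, na, init=init)),
        logσ=Chain(Dense(256, na, x -> clamp(x, typeof(x)(-10), typeof(x)(2)), init=init)),
    ),
    optimizer=ADAM(0.0003),   # 3e-4 instead of 3e-3
) |> gpu

create_q_net() = NeuralNetworkApproximator(
    model=Chain(
        Dense(ns + na, 256, relu; init=init),
        Dense(256, 256, relu; init=init),
        Dense(256, 1; init=init),
    ),
    optimizer=ADAM(0.0003),
) |> gpu

# and, in the same spirit, in SACPolicy / the trajectory:
#   batch_size=256, update_after=1000, start_steps=1000
#   CircularArraySARTTrajectory(capacity=100_000, ...)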
Have there been any developments around this issue? I am experiencing similar difficulties, where the agent fails to learn a viable policy. I haven't been able to identify any bugs in the code.
In all my runs the reward graph is essentially the same as in the example, without any upward improvement. (The reward curve also never really declines, so I wonder if that's related; it's seemingly stationary.)
I've tried a number of things:
- running the example above for many more episodes
- tweaking the hyperparameters and checking the default options in SB3 and Spinning Up
- running SAC on my own environment
- modifying the code to follow similar conventions used in the TD3 implementation (such as the TD3Critic type)
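To make the "seemingly stationary" reward curve above easier to judge, a rolling mean over the recorded per-episode rewards can help; a minimal sketch, assuming the TotalRewardPerEpisode hook from the example:

using Statistics
window = 20   # rolling-mean window, in episodes
rewards = ex.hook.rewards
smoothed = [mean(rewards[max(1, i - window + 1):i]) for i in eachindex(rewards)]
plot(smoothed)   # any slow upward trend should be visible here even with noisy episodes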
Thank you.
Any news on this? It has been closed, but I can't figure out a way to make it learn anything valuable...
No, I closed it because it's inactive and we don't currently have enough people working on this package to 1) finish off the refactor and 2) investigate and handle older issues. Feel free to reopen this, but ideally either because someone wants to take responsibility for investigating and fixing the issue, or with the short-term decision to drop the policy / example from the library.
@HenriDeh What do you think? I know it sounds harsh, but I don't believe that keeping non-working and non-maintained code in the package is of much benefit to users...
Yes, we need the right person to do this, and one for each of the algorithms in the zoo, actually. In fact, for each algorithm we need someone knowledgeable about both the algorithm (I don't know SAC very well, only the underlying idea) AND about the refactor (I think we are three, maybe four, in that case). This is all the more difficult because the refactor is not documented, which means one needs to read the source code to get the gist of it. I wish I had time for this, but I don't at the moment, and regardless, this is only one of the algorithms that are broken. Getting this to work is also already within the scope of #614, I'd say.
Although I might suggest trying the master version, as I recall fixing a bug in GaussianNetwork a while ago that may improve the correctness of learning. But my guess is that the issue is more profound than that.
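For anyone who wants to try that, a minimal sketch of tracking the development branch; note that the project is a monorepo, so the sub-package that actually carries the GaussianNetwork fix may need to be tracked individually as well:

using Pkg
# track the unreleased master branch of the meta-package
Pkg.add(PackageSpec(name="ReinforcementLearning", rev="master"))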