
SAC example experiment does not work

tyler-ingebrand opened this issue 2 years ago

Hello,

I attempted to run SAC using the example experiment provided at https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/v0.10.1/src/ReinforcementLearningExperiments/deps/experiments/experiments/Policy%20Gradient/JuliaRL_SAC_Pendulum.jl (slightly modified for clarity). It runs without error, but it does not learn a viable policy. I am not familiar with the implementation details of SAC and just wanted to try it out, so it may be a hyperparameter tuning issue or a bug. Here is the code:

using ReinforcementLearning
using StableRNGs
using Flux
using Flux.Losses
using IntervalSets
using CUDA

function RL.Experiment(
    ::Val{:JuliaRL},
    ::Val{:SAC},
    ::Val{:Pendulum},
    ::Nothing;
    save_dir=nothing,
    seed=123
)
    rng = StableRNG(seed)
    inner_env = PendulumEnv(T=Float32, rng=rng)
    action_dims = inner_env.n_actions
    A = action_space(inner_env)
    low = A.left
    high = A.right
    ns = length(state(inner_env))
    na = 1

    env = ActionTransformedEnv(
        inner_env;
        action_mapping=x -> low + (x[1] + 1) * 0.5 * (high - low)
    )
    init = glorot_uniform(rng)

    create_policy_net() = NeuralNetworkApproximator(
        model=GaussianNetwork(
            pre=Chain(
                Dense(ns, 30, relu, init=init),
                Dense(30, 30, relu, init=init),
            ),
            μ=Chain(Dense(30, na, init=init)),
            logσ=Chain(Dense(30, na, x -> clamp(x, typeof(x)(-10), typeof(x)(2)), init=init)),
        ),
        optimizer=ADAM(0.003),
    ) |> gpu

    create_q_net() = NeuralNetworkApproximator(
        model=Chain(
            Dense(ns + na, 30, relu; init=init),
            Dense(30, 30, relu; init=init),
            Dense(30, 1; init=init),
        ),
        optimizer=ADAM(0.003),
    ) |> gpu

    agent = Agent(
        policy=SACPolicy(
            policy=create_policy_net(),
            qnetwork1=create_q_net(),
            qnetwork2=create_q_net(),
            target_qnetwork1=create_q_net(),
            target_qnetwork2=create_q_net(),
            γ=0.99f0,
            τ=0.005f0,
            α=0.2f0,
            batch_size=64,
            start_steps=1000,
            start_policy=RandomPolicy(Space([-1.0 .. 1.0 for _ in 1:na]); rng=rng),
            update_after=1000,
            update_freq=1,
            automatic_entropy_tuning=true,
            action_dims=action_dims,
            rng=rng,
            device_rng=CUDA.functional() ? CUDA.CURAND.RNG() : rng
        ),
        trajectory=CircularArraySARTTrajectory(
            capacity=10000,
            state=Vector{Float32} => (ns,),
            action=Vector{Float32} => (na,),
        ),
    )

    stop_condition = StopAfterStep(10_000, is_show_progress=!haskey(ENV, "CI"))
    hook = TotalRewardPerEpisode()
    Experiment(agent, env, stop_condition, hook, "# Play Pendulum with SAC")
end

using Plots
ex = E`JuliaRL_SAC_Pendulum`
run(ex)
plot(ex.hook.rewards)

showTheThing(t, agent, env) = plot(env.env)
run(ex.policy, ex.env, StopAfterEpisode(10), DoEveryNStep(showTheThing))

Here are my versions:

Julia 1.7.2
  [052768ef] CUDA v3.11.0
  [587475ba] Flux v0.12.10
  [158674fc] ReinforcementLearning v0.10.0
  [e575027e] ReinforcementLearningBase v0.9.7
  [de1b191a] ReinforcementLearningCore v0.8.11
  [25e41dd2] ReinforcementLearningEnvironments v0.6.12
 

Any ideas on the cause?

tyler-ingebrand avatar Jul 18 '22 16:07 tyler-ingebrand

I guess the parameters need tuning.

findmyway avatar Jul 19 '22 02:07 findmyway

@tyler-ingebrand, you can test that hypothesis by copying the hyperparameters from the experiment I found here: https://github.com/zhihanyang2022/pytorch-sac. If identical hyperparameters do not solve it, then a problem in the implementation of SAC is the likely explanation.
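
For concreteness, a sketch of what that could look like as a drop-in replacement inside the experiment above; it reuses the ns, na, and init variables defined there, and the values are common SAC defaults (256-unit layers, a 3e-4 learning rate, batch size 256, a larger replay buffer) rather than anything verified against that repository:

# Sketch only: replaces the corresponding definitions in the experiment above.
# Hyperparameters are common SAC defaults, not values confirmed to work here.
create_policy_net() = NeuralNetworkApproximator(
    model=GaussianNetwork(
        pre=Chain(
            Dense(ns, 256, relu, init=init),
            Dense(256, 256, relu, init=init),
        ),
        μ=Chain(Dense(256, na, init=init)),
        logσ=Chain(Dense(256, na, x -> clamp(x, typeof(x)(-10), typeof(x)(2)), init=init)),
    ),
    optimizer=ADAM(0.0003),   # 3e-4 instead of 3e-3
) |> gpu

create_q_net() = NeuralNetworkApproximator(
    model=Chain(
        Dense(ns + na, 256, relu; init=init),
        Dense(256, 256, relu; init=init),
        Dense(256, 1; init=init),
    ),
    optimizer=ADAM(0.0003),
) |> gpu

# ...and, in the SACPolicy and trajectory above:
#   batch_size=256, update_after=1000, update_freq=1
#   CircularArraySARTTrajectory(capacity=1_000_000, ...)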

HenriDeh avatar Jul 19 '22 08:07 HenriDeh

Have there been any developments around this issue? I am experiencing similar difficulties: the agent fails to learn a viable policy, and I haven't been able to identify any bugs in the code.

In all my runs the reward graph is essentially the same as in the example, with no upward improvement. (The reward curve also never really declines, so I wonder if that's related; it is seemingly stationary.)

I've tried a number of things:

  • running the example above for many more episodes (see the sketch after this list)
  • tweaking the hyperparameters and checking the default options in SB3 and Spinning Up
  • running SAC on my own environment
  • modifying the code to follow conventions similar to those in the TD3 implementation (such as the TD3Critic type)
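
For the first point, the change amounts to something like the following sketch, reusing the run/plot calls from the example above (the episode count is an arbitrary illustration):

# Sketch: rebuild the experiment and run it for more episodes before plotting.
# Assumes the packages and the Experiment definition from the example above.
ex = E`JuliaRL_SAC_Pendulum`
run(ex.policy, ex.env, StopAfterEpisode(1_000), ex.hook)
plot(ex.hook.rewards)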

Thank you.

NPLawrence avatar Sep 09 '22 22:09 NPLawrence

Any news on this? It has been closed, but I can't figure out a way to make it learn anything valuable...

yosinlpet avatar May 24 '23 13:05 yosinlpet

No, I closed it because it's inactive and we don't currently have enough people working on this package to 1) finish the refactor and 2) investigate and handle older issues. Feel free to reopen this, but ideally either because someone wants to take responsibility for investigating and fixing the issue, or with the short-term decision to drop the policy / example from the library.

jeremiahpslewis avatar May 24 '23 14:05 jeremiahpslewis

@HenriDeh What do you think? I know it sounds harsh, but I don't believe that keeping non-working and non-maintained code in the package is of much benefit to users...

jeremiahpslewis avatar May 24 '23 14:05 jeremiahpslewis

Yes, we need the right person to do this, and in fact one for each of the algorithms in the zoo. For each algorithm we need someone knowledgeable about both the algorithm itself (I don't know SAC very well, only the underlying idea) AND about the refactor (I think three, maybe four of us fit that description). This is all the more difficult because the refactor is not documented, which means one needs to read the source code to get the gist of it. I wish I had time for this, but that's not the case at the moment, and regardless, this is just one of the algorithms that are broken. Getting this to work is also already in the scope of #614, I'd say.

HenriDeh avatar May 30 '23 09:05 HenriDeh

I might suggest trying the master version, though, as I recall fixing a bug in GaussianNetwork a while ago that may improve the correctness of learning. But my guess is that the issue is more profound than that.
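
Something along these lines, for example (a sketch; the GaussianNetwork code is in ReinforcementLearningCore, so its master branch is probably what matters, and compat bounds may need adjusting):

# Sketch (untested): track the unreleased master revisions instead of the registered releases.
using Pkg
Pkg.add(name="ReinforcementLearning", rev="master")
Pkg.add(name="ReinforcementLearningCore", rev="master")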

HenriDeh avatar May 30 '23 09:05 HenriDeh