
policy(env) returns illegal action with -Inf-initialized Q-table


When the Q-table is initialized with -Inf, it looks like EpsilonGreedyExplorer can return actions that are not legal.

An MWE follows. Define a custom environment similar to RandomWalk1D:

using ReinforcementLearning   # brings AbstractEnv, RLBase, etc. into scope

Base.@kwdef mutable struct MyRandomWalk1D <: AbstractEnv
    rewards::Pair{Float64,Float64} = -1.0 => 1.0
    N::Int = 7
    actions::Vector{Int} = [-1, 1]
    start_pos::Int = (N + 1) ÷ 2
    pos::Int = start_pos
end

RLBase.action_space(env::MyRandomWalk1D) = Base.OneTo(length(env.actions))

function (env::MyRandomWalk1D)(action)
    env.pos = max(min(env.pos + env.actions[action], env.N), 1)
end

RLBase.state(env::MyRandomWalk1D) = env.pos
RLBase.state_space(env::MyRandomWalk1D) = Base.OneTo(env.N)
RLBase.is_terminated(env::MyRandomWalk1D) = env.pos == 1 || env.pos == env.N
RLBase.reset!(env::MyRandomWalk1D) = env.pos = env.start_pos

function RLBase.reward(env::MyRandomWalk1D)
    if env.pos == 1
        first(env.rewards)
    elseif env.pos == env.N
        last(env.rewards)
    else
        0.0
    end
end

RLBase.NumAgentStyle(::MyRandomWalk1D) = SINGLE_AGENT
RLBase.DynamicStyle(::MyRandomWalk1D) = SEQUENTIAL

RLBase.InformationStyle(::MyRandomWalk1D) = PERFECT_INFORMATION
RLBase.StateStyle(::MyRandomWalk1D) = Observation{Int}()
RLBase.RewardStyle(::MyRandomWalk1D) = TERMINAL_REWARD
RLBase.UtilityStyle(::MyRandomWalk1D) = GENERAL_SUM
RLBase.ChanceStyle(::MyRandomWalk1D) = DETERMINISTIC

Basically, the only differences are:

RLBase.ActionStyle(::MyRandomWalk1D) = FULL_ACTION_SET

function RLBase.legal_action_space(env::MyRandomWalk1D, _)
    findall(legal_action_space_mask(env))
end

function RLBase.legal_action_space_mask(ge::MyRandomWalk1D, _)
    [false, true]
end

So basically, the agent is now only allowed to go right:

julia> legal_action_space(env)
1-element Vector{Int64}:
 2

Now create a policy.

env = MyRandomWalk1D()
NS, NA = length(state_space(env)), length(action_space(env))

using Flux: InvDecay   # InvDecay comes from Flux.Optimise

policy = QBasedPolicy(
    learner = MonteCarloLearner(;
        approximator = TabularQApproximator(;
            n_state = NS,
            n_action = NA,
            init = -Inf,
            opt = InvDecay(1.0),
        ),
    ),
    explorer = EpsilonGreedyExplorer(0.01),
)

And poll it maaaany times:

julia> any([policy(env) == 1 for _ in 1:1_000_000]) # this should return false
true

policy(env) should never return 1, since it is an illegal action.

I couldn't reproduce this with a Q-table initialized with init = 0.0.

filchristou avatar Mar 24 '23 15:03 filchristou

Can you look at the EpsilonGreedyExplorer(0.01) object and see if you can narrow down the minimum working example to just this object? (https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/explorers/epsilon_greedy_explorer.jl) Might it have something to do with not correctly applying the mask?
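For instance, something along these lines might isolate it (a rough sketch; whether EpsilonGreedyExplorer is callable directly on a value vector, with an optional mask, depends on the RLCore version, so treat these calls as assumptions):

# Hypothetical probe, not from the report above: feed the explorer the same
# -Inf-initialized Q-values directly, with and without the legal-action mask.
explorer = EpsilonGreedyExplorer(0.01)
values   = fill(-Inf, 2)      # what a TabularQApproximator with init = -Inf returns
mask     = [false, true]      # only action 2 is legal

any(explorer(values) == 1 for _ in 1:1_000_000)        # unmasked call
any(explorer(values, mask) == 1 for _ in 1:1_000_000)  # masked call; should be false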

jeremiahpslewis avatar Mar 24 '23 16:03 jeremiahpslewis

@filchristou This is a nice catch...ultimately it would be awesome if we can get a fix & a new unit test out of this issue. :)

jeremiahpslewis avatar Mar 24 '23 16:03 jeremiahpslewis

Probably the mask is not applied, yes. Maybe it simply doesn't implement the trait(?). I can take a look and try to do a PR. (Last time I tried I got a bit overwhelmed by the complexity :grimacing: and the lack of (easy) local testing was disarming.)
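A quick way to double-check the environment side from the REPL, using only names from the MWE above (nothing here is package-internal):

ActionStyle(env)               # expected: FULL_ACTION_SET, so the trait is declared
legal_action_space_mask(env)   # expected: [false, true]
# If both look right, the suspicion shifts to QBasedPolicy / EpsilonGreedyExplorer
# never receiving this mask, i.e. the unmasked code path being taken.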

filchristou avatar Mar 24 '23 17:03 filchristou

Hopefully #843 will lead us to a better solution for local testing, but in the meantime ReinforcementLearning.activate_devmode!() should be working again.
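Roughly, the workflow would be something like this (a sketch; it assumes activate_devmode!() devs the local sub-packages into the active environment so they can be tested by name):

using ReinforcementLearning
ReinforcementLearning.activate_devmode!()   # dev the local sub-packages

using Pkg
Pkg.test("ReinforcementLearningCore")       # run the core test suite locally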

jeremiahpslewis avatar Mar 24 '23 18:03 jeremiahpslewis