ReinforcementLearning.jl
policy(env) returns illegal action with -Inf-initialized Q-table
When initializing a Q-table with `-Inf`, it appears that `EpsilonGreedyExplorer` can return an action that is not legal. An MWE follows. First, define a custom environment similar to `RandomWalk1D`:
```julia
using ReinforcementLearning

Base.@kwdef mutable struct MyRandomWalk1D <: AbstractEnv
    rewards::Pair{Float64,Float64} = -1.0 => 1.0
    N::Int = 7
    actions::Vector{Int} = [-1, 1]
    start_pos::Int = (N + 1) ÷ 2
    pos::Int = start_pos
end

RLBase.action_space(env::MyRandomWalk1D) = Base.OneTo(length(env.actions))

function (env::MyRandomWalk1D)(action)
    env.pos = max(min(env.pos + env.actions[action], env.N), 1)
end

RLBase.state(env::MyRandomWalk1D) = env.pos
RLBase.state_space(env::MyRandomWalk1D) = Base.OneTo(env.N)
RLBase.is_terminated(env::MyRandomWalk1D) = env.pos == 1 || env.pos == env.N
RLBase.reset!(env::MyRandomWalk1D) = env.pos = env.start_pos

function RLBase.reward(env::MyRandomWalk1D)
    if env.pos == 1
        first(env.rewards)
    elseif env.pos == env.N
        last(env.rewards)
    else
        0.0
    end
end

RLBase.NumAgentStyle(::MyRandomWalk1D) = SINGLE_AGENT
RLBase.DynamicStyle(::MyRandomWalk1D) = SEQUENTIAL
RLBase.InformationStyle(::MyRandomWalk1D) = PERFECT_INFORMATION
RLBase.StateStyle(::MyRandomWalk1D) = Observation{Int}()
RLBase.RewardStyle(::MyRandomWalk1D) = TERMINAL_REWARD
RLBase.UtilityStyle(::MyRandomWalk1D) = GENERAL_SUM
RLBase.ChanceStyle(::MyRandomWalk1D) = DETERMINISTIC
```
Basically, the only differences are:
```julia
RLBase.ActionStyle(::MyRandomWalk1D) = FULL_ACTION_SET

function RLBase.legal_action_space(env::MyRandomWalk1D, _)
    findall(legal_action_space_mask(env))
end

function RLBase.legal_action_space_mask(env::MyRandomWalk1D, _)
    [false, true]
end
```
So the agent is now only allowed to go right:
```julia
julia> legal_action_space(env)
1-element Vector{Int64}:
 2
```
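For completeness, the mask can be checked too (assuming RLBase's one-argument convenience method forwards to the two-argument one defined above):

```julia
legal_action_space_mask(env)  # expected: [false, true], i.e. only action 2 is legal
```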
Now create a policy:

```julia
env = MyRandomWalk1D()
NS, NA = length(state_space(env)), length(action_space(env))

policy = QBasedPolicy(
    learner = MonteCarloLearner(;
        approximator = TabularQApproximator(;
            n_state = NS,
            n_action = NA,
            init = -Inf,
            opt = InvDecay(1.0)
        )
    ),
    explorer = EpsilonGreedyExplorer(0.01)
)
```
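With `init = -Inf`, every entry of the Q-table starts at `-Inf`, so in any state the greedy step has to break a tie in which the illegal action looks exactly as good as the legal one. Conceptually (a plain-Julia sketch, not RL.jl's internal layout):

```julia
# Sketch only: what the initial value table looks like conceptually.
NS, NA = 7, 2
Q = fill(-Inf, NA, NS)   # every state-action value starts at -Inf
Q[:, 4]                  # both actions tie at -Inf in the start state
```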
And poll it maaaany times:
```julia
julia> any([policy(env) == 1 for _ in 1:1_000_000])  # this should return false
true
```
`policy(env)` should never return `1`, since it is an illegal action.
I couldn't reproduce this with a Q-table initialized with `init = 0.0`.
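A plausible mechanism, illustrated in plain Julia (this is not RL.jl's actual code): with an all-`-Inf` row, both an unmasked `argmax` and a masked scan that compares against a running maximum initialized to `-Inf` with a strict `>` end up returning the first index, which here is illegal. The strict-comparison variant would also explain why `init = 0.0` doesn't reproduce the bug:

```julia
# Illustrative only -- not ReinforcementLearning.jl's implementation.
values = fill(-Inf, 2)   # Q-values for both actions in some state
mask   = [false, true]   # only action 2 is legal

# (1) Ignoring the mask: the all -Inf tie resolves to the first index.
argmax(values)           # == 1 (illegal)

# (2) A masked scan with a strict comparison against a -Inf sentinel:
# no legal value ever beats the sentinel, so the fallback index survives.
function masked_greedy_strict(values, mask)
    best, ind = -Inf, 1
    for i in eachindex(values)
        if mask[i] && values[i] > best   # `>` never fires when all values are -Inf
            best, ind = values[i], i
        end
    end
    return ind
end

masked_greedy_strict(values, mask)       # == 1 (illegal)
masked_greedy_strict(zeros(2), mask)     # == 2: consistent with init = 0.0 working

# A reduction restricted to the legal indices handles the tie correctly:
argmax(i -> values[i], findall(mask))    # == 2 (legal)
```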
Can you look at the `EpsilonGreedyExplorer(0.01)` object and see if you can narrow the MWE down to just that object? (https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/explorers/epsilon_greedy_explorer.jl) Might it have something to do with the mask not being applied correctly?
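Something along these lines might narrow it down (an untested sketch; it assumes the `(values, mask)` call form defined in that file):

```julia
# Hypothetical narrowed MWE -- assumes the explorer's (values, mask) method
# from epsilon_greedy_explorer.jl linked above.
using ReinforcementLearning

ex = EpsilonGreedyExplorer(0.01)
values = fill(-Inf, 2)
mask = [false, true]
any(ex(values, mask) == 1 for _ in 1:1_000_000)  # expected false; true reproduces the bug
```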
@filchristou This is a nice catch... ultimately, it would be awesome if we could get a fix & a new unit test out of this issue. :)
Probably the mask is not applied, yes. Maybe it simply doesn't implement the trait? I can take a look and try to do a PR. (Last time I tried, I got a bit overwhelmed by the complexity :grimacing: and the lack of easy local testing was disarming.)
Hopefully #843 will lead us to a better solution for local testing, but in the meantime `ReinforcementLearning.activate_devmode!()` should be working again.
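That is, from a local clone of the repository:

```julia
using ReinforcementLearning
ReinforcementLearning.activate_devmode!()
```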