POMDPs.jl
ExplorationPolicies don't work with stepthrough
I'm trying to sample beliefs using the implemented exploration policies (SoftmaxPolicy and EpsGreedyPolicy), but they don't work with stepthrough or the other simulators that I've tried.
Steps to recreate:
using POMDPs
using POMDPModels
using POMDPTools
pomdp = TigerPOMDP()
policy = EpsGreedyPolicy(pomdp, 0.05)
beliefs = [b for b in stepthrough(pomdp, policy, DiscreteUpdater(pomdp), "b", max_steps=20)]
Error:
ERROR: MethodError: no method matching action(::EpsGreedyPolicy{POMDPTools.Policies.var"#20#21"{…}, Random.TaskLocalRNG, TigerPOMDP}, ::DiscreteBelief{TigerPOMDP, Bool})
Closest candidates are:
action(::Starve, ::Any)
@ POMDPModels ~/.julia/packages/POMDPModels/eZX2K/src/CryingBabies.jl:65
action(::FeedWhenCrying, ::Any)
@ POMDPModels ~/.julia/packages/POMDPModels/eZX2K/src/CryingBabies.jl:85
action(::AlwaysFeed, ::Any)
@ POMDPModels ~/.julia/packages/POMDPModels/eZX2K/src/CryingBabies.jl:69
...
Stacktrace:
[1] action_info(p::EpsGreedyPolicy{…}, x::DiscreteBelief{…})
@ POMDPTools.ModelTools ~/.julia/packages/POMDPTools/7Rekv/src/ModelTools/info.jl:12
[2] iterate
@ ~/.julia/packages/POMDPTools/7Rekv/src/Simulators/stepthrough.jl:91 [inlined]
[3] iterate
@ ~/.julia/packages/POMDPTools/7Rekv/src/Simulators/stepthrough.jl:85 [inlined]
[4] iterate
@ ./generator.jl:44 [inlined]
[5] grow_to!
@ ./array.jl:907 [inlined]
[6] collect(itr::Base.Generator{POMDPTools.Simulators.POMDPSimIterator{…}, typeof(identity)})
@ Base ./array.jl:831
[7] top-level scope
@ REPL[6]:1
Some type information was truncated. Use `show(err)` to see complete types.
I'm not sure about the history of the ExplorationPolicy abstract type, but it doesn't look like it is designed to work with the built-in simulators like stepthrough.
Most of the simulators call action_info(policy, state) to get the action (note: action_info falls back to action(policy, state) and returns nothing for the info by default: link).
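For reference, the default method looks roughly like this (paraphrased from memory, not the exact source):

# Generic fallback: no extra info, just delegate to the two-argument `action`,
# which is exactly the method the exploration policies don't define.
action_info(policy::Policy, x) = (action(policy, x), nothing)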
From the documentation for the ExplorationPolicy type,
Sampling from an exploration policy is done using action(exploration_policy, on_policy, k, state), where k is used to determine the exploration parameter.
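Calling the exploration policy directly with that four-argument form is how it is meant to be used; for example (just a sketch, with RandomPolicy as an arbitrary on-policy and k = 1):

on_policy = RandomPolicy(pomdp)                                      # any fallback/on-policy works here
b0 = initialize_belief(DiscreteUpdater(pomdp), initialstate(pomdp))  # starting belief
a = action(policy, on_policy, 1, b0)                                 # k = 1 selects the exploration parameter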
Based on the current documentation, this behavior is expected. However, there is probably a good argument for redefining how the exploration policies are constructed so that on_policy and k are part of the struct; then action(policy::ExplorationPolicy, state) could be defined appropriately, per the quote above.
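A rough sketch of that idea (the name ExplorationWrapper is made up here; the real change would presumably go into the existing structs rather than a wrapper):

# Carry on_policy and k inside the policy so the two-argument `action`
# that the simulators call can delegate to the four-argument exploration interface.
struct ExplorationWrapper{E<:ExplorationPolicy, P<:Policy} <: POMDPs.Policy
    exploration::E   # e.g. an EpsGreedyPolicy
    on_policy::P     # the policy to exploit
    k::Int           # exploration-schedule step
end

POMDPs.action(p::ExplorationWrapper, s) = action(p.exploration, p.on_policy, p.k, s)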
Since I am not familiar with the development background here, I am not confident about secondary issues; it would be a breaking change, since we would be redefining the structs of those policies.
Also, reference #497
Yeah, the exploration policy interface was designed for reinforcement learning solvers where the exploration should be decayed, but it is not really a Policy. I would not object to a re-design of that interface.
If you just want an epsilon-greedy policy for a rollout, I'd recommend:
struct MyEpsGreedy{M, P} <: POMDPs.Policy
    pomdp::M
    original_policy::P
    epsilon::Float64
end

function POMDPs.action(p::MyEpsGreedy, s)
    if rand() < p.epsilon
        return rand(actions(p.pomdp))        # explore: uniformly random action
    else
        return action(p.original_policy, s)  # exploit: defer to the wrapped policy
    end
end

policy = MyEpsGreedy(pomdp, original_policy, 0.05)
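For example, with RandomPolicy(pomdp) as a placeholder on-policy (any Policy would do), this plugs straight into the stepthrough call from the top of the issue:

original_policy = RandomPolicy(pomdp)
policy = MyEpsGreedy(pomdp, original_policy, 0.05)
beliefs = [b for b in stepthrough(pomdp, policy, DiscreteUpdater(pomdp), "b", max_steps=20)]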
Closing. Please continue the discussion at https://github.com/JuliaPOMDP/POMDPs.jl/issues/497.