
ExplorationPolicies don't work with stepthrough


I'm trying to sample beliefs using the implemented exploration policies (SoftmaxPolicy and EpsGreedyPolicy), but they don't work with stepthrough or the other simulators that I've tried.

Steps to recreate:

using POMDPs
using POMDPModels
using POMDPTools

pomdp = TigerPOMDP()
policy = EpsGreedyPolicy(pomdp, 0.05)
beliefs = [b for b in stepthrough(pomdp, policy, DiscreteUpdater(pomdp), "b", max_steps=20)]

Error:

ERROR: MethodError: no method matching action(::EpsGreedyPolicy{POMDPTools.Policies.var"#20#21"{…}, Random.TaskLocalRNG, TigerPOMDP}, ::DiscreteBelief{TigerPOMDP, Bool})

Closest candidates are:
  action(::Starve, ::Any)
   @ POMDPModels ~/.julia/packages/POMDPModels/eZX2K/src/CryingBabies.jl:65
  action(::FeedWhenCrying, ::Any)
   @ POMDPModels ~/.julia/packages/POMDPModels/eZX2K/src/CryingBabies.jl:85
  action(::AlwaysFeed, ::Any)
   @ POMDPModels ~/.julia/packages/POMDPModels/eZX2K/src/CryingBabies.jl:69
  ...

Stacktrace:
 [1] action_info(p::EpsGreedyPolicy{…}, x::DiscreteBelief{…})
   @ POMDPTools.ModelTools ~/.julia/packages/POMDPTools/7Rekv/src/ModelTools/info.jl:12
 [2] iterate
   @ ~/.julia/packages/POMDPTools/7Rekv/src/Simulators/stepthrough.jl:91 [inlined]
 [3] iterate
   @ ~/.julia/packages/POMDPTools/7Rekv/src/Simulators/stepthrough.jl:85 [inlined]
 [4] iterate
   @ ./generator.jl:44 [inlined]
 [5] grow_to!
   @ ./array.jl:907 [inlined]
 [6] collect(itr::Base.Generator{POMDPTools.Simulators.POMDPSimIterator{…}, typeof(identity)})
   @ Base ./array.jl:831
 [7] top-level scope
   @ REPL[6]:1
Some type information was truncated. Use `show(err)` to see complete types.

FlyingWorkshop avatar Mar 06 '24 22:03 FlyingWorkshop

I'm not sure about the history of the ExplorationPolicy abstract type, but it doesn't look like it was designed to work with the built-in simulators like stepthrough.

Most of the simulators call action_info(policy, state) to get the action (note: by default, action_info simply calls action(policy, state) and returns nothing for the info; see ModelTools/info.jl in the stack trace above).
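
For context, the default fallback behaves roughly like the one-liner below (a paraphrase of the behavior described above, not the exact source):

# Paraphrased default fallback: delegate to the 2-argument action and return nothing for the info.
action_info(policy::Policy, x) = (action(policy, x), nothing)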

From the documentation for the ExplorationPolicy type,

Sampling from an exploration policy is done using action(exploration_policy, on_policy, k, state), where k is used to determine the exploration parameter.

Based on the current documentation, this behavior is expected. However, there is probably a good argument for redefining the exploration policies so that the on_policy and k are stored as part of the struct. Then we could define action(policy::ExplorationPolicy, state) appropriately, as sketched below.
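
For illustration, here is a minimal sketch of that idea (the wrapper name and the k field are my own placeholders, and it wraps rather than redefines the existing structs): keep the on_policy and k next to the exploration policy so the simulators' 2-argument action call can dispatch to the documented 4-argument one.

# Hypothetical wrapper (not part of POMDPTools): carries the on-policy and a fixed k
# so the standard 2-argument action call can reach the documented 4-argument method.
struct ExplorationWrapper{E, P} <: POMDPs.Policy
    exploration_policy::E   # e.g. an EpsGreedyPolicy
    on_policy::P            # the policy to follow when not exploring
    k::Int                  # value used to determine the exploration parameter
end

POMDPs.action(p::ExplorationWrapper, b) = action(p.exploration_policy, p.on_policy, p.k, b)

With something like this, stepthrough would call action(wrapper, b) and the exploration policy would be sampled as documented.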

Since I am not familiar with the background of the development here, I am not confident about possible secondary issues; redefining the structs of those policies would be a breaking change.

dylan-asmar avatar Mar 07 '24 23:03 dylan-asmar

Also, reference #497

dylan-asmar avatar Mar 07 '24 23:03 dylan-asmar

Yeah, the exploration policy interface was designed for reinforcement learning solvers where the exploration should be decayed, but it is not really a Policy. I would not object to a re-design of that interface.

If you just want an epsilon-greedy policy for a rollout, I'd recommend:

struct MyEpsGreedy{M, P} <: POMDPs.Policy
    pomdp::M                # (PO)MDP used to sample random actions from
    original_policy::P      # policy to follow when not exploring
    epsilon::Float64        # probability of taking a random action
end

function POMDPs.action(p::MyEpsGreedy, s)
    if rand() < p.epsilon
        return rand(actions(p.pomdp))          # explore: uniform random action
    else
        return action(p.original_policy, s)    # exploit: defer to the original policy
    end
end

policy = MyEpsGreedy(pomdp, original_policy, 0.05)   # original_policy is whatever base policy you have
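
For reference, a hypothetical end-to-end usage continuing the repro above, with a RandomPolicy from POMDPTools standing in as the base policy:

original_policy = RandomPolicy(pomdp)
policy = MyEpsGreedy(pomdp, original_policy, 0.05)
beliefs = [b for b in stepthrough(pomdp, policy, DiscreteUpdater(pomdp), "b", max_steps=20)]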

zsunberg avatar Mar 11 '24 20:03 zsunberg

Closing. Please continue the discussion at https://github.com/JuliaPOMDP/POMDPs.jl/issues/497.

dylan-asmar avatar Mar 23 '24 20:03 dylan-asmar