
estimate vs. basis in policies

Open baedan opened this issue 3 years ago • 14 comments

today i was trying to estimate the state values of a policy using off-policy n-step TD. as far as i can tell, i need to use a VBasedPolicy to represent my target policy, whose learner would hold the state value estimates. i would then define the policy entirely in VBasedPolicy.mapping. in other words, this VBasedPolicy would not actually be based on state values; the state value learner is instead repurposed for estimation. the same applies if i were to estimate action values.
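
to make it concrete, i mean something like this (the constructor arguments here are approximate, just to show the shape of it):

VBasedPolicy(
    learner = TDLearner(
        approximator = TabularVApproximator(; n_state = NS, opt = Descent(0.1)),  # NS = number of states
        method = :SRS,  # state-reward-state updates, i.e. value prediction
        n = 5           # n-step TD
    ),
    mapping = my_target_policy_mapping  # the actual (stationary) policy i want to evaluate
)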

this is a problem because the usual interfaces of AbstractPolicy cannot be used -- in particular, RLBase.prob(), which is critical for calculating the importance-sampling ratio during off-policy evaluation. and if i want to estimate action values while off-policy, i don't know what i would do: where would my target policy be specified?

how do you think this can be handled? maybe have a field in Agent() that specifies whether it's running in prediction/estimate mode and what kind of estimate it's running, and put the estimate in a hook? idk, that seems a bit ugly. the logic for estimation and improvement is so similar that there has to be a more elegant solution, i feel.

baedan avatar Jun 08 '22 11:06 baedan

You can take a look at OffPolicy:

https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/2e1de3e5b6b8224f50b3d11bba7e1d2d72c6ef7c/src/ReinforcementLearningZoo/src/algorithms/tabular/off_policy.jl#L3-L6

It's not easy to tell which mode the policy is running in, so I usually explicitly extract the behavior policy or target policy when necessary.

findmyway avatar Jun 09 '22 02:06 findmyway

i know about OffPolicy, but i don't see how it helps here. if i were to estimate the action or state values of a particular policy, both the estimates and the policy itself would need to be stored in π_target.

https://github.com/JuliaReinforcementLearning/ReinforcementLearningAnIntroduction.jl/blob/c0d1d2038152188f4a4ad367b573f8cfb4243530/notebooks/Chapter05_Blackjack.jl#L299-L320

as an example, in order to use the MC method for offline evaluation of state values, you would set π_target to a VBasedPolicy, whose learner stores the tables used for estimation and whose mapping is the actual policy being evaluated.

https://github.com/JuliaReinforcementLearning/ReinforcementLearningAnIntroduction.jl/blob/c0d1d2038152188f4a4ad367b573f8cfb4243530/notebooks/Chapter05_Blackjack.jl#L285-L296

and RLBase.prob() would need to be defined ad hoc, as in the snippet linked above.
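
for a deterministic target mapping, the idea boils down to something like this (a rough sketch of the shape, not the notebook's exact code -- the linked snippet has the real signature and dispatch):

# the target policy is deterministic: probability 1 for the action the mapping
# would pick, 0 for everything else
function RLBase.prob(p::VBasedPolicy{<:MonteCarloLearner,typeof(target_policy_mapping)}, env, a)
    a == p.mapping(env, p.learner) ? 1.0 : 0.0
end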

as far as i can tell, none of this would be possible if i wanted to estimate the action values of a policy while off-policy, right? QBasedPolicy does not have a field where i could define a mapping ad hoc.

baedan avatar Jun 09 '22 07:06 baedan

I think the problem here is specific to the VBasedPolicy.

With a state value estimator only, we don't know how to generate actions (or an action distribution); that's why we defined a companion mapping in it. Considering that VBasedPolicy is overly general, it's hard to define a default implementation of RLBase.prob. So for each concrete instance, we have to define the interface manually (I guess that's what you mean by ad hoc above).

For QBasedPolicy, our basic assumption is that a learner is used to predict the action values and an explorer is used to sample actions (or even provide the action probabilities). So we do not need the concept of a mapping as in VBasedPolicy (or, you could say the explorer is a specific kind of mapping here).

... both the estimates and the policy itself would need to be stored in π_target...

By design, the estimators are always wrapped in a policy. So I'm not sure why that would be a problem here.

Let me know if anything is still unclear to you.

findmyway avatar Jun 09 '22 09:06 findmyway

hm, i might be missing something here, so just in case, i'll ask a clarifying question that should make it clear one way or the other:

what would you do to evaluate the action values of a given policy (either an AbstractPolicy or like a simple map from environment to action) while off-policy?

as far as i can tell, it's not possible. if it were, i would pass something like this to Agent():

OffPolicy(
    π_target = QBasedPolicy(
        learner = MonteCarloLearner(...), #where the estimates are stored
        explorer = explorer
    ),
    π_behavior = RandomPolicy(action_space(env))
)

but there's nowhere to pass the target policy. i mean, with a little bit of work, it could potentially be passed through the explorer, i suppose?

baedan avatar Jun 09 '22 10:06 baedan

what would you do to evaluate the action values of a given policy

Well, that mainly depends on the target policy you defined. So, if it is a QBasedPolicy like you defined above, then it's well defined by the inner learner. But if it is something else (like the VBasedPolicy), then we have to count on the action values being computable from it.

findmyway avatar Jun 09 '22 10:06 findmyway

but if i want to evaluate the action values, i have to use a QBasedPolicy for it to work, regardless of what policy i actually want to evaluate, no? in the same way, if i want to evaluate state values, i have to use a VBasedPolicy.

So, if it is a QBasedPolicy like you defined above, then it's well defined by the inner learner.

but the learner is where the estimates are continuously stored and updated during a trial. it doesn't specify the target policy (which might be stationary); it's a function of the target policy.

OffPolicy(
    π_target = VBasedPolicy(
        learner=MonteCarloLearner(
            approximator=(
                TabularVApproximator(;n_state=NS, opt=Descent(1.0)), 
                TabularVApproximator(;n_state=NS, opt=InvDecay(1.0))
            ),
            kind=FIRST_VISIT,
            sampling=ORDINARY_IMPORTANCE_SAMPLING
        ),
        mapping=target_policy_mapping
    ),
    π_behavior = RandomPolicy(Base.OneTo(2))
)

in this example, the learner we pass to VBasedPolicy is empty (or initialized in whatever way). it doesn't matter what policy i'm evaluating, because i need to pass it through mapping anyway. and because i want to evaluate state values, i have to use a VBasedPolicy.

baedan avatar Jun 09 '22 10:06 baedan

i feel like we're talking past each other, and there's gotta be some concepts that we are defining very differently, haha

baedan avatar Jun 09 '22 11:06 baedan

i feel like we're talking past each other, and there's gotta be some concepts that we are defining very differently, haha

Haha, I think the discussions here will help us both gain a better understanding of how to design a set of more intuitive interfaces.


but if i want to evaluate the action values, i have to use a QBasedPolicy for it to work, regardless of what policy i actually want to evaluate, no?

Actually, no.

QBasedPolicy is just a generalized view over a collection of policies that focus on Q-values. But that doesn't mean one can only use QBasedPolicy to leverage Q-values. A typical example is the actor-critic family of algorithms.

https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/2e1de3e5b6b8224f50b3d11bba7e1d2d72c6ef7c/src/ReinforcementLearningZoo/src/algorithms/policy_gradient/A2C.jl#L15-L16

We do not have a learner here, just a raw approximator, which is an ActorCritic structure. And it can provide both Q-values and V-values.

So my point here is that those general policies (like QBasedPolicy and VBasedPolicy) provided by RL.jl have some undocumented scope. When they don't fit a specific scenario, I'd usually consider:

  1. Add more specific implementations to reuse existing code as much as possible. (Those built-in policies are scaffolds, not shackles.)
  2. Define a new dedicated policy for it. (New APIs in the next release will make this very easy.)

findmyway avatar Jun 09 '22 13:06 findmyway

thanks for the responses. i'm tapped out for the day, will think about it more tomorrow

baedan avatar Jun 09 '22 13:06 baedan

i think i understand the disconnect now.

a policy in this package is not just the classically defined, stationary map from a state to a probability distribution over the action space (let's call it a raw policy). it also defines the continuous process by which it interacts with the environment and modifies itself. a QBasedPolicy equipped with a TDLearner improves the underlying raw policy over time with the TD method. even though this method can be understood in the GPI framework, it does so without an explicit notion of prediction. however, we can force a VBasedPolicy to perform prediction without improvement by specifying our stationary raw policy within mapping, and then retrieve the value estimates from the approximator. but there's no easy way to do this with QBasedPolicy, i think.

i was trying to come up with a way to do this while leveraging existing code. my first inclination was to define a wrapper PredictionPolicy <: AbstractPolicy with two fields: a default/empty ::AbstractPolicy (such as a VBasedPolicy), and a target policy ::AbstractPolicy which would simply be used as a raw policy. the first field would then both specify the prediction algorithm and store the prediction result. this would require changes to VBasedPolicy, QBasedPolicy, and even OffPolicy, however. so what if we put an optional field in V/QBasedPolicy that specifies a target raw policy, whose presence means it is running in prediction mode?
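
as a sketch of what i have in mind (a made-up type, not existing API; the method signatures are only illustrative):

struct PredictionPolicy{E<:AbstractPolicy,T<:AbstractPolicy} <: AbstractPolicy
    estimator::E  # e.g. an empty VBasedPolicy: specifies the prediction algorithm and stores the estimates
    target::T     # the raw policy being evaluated
end

(p::PredictionPolicy)(env) = p.target(env)  # act with the raw target policy
RLBase.prob(p::PredictionPolicy, env, a) = prob(p.target, env, a)  # needed for importance-sampling ratios
RLBase.update!(p::PredictionPolicy, args...) = update!(p.estimator, args...)  # all learning goes into the estimator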

baedan avatar Jun 14 '22 11:06 baedan

Almost there.

a policy in this package is not just the classically defined, stationary map from a state to a probability distribution over the action space (let's call it a raw policy).

Correct. As you've found, this is kind of counter-intuitive, so I've made a change in the upcoming release: each policy is a raw policy as you described above.

The policy prediction and policy improvement steps are explicitly separated.

https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/c70f4f057202d6e02a39399458160002756b63ed/src/ReinforcementLearningBase/src/interface.jl#L40

https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/c70f4f057202d6e02a39399458160002756b63ed/src/ReinforcementLearningCore/src/core/stages.jl#L20

In prior releases, (p::Agent)(env) might change its internal experience replay buffer and update (improve) its internal policies. In the future, all policies will act like the raw policy you described above: (p::AbstractPolicy)(env) should never improve its internal policies. Users are encouraged to call optimise!(p::AbstractPolicy) explicitly to improve the policies.
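
Roughly, the intended interaction pattern will look like this (just a sketch with assumed names, the final API may differ):

for episode in 1:n_episodes
    reset!(env)
    while !is_terminated(env)
        action = policy(env)  # a raw policy: selecting an action never modifies the policy
        env(action)
        # (a surrounding Agent records the transition into its trajectory here)
    end
    optimise!(policy)  # improvement only happens on this explicit call
end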

so what if we put an optional field in V/QBasedPolicy that specifies a target raw policy, whose presence means it is running in prediction mode?

Actually, the Agent type exists for exactly the purpose you mentioned in your second paragraph. It serves as a wrapper around an arbitrary raw policy and provides a trajectory (the experience replay buffer) for improving it.

When we want to switch the Agent from improvement mode to prediction mode, we simply extract the inner policy (remember that the inner policy is a raw policy) for evaluation.
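
For example (a sketch, assuming the Agent keeps the wrapped policy in its policy field):

raw_policy = agent.policy                      # the raw policy, detached from the replay buffer
run(raw_policy, env, StopAfterEpisode(1_000))  # evaluation run; nothing calls optimise! on it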

findmyway avatar Jun 14 '22 13:06 findmyway

well, i can't wait for the next release. :D

one thing though: what i mean by policy evaluation or prediction is not (p::AbstractPolicy)(env), but evaluating the state/action values of a policy, as defined in sutton and barto's book. in their framework of generalized policy iteration, two processes drive a policy towards optimality: policy evaluation (determining the state/action values of a policy) and policy improvement (updating the policy so that it's ~greedy w.r.t. those state/action values). you know this of course, but reading the thread back, i'm realizing that the terminology has been a huge source of confusion.

https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/639717388fb41199c98b90406bea76232bc6294d/src/ReinforcementLearningZoo/src/algorithms/tabular/policy_iteration.jl#L18-L41

the DP policy iteration algorithm is a special case of this. conceptually, a policy evaluation pass should return an Approximator that contains the estimates.
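
something like this is the shape i'm picturing (every name here is made up, a plain Dict stands in for the Approximator, and it's on-policy MC just to keep it short):

function evaluate_state_values(target, env; n_episodes = 10_000, γ = 1.0)
    V = Dict{Any,Float64}()   # state => value estimate
    counts = Dict{Any,Int}()  # state => number of visits
    for _ in 1:n_episodes
        reset!(env)
        episode = Tuple{Any,Float64}[]
        while !is_terminated(env)
            s = state(env)
            env(target(env))  # follow the target policy
            push!(episode, (s, reward(env)))
        end
        G = 0.0
        for (s, r) in reverse(episode)  # every-visit monte carlo returns
            G = r + γ * G
            counts[s] = get(counts, s, 0) + 1
            old = get(V, s, 0.0)
            V[s] = old + (G - old) / counts[s]
        end
    end
    V  # the estimates are the result; the target policy itself is untouched
end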

When we want to switch the Agent from improvement mode to prediction mode, we simply extract the inner policy (remember that the inner policy is a raw policy) for evaluation.

do you mean this is already the case now? i would argue that a policy wrapped by an Agent is not a raw policy, since it specifies how it would like to continuously interact with an environment, even though it can't do so by itself.

baedan avatar Jun 14 '22 17:06 baedan

one thing though: what i mean by policy evaluation or prediction is not (p::AbstractPolicy)(env), but evaluating the state/action values of a policy, as defined in sutton and barto's book. in their framework of generalized policy iteration, two processes drive a policy towards optimality: policy evaluation (determining the state/action values of a policy) and policy improvement (updating the policy so that it's ~greedy w.r.t. those state/action values). you know this of course, but reading the thread back, i'm realizing that the terminology has been a huge source of confusion.

Now I see. Indeed, the terminology has been a huge source of confusion here in our discussions above.

policy_evaluation and policy_improvement were implemented at a very early stage of this package, so they do not fit the concepts of AbstractPolicy and the Base.run function very well. And now I get why you'd like an OffPolicy-like structure to simulate the evaluation and improvement steps.

Now back to your original question.

how do you think this can be handled? maybe have a field in Agent() that specifies whether it's running in prediction/estimate mode and what kind of estimate it's running, and put the estimate in a hook? idk, that seems a bit ugly. the logic for estimation and improvement is so similar that there has to be a more elegant solution, i feel.

TBH, I don't have a very elegant way to handle it either. I think you might borrow some ideas from Double DQN:

https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/c70f4f057202d6e02a39399458160002756b63ed/src/ReinforcementLearningCore/src/utils/networks.jl#L384-L415

In the policy evaluation stage, we update the target network. But in the policy improvement stage, we sync the target network (or approximator) and update the mapping or the explorer.

findmyway avatar Jun 15 '22 01:06 findmyway

thanks, i'll look into it. all in all, this thread did make me think more deeply about the various design aspects, which are indeed challenging.

baedan avatar Jun 17 '22 04:06 baedan