
questions while looking at implementation of VPG

Open · baedan opened this issue on Jul 11, 2022 · 3 comments

hello! while going through vpg.jl i had some odds-and-ends questions. still pretty new to julia and especially Flux.jl, so please bear with me :D

  1. i don't understand the point of having separate NeuralNetworkApproximator and LinearApproximator types: it seems to me that they are no more than wrappers around an approximator function, which can be of whatever type, plus an optimizer. conceptually, they do the same things: forward pass, optimize. shouldn't this be abstracted away as a FunctionApproximator?
  2. related to 1, loss functions are hardcoded here, as opposed to being a parameter of an Approximator.
  3. should we use something like Flux.Losses.logitcrossentropy rather than hand-coding it?
  4. dist is a weird field. the docstring "distribution function of the action" isn't much help, either. the example experiment that uses dist = Categorical gave me some clue; but in any case, this seems to me like something that should be abstracted.
  5. action_space looks weird to me as something you pass to the policy. when sampling an action from an environment via policy(env), we can retrieve action_space from the environment. when optimizing a policy, we only need to know whether the approximator is continuous or discrete (and it feels like it should be abstracted away anyway?). additionally, this is problematic when the environment is of FULL_ACTION_SET.
  6. related to 5, is the following good julia style? does the advice here apply? https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/318f0f1d2868adcb96050fbfe5894509287a3d37/src/ReinforcementLearningZoo/src/algorithms/policy_gradient/vpg.jl#L55-L63
  7. the two fields loss and baseline_loss aren't used at all, so i assume they are meant to be exposed for logging purposes. should they be put inside an approximator instead?

baedan · Jul 11 '22 10:07

Hi @baedan ,

Glad to see you again here! 🤗

  1. i don't understand the point of having separate NeuralNetworkApproximator and LinearApproximator types: it seems to me that they are no more than wrappers around an approximator function, which can be of whatever type, plus an optimizer. conceptually, they do the same things: forward pass, optimize. shouldn't this be abstracted away as a FunctionApproximator?

That was for historical reasons. In the master branch, we have only one Approximator defined. See https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/master/src/ReinforcementLearningCore/src/policies/learners.jl for more details.
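
For context, the unified approximator is conceptually just a model bundled with its optimiser. A minimal sketch of the idea (illustrative only; the field and function names here are not the actual definition in master):

```julia
using Flux

# Illustrative sketch only; the real `Approximator` in
# ReinforcementLearningCore differs in its details.
struct Approximator{M,O}
    model::M
    optimiser::O
end

# Forward pass: delegate to the wrapped model.
(app::Approximator)(x) = app.model(x)

# One optimisation step, given gradients of some policy-specific loss.
function optimise!(app::Approximator, gs)
    Flux.Optimise.update!(app.optimiser, Flux.params(app.model), gs)
end
```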

  2. related to 1, loss functions are hardcoded here, as opposed to being a parameter of an Approximator.

In many cases, the choice of loss function is policy-specific, so I didn't add it to the Approximator.

  3. should we use something like Flux.Losses.logitcrossentropy rather than hand-coding it?

Yes.
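
For example, a small standalone snippet (not the VPG loss itself) showing why the built-in is preferable: it takes the raw logits directly and is more numerically stable than applying softmax and log by hand.

```julia
using Flux
using Flux.Losses: logitcrossentropy

logits = randn(Float32, 4, 8)             # 4 actions, batch of 8
y = Flux.onehotbatch(rand(1:4, 8), 1:4)   # one-hot targets

# hand-coded: softmax, then log -- numerically less stable
loss_manual = -sum(y .* log.(softmax(logits))) / size(y, 2)

# built-in: takes raw logits and fuses log + softmax internally
loss_builtin = logitcrossentropy(logits, y)

loss_manual ≈ loss_builtin   # true, up to floating-point error
```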

  4. dist is a weird field. the docstring "distribution function of the action" isn't much help, either. the example experiment that uses dist = Categorical gave me some clue; but in any case, this seems to me like something that should be abstracted.

Here dist is used to generate the action distribution based on the output of the model (in the approximator). I'm afraid it is not that easy to turn it into an abstract part, because not all algorithms support both continuous and discrete action spaces.
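
For example, with dist = Categorical the flow looks roughly like this (a sketch of the idea, not the exact code path in vpg.jl):

```julia
using Flux, Distributions

model = Chain(Dense(4, 32, relu), Dense(32, 2))   # e.g. 4 observations, 2 actions

s     = rand(Float32, 4)
probs = softmax(model(s))       # model output -> action probabilities
d     = Categorical(probs)      # `dist = Categorical` turns them into a distribution

a = rand(d)                     # sample an action
logpdf(d, a)                    # log-probability used in the policy-gradient loss
```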

  5. action_space looks weird to me as something you pass to the policy. when sampling an action from an environment via policy(env), we can retrieve action_space from the environment. when optimizing a policy, we only need to know whether the approximator is continuous or discrete (and it feels like it should be abstracted away anyway?). additionally, this is problematic when the environment is of FULL_ACTION_SET.

Agree. We may remove it later.

  6. related to 5, is the following good julia style? does the advice here apply? https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/318f0f1d2868adcb96050fbfe5894509287a3d37/src/ReinforcementLearningZoo/src/algorithms/policy_gradient/vpg.jl#L55-L63

Personally, I think the advice doesn't apply here, because we have some extra lines above:

https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/318f0f1d2868adcb96050fbfe5894509287a3d37/src/ReinforcementLearningZoo/src/algorithms/policy_gradient/vpg.jl#L51-L53

This means that in each dispatched method we would need to copy & paste those lines (or wrap them in a helper function). So we simply use if/else here.

Also note the last sentence in the advice above:

It should however be noted that the compiler is quite efficient at optimizing away the dead branches in code written as the mynorm example.

I believe it can be optimized away here, because the type of VPGPolicy.action_space is a type parameter of VPGPolicy. Anyway, you should stick with the advice in most cases.
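
A tiny generic illustration of that point (the types here are made up and have nothing to do with the actual VPGPolicy definition):

```julia
# The branch condition only depends on the type of `action_space`, which is
# part of the struct's type parameter, so the compiler can drop the dead
# branch when it specialises `act` for a concrete `Policy{S}`.
struct Policy{S}
    action_space::S
end

function act(p::Policy, logits)
    shared = logits .- maximum(logits)        # shared preamble for both branches
    if p.action_space isa AbstractVector      # "discrete" case
        argmax(shared)
    else                                      # "continuous" case
        first(shared)
    end
end

act(Policy(1:2), [0.3, 0.9])           # discrete branch; the other branch is dead code
act(Policy((-1.0, 1.0)), [0.3, 0.9])   # continuous branch
```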

  7. the two fields loss and baseline_loss aren't used at all, so i assume they are meant to be exposed for logging purposes. should they be put inside an approximator instead?

They are used for logging during training. I understand it is kind of strange to include them in the definition of a policy. This issue will be addressed in https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/issues/725 soon.

findmyway · Jul 11 '22 14:07

Glad to see you again here! 🤗

i've been giving myself a crash course on deep learning :D mostly done with that now.

good to hear about Approximator.

dist and action space type do seem a bit tricky. i do think the snippet in 6 could probably be put into a function: it would take the model output and a dist function as input, dispatch on the action space type, and return a Distribution object. (note that currently the semantics of dist differ depending on the action space type: for a discrete action space, it's the function applied after the NN output gets softmaxed; for a continuous action space, there's no softmax step.) by default, a discrete action space would apply softmax + Categorical; a continuous action space would use something like Gaussian($\theta$...).
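
something like this, maybe (a very rough sketch; the space types below are made-up placeholders, not the actual ones in RLBase):

```julia
using Flux, Distributions

# Placeholder space types just for the sketch.
abstract type AbstractActionSpace end
struct DiscreteSpace   <: AbstractActionSpace end
struct ContinuousSpace <: AbstractActionSpace end

# Discrete: softmax the raw logits, then wrap them (default dist = Categorical).
action_distribution(::DiscreteSpace, dist, model_out) = dist(softmax(model_out))

# Continuous: interpret the output as (mean, log-std); default dist = Normal.
function action_distribution(::ContinuousSpace, dist, model_out)
    μ, logσ = model_out[1], model_out[2]
    dist(μ, exp(logσ))
end

action_distribution(DiscreteSpace(), Categorical, [1.0f0, -0.5f0])   # raw logits in
action_distribution(ContinuousSpace(), Normal, [0.1f0, -0.5f0])      # μ, logσ in
```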

this would tighten up loss calculation as well: https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/318f0f1d2868adcb96050fbfe5894509287a3d37/src/ReinforcementLearningZoo/src/algorithms/policy_gradient/vpg.jl#L131-L144

baedan · Jul 12 '22 05:07

some more stray thoughts:

  1. i don't think this line works when the action space is a cartesian product of discrete spaces, right? in that case, if i remember correctly, indexing the array would just return the first dimension. https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/318f0f1d2868adcb96050fbfe5894509287a3d37/src/ReinforcementLearningZoo/src/algorithms/policy_gradient/vpg.jl#L57

  2. it's funny, i think you can pass the appropriate dist to VPG so that it becomes DQN, depending on the loss function

baedan · Jul 12 '22 06:07