[question] PPO2 pretrain always resets weights?
What I am trying to do is to perform pretraining on a previously trained model. However, when I was using the pretrain method from PPO2, I observed some strange behaviour.
First, this method works well (a minimal sketch follows the list):
- construct a PPO2 model from scratch using: model = PPO2(...)
- perform pretraining with the expert dataset using model.pretrain(...)
- then further improve the model using model.learn(...)
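A minimal sketch of that flow, assuming an expert dataset recorded beforehand with generate_expert_traj (the file name and hyperparameters below are placeholders):

```python
from stable_baselines import PPO2
from stable_baselines.gail import ExpertDataset

# Expert data recorded earlier with generate_expert_traj (placeholder file name)
dataset = ExpertDataset(expert_path="expert_cartpole.npz", batch_size=64)

model = PPO2("MlpPolicy", "CartPole-v1", verbose=1)  # fresh model from scratch
model.pretrain(dataset, n_epochs=100)                # behaviour cloning on the expert data
model.learn(total_timesteps=100000)                  # then regular RL training
```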
Then, I did this, which didn't make sense to me:
- loaded a previously trained PPO2 model using: model = PPO2.load(...)
- performed pretraining on the expert dataset using model.pretrain(...) with learning_rate = 0
- observed that the weight parameters changed
Next, I tested this:
- constructed a PPO2 model from scratch
- performed pretraining on the expert dataset with learning_rate = 0
- did not observe any weight parameter changes (a sketch of this check follows below)
env: CartPole-v1; version: stable-baselines 2.9.0
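For reference, a sketch of the kind of check described above, comparing parameters before and after a pretrain() call with learning_rate = 0 (the model path and dataset file are placeholders):

```python
import numpy as np
from stable_baselines import PPO2
from stable_baselines.gail import ExpertDataset

dataset = ExpertDataset(expert_path="expert_cartpole.npz", batch_size=64)
model = PPO2.load("previously_trained_ppo2")  # previously trained model

before = model.get_parameters()               # dict: variable name -> ndarray
model.pretrain(dataset, n_epochs=1, learning_rate=0.0)
after = model.get_parameters()

changed = [name for name in before if not np.allclose(before[name], after[name])]
print(changed)  # per the report above: non-empty for a loaded model, empty for a fresh one
```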
So my hypothesis is that the pretrain method cannot be used on previously trained models. Is this intentional? Intuitively, this feels like a bug to me, but I am not too sure.
Can someone explain this to me?
Yes, this seems to be by design. I imagine the intuition is in the naming `pretrain`, but I also agree it is somewhat unexpected. Personally, I believe even slight pretraining after RL training will not work too well, partly because it can easily destroy whatever RL training achieved (personal experience) and it will also offset the value function from the policy, so I am not sure whether "pretraining after RL training" should be allowed unless done carefully.
In any case, you can try it by commenting out that initialization line in pretrain() and manually making sure the variables have been initialized.
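One way to do the manual part (a sketch, using a standard TF1 pattern rather than anything from stable-baselines itself) is to initialize only the variables that have no value yet, e.g. the Adam slots created by pretrain(), while leaving the trained weights untouched:

```python
import tensorflow as tf

def initialize_uninitialized(sess):
    """Initialize only variables that have no value yet (e.g. new optimizer slots)."""
    global_vars = tf.global_variables()
    is_initialized = sess.run([tf.is_variable_initialized(var) for var in global_vars])
    not_initialized = [var for var, init in zip(global_vars, is_initialized) if not init]
    if not_initialized:
        sess.run(tf.variables_initializer(not_initialized))
```

It would need to be called with `model.sess` while `model.graph` is the default graph, in place of the global initializer call.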
@AdamGleave I believe this code is from your side. Any thoughts on skipping init if the model was already initialized, or should we prevent/warn about using `pretrain` after `train`?
Thanks for the fast reply, @Miffyli.
It seems that pretrain updates both the value network and the policy network when using expert data in the generate_expert_traj() format, which includes proper reward information. I guess this way it should have less effect on pushing the value function away from the policy.
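For context, a sketch of how such an expert dataset might be recorded (the expert model, file name and episode count are placeholders):

```python
from stable_baselines import PPO2
from stable_baselines.gail import generate_expert_traj

# Train (or load) a model to act as the expert, then record its trajectories.
expert_model = PPO2("MlpPolicy", "CartPole-v1", verbose=1)
expert_model.learn(total_timesteps=25000)

# Saves expert_cartpole.npz with observations, actions, rewards,
# episode_returns and episode_starts.
generate_expert_traj(expert_model, "expert_cartpole", n_episodes=10)
```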
Personally, I am trying a new way to solve complex RL problems by breaking a single problem down into a series of smaller problems. For each smaller problem, there is an expert dataset from experiments. I am hoping to obtain better convergence to a policy with the desired behaviour, since the problem may have many local optima with similar reward measures.
I will take a look at the source code for the initialization. Thanks again!
> @AdamGleave I believe this code is from your side. Any thoughts on skipping init if the model was already initialized, or should we prevent/warn about using `pretrain` after `train`?
I don't think I was involved with this code, or if I was I've forgotten about it ;)
Forcibly initializing in `pretrain` does seem unnecessarily aggressive, although I agree with you that even without this, `pretrain` after training is likely to perform poorly, at least without some careful tuning. I think the variables should already be initialized in the `setup_model()` of most algorithms, so we shouldn't need to do it again in `pretrain()`. I believe this code came from https://github.com/hill-a/stable-baselines/pull/206, so it is worth checking whether @araffin had a particular reason for wanting the initialization here.
> I don't think I was involved with this code, or if I was I've forgotten about it ;)
Yes, that was me. But @AdamGleave has a better repo for imitation learning (https://github.com/HumanCompatibleAI/imitation).
> worth checking whether @araffin had a particular reason for wanting the initialization here.
I don't remember any particular reason except avoiding errors (not sure). If you remove those lines and CI passes, then that would be an easy fix ;)
As mentioned before, the idea of pre-training was to do it before training. And the value function is not updated either (good point @Miffyli, I did some inconclusive experiments on that).
Just want to leave my final method here in case anyone wants to use the existing weights for pretraining as well:
I simply call the load_parameters() method with the existing model's weights after the global initializer, to overwrite the network weights. This way the global initializer takes care of parameter initialization for the Adam optimizer, and you also keep the existing weights you want.
In case you have concerns about model convergence after such pretraining due to incorrect value estimates, you can simply remove the value function parameters when calling load_parameters().
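A rough sketch of how this could look (the path, env and the "vf" name filter are assumptions; note that in 2.9.0 the global initializer runs inside pretrain() itself, so the load_parameters() call would need to happen right after that initializer, e.g. in a locally patched copy of pretrain()):

```python
from stable_baselines import PPO2

# Parameters from the previously trained model (placeholder path)
trained = PPO2.load("previously_trained_ppo2")
params = trained.get_parameters()  # dict: variable name -> ndarray

# Optionally drop the value-function weights to avoid stale value estimates.
# The name filter assumes the default MlpPolicy naming with separate pi/vf branches.
policy_only = {name: value for name, value in params.items() if "vf" not in name}

model = PPO2("MlpPolicy", "CartPole-v1", verbose=1)
# ... after the global initializer has run (it also sets up the Adam state),
# overwrite only the weights we care about:
model.load_parameters(policy_only, exact_match=False)
```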
Quick question on this topic. @araffin commented on July 3rd that '...the value function is not updated either.' Just wanted to confirm: does this mean that pretraining the model does NOT pretrain the value function and ONLY trains the agent to replicate the actions from the expert dataset?
@eflopez1 That's correct. Only policy/pi/actor part is updated during pre-training, not value/v/critic part.