HuggingFace models outdated

Open dfilan opened this issue 2 years ago • 8 comments

Bug description

Attempting to load SB3 models from HuggingFace in serialize.py often raises a FileExistsError that tells us "Outdated policy format: we do not support restoring normalization statistics from '{vec_normalize_path}'". This happens with the following environments:

  • seals/Ant-v0
  • seals/HalfCheetah-v0
  • seals/Hopper-v0
  • seals/Humanoid-v0
  • seals/MountainCar-v0
  • seals/Swimmer-v0
  • seals/Walker2d-v0

The solution is probably to retrain the experts for these environments and re-upload them.

[EDIT: ran into this problem running experiments/rollouts_from_policies.sh in #572.]

dfilan avatar Oct 10 '22 12:10 dfilan

Oh, just realized that error message should be an f-string; will make a quick PR for that. [EDIT: said PR is #577]

dfilan avatar Oct 10 '22 13:10 dfilan

We removed the logic to save normalization statistics long before we added support for Hugging Face, so I'm confused by this. I suspect the error message may be misleading in some way; we should figure out the root cause before retraining. What's a minimal command to reproduce this error?

AdamGleave avatar Oct 10 '22 17:10 AdamGleave

I can't replicate this error:

$ python -m imitation.scripts.eval_policy with seals_ant
INFO - eval_policy - Running command 'eval_policy'
INFO - eval_policy - Started run with ID "177"
INFO - imitation.scripts.common.common - Logging to output/eval_policy/seals_Ant-v0/20221010_123805_9cf5ab
Downloading: 100%|██████████| 323k/323k [00:00<00:00, 1.22MB/s]
INFO - root - Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from '/home/adam/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/ppo-seals-Ant-v0.zip'
INFO - eval_policy - Result: {'n_traj': 16, 'monitor_return_len': 16, 'return_min': -388.0332974897676, 'return_mean': -305.96807839663273, 'return_std': 125.22417901084856, 'return_max': 140.2589728364431, 'len_min': 1000, 'len_mean': 1000.0, 'len_std': 0.0, 'len_max': 1000, 'monitor_return_min': -388.033297, 'monitor_return_mean': -305.96807843749997, 'monitor_return_std': 125.22417901660532, 'monitor_return_max': 140.258973}
INFO - eval_policy - Completed after 0:00:14
/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:179: UserWarning: tee_stdout.wait timeout. Forcibly terminating.
  warnings.warn("tee_stdout.wait timeout. Forcibly terminating.")
/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:185: UserWarning: tee_stderr.wait timeout. Forcibly terminating.
  warnings.warn("tee_stderr.wait timeout. Forcibly terminating.")

completes successfully on my machine. experiments/rollouts_from_policies.sh also seems to work:

(venv) adam@puffin:~/dev/imitation$ CUDA_VISIBLE_DEVICES="" ./experiments/rollouts_from_policies.sh
Writing logs in output/train_experts/2022-10-10T12:46:31-07:00, and saving rollouts in output/train_experts/2022-10-10T12:46:31-07:00/expert_models/*/rollouts/
(venv) adam@puffin:~/dev/imitation$ ls -la output/train_experts/2022-10-10T12:46:31-07:00/expert_models/*/rollouts/
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_ant_0/rollouts/':
total 16180
drwxrwxr-x 2 adam adam     4096 Oct 10 12:47 .
drwxrwxr-x 3 adam adam     4096 Oct 10 12:47 ..
-rw-rw-r-- 1 adam adam 16556628 Oct 10 12:47 final.pkl

'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_cartpole_0/rollouts/':
total 344
drwxrwxr-x 2 adam adam   4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam   4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 341027 Oct 10 12:46 final.pkl

'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_half_cheetah_0/rollouts/':
total 8948
drwxrwxr-x 2 adam adam    4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam    4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 9153145 Oct 10 12:46 final.pkl

'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_hopper_0/rollouts/':
total 5748
drwxrwxr-x 2 adam adam    4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam    4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 5876381 Oct 10 12:46 final.pkl

'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_humanoid_0/rollouts/':
total 493756
drwxrwxr-x 2 adam adam      4096 Oct 10 12:49 .
drwxrwxr-x 3 adam adam      4096 Oct 10 12:48 ..
-rw-rw-r-- 1 adam adam 505593605 Oct 10 12:49 final.pkl

'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_mountain_car_0/rollouts/':
total 88
drwxrwxr-x 2 adam adam  4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam  4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 78978 Oct 10 12:46 final.pkl

'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_swimmer_0/rollouts/':
total 8544
drwxrwxr-x 2 adam adam    4096 Oct 10 12:47 .
drwxrwxr-x 3 adam adam    4096 Oct 10 12:47 ..
-rw-rw-r-- 1 adam adam 8739476 Oct 10 12:47 final.pkl

'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_walker_0/rollouts/':
total 6828
drwxrwxr-x 2 adam adam    4096 Oct 10 12:47 .
drwxrwxr-x 3 adam adam    4096 Oct 10 12:47 ..
-rw-rw-r-- 1 adam adam 6981235 Oct 10 12:47 final.pkl

AdamGleave avatar Oct 10 '22 19:10 AdamGleave

The result of me running it locally on a fresh clone of imitation:

[15:56:30] daniel@sanrensei:~/imitation$ python -m imitation.scripts.eval_policy with seals_ant
INFO - eval_policy - Running command 'eval_policy'
INFO - eval_policy - Started run with ID "2"
INFO - imitation.scripts.common.common - Logging to /home/daniel/imitation/output/eval_policy/seals_Ant-v0/20221011_155637_7b63fc
INFO - root - Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from '/home/daniel/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/ppo-seals-Ant-v0.zip'
/home/daniel/.pyenv/versions/imitation_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:179: UserWarning: tee_stdout.wait timeout. Forcibly terminating.
  warnings.warn("tee_stdout.wait timeout. Forcibly terminating.")
/home/daniel/.pyenv/versions/imitation_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:185: UserWarning: tee_stderr.wait timeout. Forcibly terminating.
  warnings.warn("tee_stderr.wait timeout. Forcibly terminating.")
ERROR - eval_policy - Failed after 0:00:09!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/daniel/imitation/src/imitation/scripts/eval_policy.py", line 100, in eval_policy
    expert.get_expert_policy(venv),
  File "/home/daniel/imitation/src/imitation/scripts/common/expert.py", line 45, in get_expert_policy
    return serialize.load_policy(policy_type, venv, **loader_kwargs)
  File "/home/daniel/imitation/src/imitation/policies/serialize.py", line 180, in load_policy
    return agent_loader(venv, **kwargs)
  File "/home/daniel/imitation/src/imitation/policies/serialize.py", line 118, in f
    model = load_stable_baselines_model(cls, filename, venv)
  File "/home/daniel/imitation/src/imitation/policies/serialize.py", line 67, in load_stable_baselines_model
    raise FileExistsError(
FileExistsError: Outdated policy format: we do not support restoring normalization statistics from '/home/daniel/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/vec_normalize.pkl'
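
For context, the check in serialize.py that raises this appears to be roughly the following (a paraphrased sketch based on the traceback and error message, not the exact source):

import os

def load_stable_baselines_model(cls, path, venv):  # simplified signature for the sketch
    # If a legacy vec_normalize.pkl sits next to the policy zip, refuse to load:
    # imitation no longer supports restoring VecNormalize statistics.
    vec_normalize_path = os.path.join(os.path.dirname(path), "vec_normalize.pkl")
    if os.path.exists(vec_normalize_path):
        raise FileExistsError(
            "Outdated policy format: we do not support restoring normalization "
            f"statistics from '{vec_normalize_path}'"
        )
    return cls.load(path, env=venv)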

dfilan avatar Oct 11 '22 13:10 dfilan

It does seem to run on a remote server though, so now I'm confused.

(fresh_venv) daniel@rnn:~/imitation$ python -m imitation.scripts.eval_policy with seals_ant
INFO - eval_policy - Running command 'eval_policy'
INFO - eval_policy - Started run with ID "21"
INFO - imitation.scripts.common.common - Logging to /home/daniel/imitation/output/eval_policy/seals_Ant-v0/20221011_071510_1ac016
INFO - root - Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from '/home/daniel/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/ppo-seals-Ant-v0.zip'
INFO - eval_policy - Result: {'n_traj': 16, 'monitor_return_len': 16, 'return_min': -487.502474231496, 'return_mean': -374.8772107723473, 'return_std': 109.11836084542307, 'return_max': -96.24885182454736, 'len_min': 1000, 'len_mean': 1000.0, 'len_std': 0.0, 'len_max': 1000, 'monitor_return_min': -487.502474, 'monitor_return_mean': -374.87721075, 'monitor_return_std': 109.11836078623155, 'monitor_return_max': -96.248852}
INFO - eval_policy - Completed after 0:00:35
/home/daniel/imitation/fresh_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:179: UserWarning: tee_stdout.wait timeout. Forcibly terminating.
  warnings.warn("tee_stdout.wait timeout. Forcibly terminating.")
/home/daniel/imitation/fresh_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:185: UserWarning: tee_stderr.wait timeout. Forcibly terminating.
  warnings.warn("tee_stderr.wait timeout. Forcibly terminating.")

dfilan avatar Oct 11 '22 14:10 dfilan

Adding to the mystery, there is a file named vec_normalize.pkl in the HuggingFace repo.

dfilan avatar Oct 11 '22 14:10 dfilan

So, it looks like the core issue is that on my local machine, the cited directory really does contain a vec_normalize.pkl file:

[16:46:30] daniel@sanrensei:~/imitation$ ls ~/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/
args.yml    env_kwargs.yml        train_eval_metrics.zip
config.yml  ppo-seals-Ant-v0.zip  vec_normalize.pkl

but not so on RNN:

(fresh_venv) daniel@rnn:~/imitation$ ls ~/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/
ppo-seals-Ant-v0.zip

No idea why this would be different.

dfilan avatar Oct 11 '22 14:10 dfilan

Is the huggingface-sb3 package the same version?

AdamGleave avatar Oct 11 '22 16:10 AdamGleave

Yes - pip show huggingface-sb3 says it's version 2.2.3 on both.

dfilan avatar Oct 12 '22 09:10 dfilan

Hm, well it does indeed appear that vec_normalize.pkl is present in the HuggingFace repo: https://huggingface.co/HumanCompatibleAI/ppo-seals-Ant-v0/tree/main

I'm frankly a bit confused by this. Searching for vec_normalize.pkl in rl-baselines3-zoo doesn't find anything, so it doesn't seem to be saved by that codebase, which I thought was what was used to generate these experts. It was saved by old versions of train_rl, but that's been deprecated for a while now.

@ernestum any idea what's going on here? I'm a bit worried that the experts aren't going to get normalized properly when we restore them, given that imitation handles normalization differently from SB3 upstream. Should we switch to using train_rl for the experts? I know your preference was to unify things, and I think araffin and others want to stick with VecNormalize for SB3 (which makes sense), but it's an utter nightmare to use in our codebase, so I really don't want to go back to that.

AdamGleave avatar Oct 12 '22 19:10 AdamGleave

The vec_normalize.pkl file gets downloaded when we load the trained agent using the rl-zoo library. Imitation only downloads the model zip file. So to replicate the error, first load the agent using the rl-zoo library and then run the eval_policy script.

python3 -m utils.load_from_hub --algo ppo --env seals-Ant-v0 -f logs/ -orga HumanCompatibleAI
python3 -m imitation.scripts.eval_policy with seals_ant

taufeeque9 avatar Oct 13 '22 01:10 taufeeque9

Thanks @taufeeque9! Unfortunately, although just ignoring vec_normalize.pkl avoids the runtime error, there's still the issue that imitation is then not using the normalization stats at all. I think we need to retrain the experts to use NormalizeFeaturesExtractor rather than the deprecated VecNormalize approach. Assigning to @ernestum.

AdamGleave avatar Oct 13 '22 01:10 AdamGleave

The rl-zoo uses a VecNormalize environment wrapper to normalize observations as well as rewards. Whenever a model is loaded using the zoo scripts (such as to continue training or to evaluate the model), the appropriate VecNormalize wrapper is added to the environment.

The imitation library does not do this when loading a model. Therefore, when a model that was pushed to HuggingFace using rl-zoo is loaded in imitation, it sees unnormalized observations and rewards. The rewards would only be relevant if we were to continue training with our train_rl.py script, which is not one of our use cases, I guess. Can you confirm this @AdamGleave? The missing observation normalization is more severe, since the model will obviously perform worse when it sees unprocessed observations.
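
For reference, the zoo-style restore looks roughly like this (an illustrative sketch using the standard SB3 API; the log paths are made up):

import gym
import seals  # noqa: F401 -- registers the seals/* environments

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Wrap the evaluation env with the saved VecNormalize statistics so the policy
# sees normalized observations, and freeze the statistics for evaluation.
venv = DummyVecEnv([lambda: gym.make("seals/Ant-v0")])
venv = VecNormalize.load("logs/ppo/seals-Ant-v0_1/vec_normalize.pkl", venv)
venv.training = False      # do not update the running statistics at eval time
venv.norm_reward = False   # report raw episode rewards
model = PPO.load("logs/ppo/seals-Ant-v0_1/seals-Ant-v0.zip", env=venv)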

We use an imitation.policies.base.NormalizeFeaturesExtractor to do observation normalization, while sb3-zoo uses the VecNormalize environment wrapper for that. So in theory, changing the hyperparameters in the zoo to something like:

  normalize: dict(norm_obs=False, norm_reward=True)
  policy_kwargs: dict(features_extractor_class=imitation.policies.base.NormalizeFeaturesExtractor)

should solve the issue with observation normalization for us. However, the sb3-zoo won't find our NormalizeFeaturesExtractor because it cannot auto-import it. What does work is implementing a custom policy in imitation.policies.base:

class MlpPolicyWithNormalizeFeaturesExtractor(policies.ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs, features_extractor_class=NormalizeFeaturesExtractor)

and then setting the hyperparameters to

  normalize: dict(norm_obs=False, norm_reward=True)
  policy: 'imitation.policies.base.MlpPolicyWithNormalizeFeaturesExtractor'

This will store the running mean and variance inside the model and automatically normalize the observations before feeding them into the model.
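
To illustrate the mechanism (a simplified sketch, not imitation's actual NormalizeFeaturesExtractor), the key idea is to keep the running statistics as buffers of the policy network, so they are saved and restored with the model zip instead of a separate vec_normalize.pkl:

import torch
from stable_baselines3.common.torch_layers import FlattenExtractor

class RunningNormExtractor(FlattenExtractor):
    """Sketch: normalize flattened observations with running statistics that
    live as module buffers and therefore travel with the saved model."""

    def __init__(self, observation_space, eps: float = 1e-8):
        super().__init__(observation_space)
        self.eps = eps
        self.register_buffer("running_mean", torch.zeros(self.features_dim))
        self.register_buffer("running_var", torch.ones(self.features_dim))
        self.register_buffer("count", torch.tensor(0.0))

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        flat = super().forward(observations)
        if self.training:
            # Parallel (Chan et al.) update of the running mean and variance.
            batch_mean = flat.mean(dim=0)
            batch_var = flat.var(dim=0, unbiased=False)
            n = flat.shape[0]
            total = self.count + n
            delta = batch_mean - self.running_mean
            self.running_mean = self.running_mean + delta * n / total
            self.running_var = (
                self.running_var * self.count
                + batch_var * n
                + delta**2 * self.count * n / total
            ) / total
            self.count = total
        return (flat - self.running_mean) / torch.sqrt(self.running_var + self.eps)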

ernestum avatar Nov 16 '22 11:11 ernestum

The rewards would only be relevant if we were to continue training with our train_rl.py script, which is not one of our use cases, I guess. Can you confirm this @AdamGleave?

Yeah don't think we need to worry about this use case for now, and can always workaround it by just re-learning the reward normalization statistics on the fly in train_rl if we do need it in the future.

The missing observation normalization is more severe, since the model will obviously perform worse when it sees unprocessed observations.

Yep, that'll break policies!

should solve the issue with observation normalization for us. However, the sb3-zoo won't find our NormalizeFeaturesExtractor because it cannot auto-import it. What does work is implementing a custom policy in imitation.policies.base

class MlpPolicyWithNormalizeFeaturesExtractor(policies.ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs, features_extractor_class=NormalizeFeaturesExtractor)

and then setting the hyperparameters to

  normalize: dict(norm_obs=False, norm_reward=True)
  policy: 'imitation.policies.base.MlpPolicyWithNormalizeFeaturesExtractor'

This is hacky but seems like a viable workaround; it's certainly much better than replicating Zoo functionality ourselves.

Are we always using MlpPolicy? If not, perhaps we want something more like a FactoryPolicyWithNormalizeFeaturesExtractor that also takes the policy_class.
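
One possible shape for that factory (a hypothetical sketch, not existing imitation code):

from stable_baselines3.common import policies

from imitation.policies.base import NormalizeFeaturesExtractor

def policy_with_normalize_features_extractor(policy_class=policies.ActorCriticPolicy):
    """Hypothetical factory: wrap any SB3 policy class so that it defaults to
    NormalizeFeaturesExtractor."""

    class _Policy(policy_class):
        def __init__(self, *args, **kwargs):
            kwargs.setdefault("features_extractor_class", NormalizeFeaturesExtractor)
            super().__init__(*args, **kwargs)

    return _Policy

# The zoo still needs an importable name, so concrete classes would be defined once, e.g.:
# MlpPolicyWithNormalizeFeaturesExtractor = policy_with_normalize_features_extractor()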

I'd advocate against putting this in imitation.policies.base since it's quite a Zoo-specific hack, but you could make a new module for it like imitation.policies.sb3_zoo or the like.

Thanks for looking into it and figuring out a solution!

AdamGleave avatar Nov 26 '22 05:11 AdamGleave

I found a better solution: add support for using Python modules as configs instead of YAML files. That lets us solve all of this without hacks like factory policies. I have nearly finished this in https://github.com/DLR-RM/rl-baselines3-zoo/pull/318 and will add an example of how to use the new feature here soon.

ernestum avatar Nov 28 '22 10:11 ernestum

I re-trained the experts for all the above-mentioned envs (PPO and SAC where applicable).

We can now specify the normalization like this:

    "seals/MountainCar-v0": dict(
        normalize=dict(norm_obs=False, norm_reward=True),
        policy_kwargs=dict(
            activation_fn=torch.nn.modules.activation.Tanh,
            net_arch=[{"pi": [64, 64], "vf": [64, 64]}],
            features_extractor_class=imitation.policies.base.NormalizeFeaturesExtractor,
        ),
...
    ),

You can find the hyperparameters for PPO here and for SAC here.

To verify that the issue is resolved, I ran experiments/rollouts_from_policies.sh again and there were no more such errors.

ernestum avatar Jan 02 '23 10:01 ernestum

I re-trained the experts for all the above-mentioned envs (PPO and SAC where applicable).

Thanks for resolving this Max!

AdamGleave avatar Jan 02 '23 21:01 AdamGleave