HuggingFace models outdated
Bug description
Attempting to load SB3 models from HuggingFace in serialize.py often raises a FileExistsError that tells us "Outdated policy format: we do not support restoring normalization statistics from '{vec_normalize_path}'". This happens with the following environments:
- seals/Ant-v0
- seals/HalfCheetah-v0
- seals/Hopper-v0
- seals/Humanoid-v0
- seals/MountainCar-v0
- seals/Swimmer-v0
- seals/Walker2d-v0
The solution is probably to retrain the experts for these environments and re-upload them.
[EDIT: ran into this problem running experiments/rollouts_from_policies.sh
in #572.]
Oh just realized that error should be an f-string, will make a quick PR for that. [EDIT: said PR is #577]
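For context, the placeholder appears literally in the message because the string is missing its f prefix; a minimal sketch of the fix (the real code lives in imitation/policies/serialize.py, and the exact wording of #577 may differ):

# Before: the raised message contains the literal text '{vec_normalize_path}'.
# After (sketch): interpolate the offending path into the error message.
vec_normalize_path = "/path/to/vec_normalize.pkl"  # placeholder for illustration
raise FileExistsError(
    "Outdated policy format: we do not support restoring normalization "
    f"statistics from '{vec_normalize_path}'"
)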
We removed the logic to save normalization statistics long before we added support for Hugging Face, so I'm confused by this. I suspect the error message may be misleading in some way, we should figure out what the root cause is before retraining. What's a minimal command to reproduce this error?
I can't replicate this error:
$ python -m imitation.scripts.eval_policy with seals_ant
INFO - eval_policy - Running command 'eval_policy'
INFO - eval_policy - Started run with ID "177"
INFO - imitation.scripts.common.common - Logging to output/eval_policy/seals_Ant-v0/20221010_123805_9cf5ab
Downloading: 100%|██████████| 323k/323k [00:00<00:00, 1.22MB/s]
INFO - root - Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from '/home/adam/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/ppo-seals-Ant-v0.zip'
INFO - eval_policy - Result: {'n_traj': 16, 'monitor_return_len': 16, 'return_min': -388.0332974897676, 'return_mean': -305.96807839663273, 'return_std': 125.22417901084856, 'return_max': 140.2589728364431, 'len_min': 1000, 'len_mean': 1000.0, 'len_std': 0.0, 'len_max': 1000, 'monitor_return_min': -388.033297, 'monitor_return_mean': -305.96807843749997, 'monitor_return_std': 125.22417901660532, 'monitor_return_max': 140.258973}
INFO - eval_policy - Completed after 0:00:14
/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:179: UserWarning: tee_stdout.wait timeout. Forcibly terminating.
warnings.warn("tee_stdout.wait timeout. Forcibly terminating.")
/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:185: UserWarning: tee_stderr.wait timeout. Forcibly terminating.
warnings.warn("tee_stderr.wait timeout. Forcibly terminating.")
The command completes successfully on my machine. experiments/rollouts_from_policies.sh
also seems to work:
(venv) adam@puffin:~/dev/imitation$ CUDA_VISIBLE_DEVICES="" ./experiments/rollouts_from_policies.sh
Writing logs in output/train_experts/2022-10-10T12:46:31-07:00, and saving rollouts in output/train_experts/2022-10-10T12:46:31-07:00/expert_models/*/rollouts/
(venv) adam@puffin:~/dev/imitation$ ls -la output/train_experts/2022-10-10T12:46:31-07:00/expert_models/*/rollouts/
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_ant_0/rollouts/':
total 16180
drwxrwxr-x 2 adam adam 4096 Oct 10 12:47 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:47 ..
-rw-rw-r-- 1 adam adam 16556628 Oct 10 12:47 final.pkl
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_cartpole_0/rollouts/':
total 344
drwxrwxr-x 2 adam adam 4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 341027 Oct 10 12:46 final.pkl
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_half_cheetah_0/rollouts/':
total 8948
drwxrwxr-x 2 adam adam 4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 9153145 Oct 10 12:46 final.pkl
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_hopper_0/rollouts/':
total 5748
drwxrwxr-x 2 adam adam 4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 5876381 Oct 10 12:46 final.pkl
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_humanoid_0/rollouts/':
total 493756
drwxrwxr-x 2 adam adam 4096 Oct 10 12:49 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:48 ..
-rw-rw-r-- 1 adam adam 505593605 Oct 10 12:49 final.pkl
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_mountain_car_0/rollouts/':
total 88
drwxrwxr-x 2 adam adam 4096 Oct 10 12:46 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:46 ..
-rw-rw-r-- 1 adam adam 78978 Oct 10 12:46 final.pkl
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_swimmer_0/rollouts/':
total 8544
drwxrwxr-x 2 adam adam 4096 Oct 10 12:47 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:47 ..
-rw-rw-r-- 1 adam adam 8739476 Oct 10 12:47 final.pkl
'output/train_experts/2022-10-10T12:46:31-07:00/expert_models/seals_walker_0/rollouts/':
total 6828
drwxrwxr-x 2 adam adam 4096 Oct 10 12:47 .
drwxrwxr-x 3 adam adam 4096 Oct 10 12:47 ..
-rw-rw-r-- 1 adam adam 6981235 Oct 10 12:47 final.pkl
Here is the result of running it locally on a fresh clone of imitation:
[15:56:30] daniel@sanrensei:~/imitation$ python -m imitation.scripts.eval_policy with seals_ant
INFO - eval_policy - Running command 'eval_policy'
INFO - eval_policy - Started run with ID "2"
INFO - imitation.scripts.common.common - Logging to /home/daniel/imitation/output/eval_policy/seals_Ant-v0/20221011_155637_7b63fc
INFO - root - Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from '/home/daniel/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/ppo-seals-Ant-v0.zip'
/home/daniel/.pyenv/versions/imitation_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:179: UserWarning: tee_stdout.wait timeout. Forcibly terminating.
warnings.warn("tee_stdout.wait timeout. Forcibly terminating.")
/home/daniel/.pyenv/versions/imitation_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:185: UserWarning: tee_stderr.wait timeout. Forcibly terminating.
warnings.warn("tee_stderr.wait timeout. Forcibly terminating.")
ERROR - eval_policy - Failed after 0:00:09!
Traceback (most recent calls WITHOUT Sacred internals):
File "/home/daniel/imitation/src/imitation/scripts/eval_policy.py", line 100, in eval_policy
expert.get_expert_policy(venv),
File "/home/daniel/imitation/src/imitation/scripts/common/expert.py", line 45, in get_expert_policy
return serialize.load_policy(policy_type, venv, **loader_kwargs)
File "/home/daniel/imitation/src/imitation/policies/serialize.py", line 180, in load_policy
return agent_loader(venv, **kwargs)
File "/home/daniel/imitation/src/imitation/policies/serialize.py", line 118, in f
model = load_stable_baselines_model(cls, filename, venv)
File "/home/daniel/imitation/src/imitation/policies/serialize.py", line 67, in load_stable_baselines_model
raise FileExistsError(
FileExistsError: Outdated policy format: we do not support restoring normalization statistics from '/home/daniel/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/vec_normalize.pkl'
It does seem to run on a remote server though, so now I'm confused.
(fresh_venv) daniel@rnn:~/imitation$ python -m imitation.scripts.eval_policy with seals_ant
INFO - eval_policy - Running command 'eval_policy'
INFO - eval_policy - Started run with ID "21"
INFO - imitation.scripts.common.common - Logging to /home/daniel/imitation/output/eval_policy/seals_Ant-v0/20221011_071510_1ac016
INFO - root - Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from '/home/daniel/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/ppo-seals-Ant-v0.zip'
INFO - eval_policy - Result: {'n_traj': 16, 'monitor_return_len': 16, 'return_min': -487.502474231496, 'return_mean': -374.8772107723473, 'return_std': 109.11836084542307, 'return_max': -96.24885182454736, 'len_min': 1000, 'len_mean': 1000.0, 'len_std': 0.0, 'len_max': 1000, 'monitor_return_min': -487.502474, 'monitor_return_mean': -374.87721075, 'monitor_return_std': 109.11836078623155, 'monitor_return_max': -96.248852}
INFO - eval_policy - Completed after 0:00:35
/home/daniel/imitation/fresh_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:179: UserWarning: tee_stdout.wait timeout. Forcibly terminating.
warnings.warn("tee_stdout.wait timeout. Forcibly terminating.")
/home/daniel/imitation/fresh_venv/lib/python3.8/site-packages/sacred/stdout_capturing.py:185: UserWarning: tee_stderr.wait timeout. Forcibly terminating.
warnings.warn("tee_stderr.wait timeout. Forcibly terminating.")
Adding to the mystery, there is a file named vec_normalize.pkl in the HuggingFace repo.
So, it looks like the core issue is that on my local machine, the cited directory really does contain a vec_normalize.pkl file:
[16:46:30] daniel@sanrensei:~/imitation$ ls ~/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/
args.yml env_kwargs.yml train_eval_metrics.zip
config.yml ppo-seals-Ant-v0.zip vec_normalize.pkl
but not so on RNN:
(fresh_venv) daniel@rnn:~/imitation$ ls ~/.cache/huggingface/hub/models--HumanCompatibleAI--ppo-seals-Ant-v0/snapshots/43aec0ff92a5a0a0b415c7d7d061abd746e67af1/
ppo-seals-Ant-v0.zip
No idea why this would be different.
Is the huggingface-sb3
package the same version?
Yes - pip show huggingface-sb3
says it's version 2.2.3 on both.
Hm, well it does indeed appear that vec_normalize.pkl is present in the HuggingFace repo: https://huggingface.co/HumanCompatibleAI/ppo-seals-Ant-v0/tree/main
I'm frankly a bit confused by this. Searching for vec_normalize.pkl in rl-baselines3-zoo doesn't find anything, so it doesn't seem to be saved by that codebase, which is what I thought was used to generate these experts. It was used in old versions of train_rl, but it's been deprecated for a while now.
@ernestum any idea what's going on here? I'm a bit worried that the experts aren't going to be normalized properly when we restore them, given that imitation handles normalization differently from SB3 upstream. We might want to switch to using train_rl for the experts? I know your preference was to unify things, and I think araffin and others have wanted to stick with VecNormalize for SB3 (which makes sense), but it's an utter nightmare to use in our codebase, so I really don't want to go back to that.
The vec_normalize.pkl
file gets downloaded when we load the trained agent using the rl-zoo library. Imitation only downloads the model zip file. So to replicate the error, first load the agent using the rl-zoo library and then run the eval_policy script.
python3 -m utils.load_from_hub --algo ppo --env seals-Ant-v0 -f logs/ -orga HumanCompatibleAI
python3 -m imitation.scripts.eval_policy with seals_ant
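For reference, a hedged sketch of why imitation never pulls the statistics on its own: it fetches only the model zip from the Hub (this assumes the download in serialize.py goes through huggingface_sb3.load_from_hub; the repo and file names are the ones from this issue):

# Sketch only: a single file is downloaded into the HF cache, so
# vec_normalize.pkl only appears in the snapshot directory if some other
# tool (e.g. the rl-zoo load script above) fetched it first.
from huggingface_sb3 import load_from_hub

zip_path = load_from_hub(
    repo_id="HumanCompatibleAI/ppo-seals-Ant-v0",
    filename="ppo-seals-Ant-v0.zip",
)
print(zip_path)  # .../snapshots/<revision>/ppo-seals-Ant-v0.zip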
Thanks @taufeeque9! Unfortunately, although just ignoring vec_normalize.pkl avoids a runtime error, there's still the issue that imitation then isn't using the normalization stats at all. I think we need to retrain the experts to use NormalizeFeaturesExtractor rather than the deprecated VecNormalize approach. Assigning to @ernestum.
The rl-zoo uses a VecNormalize environment wrapper to normalize observations as well as rewards. Whenever a model is loaded using the zoo scripts (such as to continue training or to evaluate the model), the appropriate VecNormalize wrapper is added to the environment.
The imitation library does not do this when loading a model.
Therefore, when imitation loads a model that was pushed to HuggingFace using rl-zoo, the model sees unnormalized observations and rewards.
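For concreteness, here is a hedged sketch of that load-time difference; it is not taken from either codebase, and the stats path and environment construction are purely illustrative:

# What the rl-zoo effectively does when evaluating a model (sketch):
import gym
import seals  # noqa: F401 -- registers the seals/* environments (assumed installed)
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

venv = DummyVecEnv([lambda: gym.make("seals/Ant-v0")])
venv = VecNormalize.load("logs/.../vec_normalize.pkl", venv)  # restore running stats
venv.training = False     # freeze the statistics during evaluation
venv.norm_reward = False  # report raw episode returns

# imitation, by contrast, loads only the model zip and adds no wrapper,
# so the policy receives unnormalized observations.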
The rewards would only be relevant if we were to continue training with our train_rl.py
script which is not one of our use-cases I guess. Can you confirm this @AdamGleave ?
The missing observation normalization is more severe since the model will obviously perform worse when it sees the unprocessed observations.
We use an imitation.policies.base.NormalizeFeaturesExtractor to do observation normalization, while the sb3-zoo uses the VecNormalize environment wrapper for that. So in theory, changing the hyperparameters in the zoo to something like:
normalize: dict(norm_obs=False, norm_reward=True)
policy_kwargs: dict(features_extractor_class=imitation.policies.base.NormalizeFeaturesExtractor)
should solve the issue with observation normalization for us.
However, the sb3-zoo won't find our NormalizeFeaturesExtractor because it cannot auto-import it.
What works, however, is implementing a custom policy in imitation.policies.base:
class MlpPolicyWithNormalizeFeaturesExtractor(policies.ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs, features_extractor_class=NormalizeFeaturesExtractor)
and then setting the hyperparameters to
normalize: dict(norm_obs=False, norm_reward=True)
policy: 'imitation.policies.base.MlpPolicyWithNormalizeFeaturesExtractor'
This will store the running mean and variance inside the model and automatically normalize the observations before feeding them into the model.
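Assuming such an expert has been uploaded, loading it should then work with plain SB3 loading and no wrapper; a hypothetical usage sketch (the zip path is just the file name from this thread):

import gym
import seals  # noqa: F401 -- registers the seals/* environments (assumed installed)
from stable_baselines3 import PPO

# No VecNormalize wrapper needed: the running mean/variance are part of the
# policy's features extractor and are restored together with the model zip.
model = PPO.load("ppo-seals-Ant-v0.zip")
env = gym.make("seals/Ant-v0")
obs = env.reset()
action, _ = model.predict(obs, deterministic=True)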
The rewards would only be relevant if we were to continue training with our
train_rl.py
script which is not one of our use-cases I guess. Can you confirm this @AdamGleave ?
Yeah don't think we need to worry about this use case for now, and can always workaround it by just re-learning the reward normalization statistics on the fly in train_rl
if we do need it in the future.
The missing observation normalization is more severe since the model will obviously perform worse when it sees the unprocessed observations.
Yep, that'll break policies!
should solve the issue with observation normalization for us. However, the sb3-zoo won't find our NormalizeFeaturesExtractor because it cannot auto-import it. What works, however, is implementing a custom policy in imitation.policies.base:
class MlpPolicyWithNormalizeFeaturesExtractor(policies.ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs, features_extractor_class=NormalizeFeaturesExtractor)
and then setting the hyperparameters to
normalize: dict(norm_obs=False, norm_reward=True)
policy: 'imitation.policies.base.MlpPolicyWithNormalizeFeaturesExtractor'
This is hacky but seems like a viable workaround; it's certainly much better than us replicating Zoo functionality ourselves.
Are we always using MlpPolicy? If not, perhaps we want something more like a FactoryPolicyWithNormalizeFeaturesExtractor that also takes the policy_class.
I'd advocate against putting this in imitation.policies.base
since it's quite a Zoo-specific hack, but you could make a new module for it like imitation.policies.sb3_zoo
or the like.
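For illustration, one possible reading of that factory idea as a hedged sketch (the function name is hypothetical and not existing imitation API; it could live in a module like the suggested imitation.policies.sb3_zoo):

from typing import Type

from stable_baselines3.common import policies

from imitation.policies.base import NormalizeFeaturesExtractor


def policy_with_normalize_features_extractor(
    policy_class: Type[policies.ActorCriticPolicy],
) -> Type[policies.ActorCriticPolicy]:
    """Return a subclass of `policy_class` that always uses NormalizeFeaturesExtractor."""

    class _NormalizingPolicy(policy_class):
        def __init__(self, *args, **kwargs):
            super().__init__(
                *args, **kwargs, features_extractor_class=NormalizeFeaturesExtractor
            )

    return _NormalizingPolicy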
Thanks for looking into it and figuring out a solution!
I found a better solution: add a feature to use Python modules as configs instead of YAML files. That lets us solve all of this without any hacks like factory policies and the like. I have nearly finished this in https://github.com/DLR-RM/rl-baselines3-zoo/pull/318. I will add an example of how to use that new feature here soon.
I re-trained the experts for all the above-mentioned envs (PPO and SAC where applicable).
We can now specify the normalization like this:
"seals/MountainCar-v0": dict(
normalize=dict(norm_obs=False, norm_reward=True),
policy_kwargs=dict(
activation_fn=torch.nn.modules.activation.Tanh,
net_arch=[{"pi": [64, 64], "vf": [64, 64]}],
features_extractor_class=imitation.policies.base.NormalizeFeaturesExtractor,
),
...
),
You can find the hyperparameters for PPO here and for SAC here.
To verify that the issue is resolved, I ran experiments/rollouts_from_policies.sh again and there were no more such warnings.
I re-trained the experts for all the above-mentioned envs (PPO and SAC where applicable).
Thanks for resolving this Max!