
Load expert models for testing from huggingface hub

Open ernestum opened this issue 2 years ago • 13 comments

This is a draft until https://github.com/DLR-RM/rl-baselines3-zoo/pull/257 is resolved.

ernestum avatar Jun 07 '22 14:06 ernestum

I am a bit stuck right now and hope to get some input. Maybe @AdamGleave or @araffin can help?

The first issue is with the helpers in imitation.policies.serialize. They impose a format in which the model is always stored in a file called model.zip, which is not compatible with the format imposed by HuggingFace. I can either work around this in the tests by moving and renaming files (ugly) or try to modify/remove the imitation.policies.serialize module. The thing is that I don't know who is using imitation.policies.serialize outside of the imitation library. Any preferences here, @AdamGleave?

Another problem I have is in the preference comparisons implementation. Its test fails because, even though I provide a venv with 2 envs upon construction/loading, for some reason it expects an 8-env venv during execution. I could not figure out why yet. It is really confusing and frustrating.

ernestum avatar Jun 21 '22 15:06 ernestum

The first issue is with the helpers in imitation.policies.serialize. They impose a format in which the model is always stored in a file called model.zip, which is not compatible with the format imposed by HuggingFace. I can either work around this in the tests by moving and renaming files (ugly) or try to modify/remove the imitation.policies.serialize module. The thing is that I don't know who is using imitation.policies.serialize outside of the imitation library. Any preferences here, @AdamGleave?

In general I'm not too concerned about making breaking changes, as imitation is not yet at a v1 release; it's more important that we move towards a good API than that we keep consistency.

How are you intending for the HuggingFace loading to work? serialize.py maintains a registry of loaders, serialize.policy_registry, so one way would be to register a new loader hugging_face (or perhaps hugging_face_ppo, hugging_face_sac, etc.) and have the path be the ID of the HF model. In that case I don't think there's even a need to use load_stable_baselines_model -- it's not doing much anyway.
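For illustration, the registry route could look roughly like the sketch below. The loader body uses huggingface_sb3's load_from_hub plus the algorithm's normal .load(); the registration call at the end is hypothetical (the exact policy_registry interface may differ), and the zip filename convention follows the rl-baselines3-zoo naming seen later in this thread.

from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecEnv


def load_hugging_face_ppo_policy(path: str, venv: VecEnv):
    """Load a PPO policy where `path` is a HF repo id, e.g. "ernestumorga/ppo-CartPole-v1"."""
    # rl-baselines3-zoo repos store the checkpoint as "<repo-name>.zip" (assumed here).
    filename = path.split("/")[-1] + ".zip"
    checkpoint = load_from_hub(repo_id=path, filename=filename)
    return PPO.load(checkpoint, env=venv).policy


# Hypothetical registration of the new loader type; the real registry API may differ:
# policy_registry.register("hugging_face_ppo", value=load_hugging_face_ppo_policy)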

To clarify, we do want to at least maintain the option of loading things from local disk, so I wouldn't get rid of imitation.policies.serialize in favor of HuggingFace, although if there's a way to simplify the serialize code that'd be great.

Another problem I have is in the preference comparisons implementation. Its test fails because, even though I provide a venv with 2 envs upon construction/loading, for some reason it expects an 8-env venv during execution. I could not figure out why yet. It is really confusing and frustrating.

What command can I run to replicate this failure? In CircleCI logs I see some scripts failing, but nothing that seemed preference comparison specific.

I've seen policies and reward networks sometimes have the number of envs baked into their expected observation/action shape in the past, although I thought that was no longer an issue with SB3 & PyTorch imitation.

AdamGleave avatar Jun 21 '22 22:06 AdamGleave

What command can I run to replicate this failure? In CircleCI logs I see some scripts failing, but nothing that seemed preference comparison specific.

Seems like the CI tests timed out. I can reproduce it locally using:

pytest tests/test_scripts.py::test_train_preference_comparisons_main 

ernestum avatar Jun 22 '22 17:06 ernestum

I see, it makes sense to integrate the HuggingFace loading logic into serialize.py, especially if it is OK to make breaking changes. When loading from HuggingFace we pull the model zip file to disk and then use the algorithm's normal .load() method. This is quite similar to how we load in serialize.py, so with some refactoring I should be able to find a satisfying solution with no duplicated code.
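Concretely, the flow is a two-step "download, then load" (a minimal sketch; the repo id and filename follow the example used elsewhere in this thread, and huggingface_hub caches the download locally so repeated runs don't re-fetch):

from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

venv = make_vec_env("CartPole-v1", n_envs=2)

# Step 1: pull the model zip from the Hub to local disk (cached by huggingface_hub).
checkpoint = load_from_hub(
    repo_id="ernestumorga/ppo-CartPole-v1",
    filename="ppo-CartPole-v1.zip",
)

# Step 2: hand the local path to the algorithm's normal .load(), as for any local file.
expert = PPO.load(checkpoint, env=venv)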

ernestum avatar Jun 22 '22 17:06 ernestum

Concerning demonstrations: I agree we still want to load demonstrations from disk. Right now there is no canonical way to upload RL demonstrations to the HuggingFace Hub, but there is the notion of datasets on the Hub, so I hope to integrate that feature into huggingface_sb3 soon. In the meantime I will keep the option to load demonstrations from disk, or have them created ad hoc from a model pulled from the Hub.
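For the "created ad hoc" option, the rough shape could be to roll out the downloaded expert with imitation's rollout helpers (a sketch; rollout.rollout and make_sample_until exist in imitation, but their exact signatures have shifted between versions, e.g. newer releases also take an rng argument):

from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from imitation.data import rollout

venv = make_vec_env("CartPole-v1", n_envs=2)
expert = PPO.load(
    load_from_hub("ernestumorga/ppo-CartPole-v1", "ppo-CartPole-v1.zip"),
    env=venv,
)

# Generate demonstrations on the fly instead of reading a rollouts file from disk.
trajectories = rollout.rollout(
    expert,
    venv,
    rollout.make_sample_until(min_episodes=10),
)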

ernestum avatar Jun 22 '22 17:06 ernestum

Concerning demonstrations: I agree we still want to load demonstrations from disk. Right now there is no canonical way to upload RL demonstrations to the HuggingFace Hub, but there is the notion of datasets on the Hub, so I hope to integrate that feature into huggingface_sb3 soon. In the meantime I will keep the option to load demonstrations from disk, or have them created ad hoc from a model pulled from the Hub.

Being able to download demos from HuggingFace would be cool, but even so I think it's good to keep the ability to load from disk. I don't want to force users to use a third-party vendor, and there are some scenarios where data might be sensitive and people might not want to store it in the cloud.

AdamGleave avatar Jun 22 '22 18:06 AdamGleave

I've seen policies and reward networks sometimes have the number of envs baked into their expected observation/action shape in the past, although I thought that was no longer an issue with SB3 & PyTorch imitation.

I encountered that issue with SB3+pytorch as well a while ago but never got around to debugging it. I just used 8 envs for the test as a workaround, see https://github.com/HumanCompatibleAI/imitation/blob/301e7a5ca27ce069242fe7e4426feffa0089fa12/tests/test_scripts.py#L89-L92

@ernestum Looks like this PR removes that part of the test config; I'd guess that's what causes the error? Though of course it would be nice to fix the underlying bug instead.

ejnnr avatar Jun 22 '22 18:06 ejnnr

What command can I run to replicate this failure? In CircleCI logs I see some scripts failing, but nothing that seemed preference comparison specific.

Seems like the CI tests timed out.

Worth looking into what's causing the timeouts as well. Hopefully it's not an actual slowdown from downloading stuff from HF? (If so, we could always cache it.)

I can reproduce it locally using:

pytest tests/test_scripts.py::test_train_preference_comparisons_main 

OK, I can replicate this, and also with python -m imitation.scripts.train_preference_comparisons with load_expert=True common.num_vec=2.

The issue appears to be that self.algorithm.rollout_buffer.n_envs is 8, even though self.algorithm.n_envs and the actual number of environments are 2. I assume PPO.load is restoring the rollout buffer. This seems like an issue that needs to be fixed upstream.
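For reference, a way to poke at the mismatch outside the test harness might be something like this (a diagnostic sketch; "expert.zip" is a placeholder for a checkpoint trained with 8 envs, and in a plain SB3 setup the two values may well agree, so this just makes the reported discrepancy visible):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

venv = make_vec_env("CartPole-v1", n_envs=2)
# "expert.zip" is a placeholder for a checkpoint that was trained with 8 envs.
model = PPO.load("expert.zip", env=venv)

print(model.n_envs)                 # 2: taken from the venv passed at load time
print(model.rollout_buffer.n_envs)  # reported as 8 in the failing test above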

Separately, I will note that using a pre-trained agent with preference comparisons is a bit of an edge case. Seems nice to have support for it for debugging, or for building things like learning from demonstrations and human preferences (IRL followed by preference comparison), but it's not the vanilla algorithm.

AdamGleave avatar Jun 22 '22 19:06 AdamGleave

Right now the remaining failing tests are test_scripts.py::test_analyze_imitation and test_scripts.py::test_convert_trajs_in_place. Before I put time into understanding and debugging those, I wanted to verify that they are still used.

test_convert_trajs_in_place seems to fail since the analyze script returns no results, which is again due to the _gather_sacred_dicts helper function not returning anything. @AdamGleave do you have a hint for me where to look for the issue before I dive into this on my own?

ernestum avatar Jul 06 '22 14:07 ernestum

What was it initially supposed to do? It seems to abort the script when some log files don't include the term "Result". But what is the consequence? What does it mean if the check fails and what are the required actions? Is it safe to remove this section?

When our training scripts complete successfully, they output a line in the logs like (from running python -m imitation.scripts.train_adversarial gail with fast rl.fast train.fast common.fast):

INFO - train_adversarial - Result: {'imit_stats': {'n_traj': 2, 'monitor_return_len': 2, 'return_min': 3.2304581999778748, 'return_mean': 3.4182530641555786, 'return_std': 0.18779486417770386, 'return_max': 3.6060479283332825, 'len_min': 5, 'len_mean': 5.0, 'len_std': 0.0, 'len_max': 5, 'monitor_return_min': 5.0, 'monitor_return_mean': 5.0, 'monitor_return_std': 0.0, 'monitor_return_max': 5.0}, 'expert_stats': {'n_traj': 40, 'return_min': 500.0, 'return_mean': 500.0, 'return_std': 0.0, 'return_max': 500.0, 'len_min': 500, 'len_mean': 500.0, 'len_std': 0.0, 'len_max': 500}}

This is output by Sacred automatically, from the return value of the command executed (in this case, the function train_adversarial).
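That is just Sacred's standard handling of a command's return value; a minimal standalone example (experiment name and stats are illustrative):

from sacred import Experiment

ex = Experiment("train_adversarial_demo")


@ex.main
def train():
    # Whatever the command returns becomes the run's result; on completion Sacred
    # logs it as "INFO - train_adversarial_demo - Result: {...}".
    return {"imit_stats": {"n_traj": 2, "return_mean": 3.42}}


if __name__ == "__main__":
    ex.run_commandline()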

So, the lines you removed in https://github.com/HumanCompatibleAI/imitation/commit/bfa8831aa56c85185bb8e5999b48f1d9906c23ed were going through all the log directories and searching for those results. It's not so much a check as a way of printing out the relevant results -- although if it fails, that does indeed suggest something's off.
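In spirit, the removed code was doing something like the sketch below: walking the output directories for Sacred's run.json files and collecting each run's result dict (the real _gather_sacred_dicts helper may differ in its details):

import json
import pathlib


def gather_results(log_dir: str):
    """Collect the `result` field from every Sacred run.json found under log_dir."""
    results = []
    for run_json in pathlib.Path(log_dir).glob("**/run.json"):
        run = json.loads(run_json.read_text())
        # Failed or interrupted runs have no result; those are the "something's off" cases.
        if run.get("result") is not None:
            results.append((str(run_json), run["result"]))
    return results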

Right now the remaining failing tests are test_scripts.py::test_analyze_imitation and test_scripts.py::test_convert_trajs_in_place. Before I put time into understanding and debugging those, I wanted to verify that they are still used.

convert_trajs_in_place still seems useful, although we should update the documentation. We switched to an npz-style format for trajectories in https://github.com/HumanCompatibleAI/imitation/pull/448, so having a script to convert from the old-style pickle format is handy.

analyze produces CSV/TeX tables of results from benchmarking. I've never been that happy with how it works, so I'm happy to see it replaced, but it's better than nothing so I wouldn't want to delete it without a plan to improve on it.

test_convert_trajs_in_place seems to fail since the analyze script returns no results, which is again due to the _gather_sacred_dicts helper function not returning anything. @AdamGleave do you have a hint for me where to look for the issue before I dive into this on my own?

I tried to run experiments/imit_benchmark.sh --fast on your branch to see what happened. As far as I can tell, the problem is just that the scripts it's running fail. For example, output/imit_benchmark/2022-07-07T16\:39\:16-07\:00/cartpole_0/n_expert_demos_1/sacred/run.json (the timestamp will be different on your machine, but otherwise the logs should still exist) shows this traceback:

  "fail_trace": [
    "Traceback (most recent call last):\n",
    "  File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/config/captured_function.py\", line 42, in captured_function\n    result = wrapped(*args, **kwargs)\n",
    "  File \"/home/adam/dev/imitation/src/imitation/scripts/train_adversarial.py\", line 155, in gail\n    return train_adversarial(algo_cls=gail_algo.GAIL)\n",
    "  File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/config/captured_function.py\", line 42, in captured_function\n    result = wrapped(*args, **kwargs)\n",
    "  File \"/home/adam/dev/imitation/src/imitation/scripts/train_adversarial.py\", line 112, in train_adversarial\n    expert_policy = load_expert_policy(env_name)\n",
    "  File \"/home/adam/dev/imitation/src/imitation/scripts/common/experts.py\", line 15, in load_expert_policy\n    load_from_hub(\n",
    "  File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/huggingface_sb3/load_from_hub.py\", line 16, in load_from_hub\n    downloaded_model_file = hf_hub_download(\n",
    "  File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/huggingface_hub/file_download.py\", line 1020, in hf_hub_download\n    _raise_for_status(r)\n",
    "  File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py\", line 78, in _raise_for_status\n    raise RepositoryNotFoundError(\n",
    "huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error: Repository Not Found for url: https://huggingface.co/ernestumorga/ppo-CartPole-v1/resolve/main/ppo-CartPole-v1.zip. If the repo is private, make sure you are authenticated. (Request ID: N4BPRkJkdBtp8hwuZOKtG)\n"
  ],

But perhaps this would work if I was logged in as you...? Still, that log directory seems like a good place to start.
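If it is an access or naming problem rather than an imitation bug, it can be isolated from imitation entirely by reproducing just the download step from the traceback (run huggingface-cli login first if the repo is private):

from huggingface_hub import hf_hub_download

# Same repo id and filename as in the traceback above; raises
# RepositoryNotFoundError if the repo is missing or not visible to this account.
path = hf_hub_download(
    repo_id="ernestumorga/ppo-CartPole-v1",
    filename="ppo-CartPole-v1.zip",
)
print(path)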

AdamGleave avatar Jul 07 '22 23:07 AdamGleave

I can easily make the test_convert_trajs_in_place case pass by adding back the demonstrations (final.pkl) in the testdata folder. However, I noticed that the final.pkl on the current master branch already seems to be in the new format; at least the diff between that file and the one created in the temp directory during testing indicates so. Therefore the test is pointless unless we get hold of such a trajectory in the old format. Do you know where to get one, @AdamGleave?

ernestum avatar Jul 08 '22 12:07 ernestum

I can easily make the test_convert_trajs_in_place case pass by adding back the demonstrations (final.pkl) in the testdata folder. However, I noticed that the final.pkl on the current master branch already seems to be in the new format; at least the diff between that file and the one created in the temp directory during testing indicates so. Therefore the test is pointless unless we get hold of such a trajectory in the old format. Do you know where to get one, @AdamGleave?

Thanks for flagging this. I think I removed the backwards-compatibility logic in load in https://github.com/HumanCompatibleAI/imitation/pull/337 but forgot to remove the script. However, this script convert_trajs_in_place.py has a new purpose now that https://github.com/HumanCompatibleAI/imitation/pull/448 is merged. On master it does produce different output:

$ python -m imitation.scripts.convert_trajs_in_place tests/testdata/expert_models/cartpole_0/rollouts/final.pkl 
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   tests/testdata/expert_models/cartpole_0/rollouts/final.pkl
$ file tests/testdata/expert_models/cartpole_0/rollouts/final.pkl  # note it was pkl before
tests/testdata/expert_models/cartpole_0/rollouts/final.pkl: Zip archive data, at least v2.0 to extract, compression method=deflate

That said, the documentation in the script is out of date, and it should probably rename pkl->zip to avoid confusion. I'll make a PR addressing that.

Note this branch is quite a bit behind master at this point; you may want to merge master in so you can see these changes...

AdamGleave avatar Jul 08 '22 23:07 AdamGleave

Codecov Report

Merging #445 (c9dcd73) into master (dad9a30) will increase coverage by 0.07%. The diff coverage is 94.16%.

@@            Coverage Diff             @@
##           master     #445      +/-   ##
==========================================
+ Coverage   97.08%   97.16%   +0.07%     
==========================================
  Files          84       85       +1     
  Lines        7658     7646      -12     
==========================================
- Hits         7435     7429       -6     
+ Misses        223      217       -6     
Impacted Files                                      Coverage Δ
src/imitation/data/rollout.py                       100.00% <ø> (ø)
src/imitation/scripts/parallel.py                    59.64% <ø> (-3.85%)
src/imitation/scripts/config/train_imitation.py      67.79% <50.00%> (-5.09%)
src/imitation/scripts/config/train_adversarial.py    70.88% <60.00%> (+6.78%)
src/imitation/util/registry.py                       90.69% <66.66%> (+1.80%)
src/imitation/algorithms/bc.py                      100.00% <100.00%> (ø)
src/imitation/policies/serialize.py                 100.00% <100.00%> (ø)
src/imitation/scripts/common/common.py               97.22% <100.00%> (ø)
src/imitation/scripts/common/demonstrations.py      100.00% <100.00%> (ø)
src/imitation/scripts/common/expert.py              100.00% <100.00%> (ø)
... and 15 more


codecov[bot] avatar Jul 30 '22 00:07 codecov[bot]

The remaining failing test is for benchmark_and_table.sh. It fails since it calls bc_benchmark.sh with the --paper option. This will try to train a seals_half_cheetah policy, which fails since mujoco was removed in #515. Now I am quite confused: why did this not make the pipeline fail earlier? Answer: after #515, the Docker image was not rebuilt even though the Dockerfile changed, so mujoco was/is still present in the (now stale) Docker image. But then: why is mujoco gone now? It must be, since the test fails with the usual

Exception: 
You appear to be missing MuJoCo.  We expected to find the file here: /root/.mujoco/mujoco210

This package only provides python bindings, the library must be installed separately.

Please follow the instructions on the README to install MuJoCo

    https://github.com/openai/mujoco-py#install-mujoco

Which can be downloaded from the website

    https://www.roboti.us/index.html

message.

ernestum avatar Sep 02 '22 12:09 ernestum

The remaining failing test is for benchmark_and_table.sh. It fails since it calls bc_benchmark.sh with the --paper option.

It looks like it was skipped before, because it was in the USES_FULL_ROLLOUTS tuple in test_experiments.py. You removed that check as part of the port, so it's no longer skipped.

I've pushed a commit that resolves this by having benchmark_and_table.sh no longer pass the --paper flag to bc_benchmark.sh and dagger_benchmark.sh when --fast is specified. I've also taken the liberty of removing mujoco-py from setup.py test requirements as I think it's no longer needed given this change (and will break on CI anyway as you observed).

I've also updated the Docker image on DockerHub now -- thanks for catching that.

AdamGleave avatar Sep 06 '22 01:09 AdamGleave

Not sure why the coverage suddenly dropped in a totally unrelated file. The naive solution, which would be just running a preference comparisons config with the active_selection flag set, fails for some reason. Before I dig deeper into this I am waiting for your thoughts, @AdamGleave.

ernestum avatar Sep 12 '22 11:09 ernestum

The coverage seems fine to me -- the only check that's failing is an optional coverage check on the patch. There seems to be a bunch of indirect coverage changes due to you changing the tests -- so some environment named configs that were previously covered no longer are. This seems harmless.

AdamGleave avatar Sep 12 '22 21:09 AdamGleave