Load expert models for testing from huggingface hub
This is a draft until https://github.com/DLR-RM/rl-baselines3-zoo/pull/257 is resolved.
I am a bit stuck right now and hope to get some input. Maybe @AdamGleave or @araffin can help?
The first issue is with the helpers in `imitation.policies.serialize`. They impose a format in which the model is always stored in a file called `model.zip`, which is not compatible with the format imposed by Hugging Face. I can either work around this in tests by moving files around and renaming them (ugly), or try to modify/remove the module `imitation.policies.serialize`. The thing is that I don't know who is using `imitation.policies.serialize` outside of the `imitation` library. Any preferences here @AdamGleave?
Another problem is in the preference comparison implementation. Its test fails because even though I provide a venv with 2 envs upon construction/loading, for some reason it expects an 8-env venv during execution. I could not figure out why yet; it is really confusing and frustrating.
> The first issue is with the helpers in `imitation.policies.serialize`. They impose a format in which the model is always stored in a file called `model.zip`, which is not compatible with the format imposed by Hugging Face. I can either work around this in tests by moving files around and renaming them (ugly), or try to modify/remove the module `imitation.policies.serialize`. The thing is that I don't know who is using `imitation.policies.serialize` outside of the `imitation` library. Any preferences here @AdamGleave?
In general I'm not too concerned about making breaking changes, as imitation is not yet at a v1 release; it's more important that we move towards a good API than to keep consistency.

How are you intending for the HuggingFace loading to work? `serialize.py` maintains a registry of loaders, `serialize.policy_registry`, so one way would just be to register a new loader `hugging_face` (or perhaps `hugging_face_ppo`, `hugging_face_sac`, etc.) and have the path be the ID of the HF model. In that case there's no need to even use `load_stable_baselines_model` I don't think -- it's not doing much, anyway.
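To make that concrete, here's a rough sketch of the kind of loader I mean -- assuming the registry exposes a `register(key, value=...)` method and loaders take `(path, venv)`, which I haven't double-checked; the loader name and the repo-id/filename convention are purely illustrative:

```python
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO

from imitation.policies import serialize


def load_hugging_face_ppo(path, venv):
    """Treat `path` as a HF repo id such as 'ernestumorga/ppo-CartPole-v1'."""
    # Download the model zip into the local HF cache, then reuse SB3's normal loading.
    zip_path = load_from_hub(repo_id=path, filename=path.split("/")[-1] + ".zip")
    return PPO.load(zip_path, env=venv).policy


# Register the loader so callers can refer to it by name like the existing ones.
serialize.policy_registry.register("hugging_face_ppo", value=load_hugging_face_ppo)
```

If we also want SAC experts, the same pattern with `SAC.load` would cover `hugging_face_sac`.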
To clarify, we do want to at least maintain the option of loading things from local disk, so I wouldn't get rid of `imitation.policies.serialize` in favor of Hugging Face, although if there's a way to simplify the serialize code that'd be great.
> Another problem is in the preference comparison implementation. Its test fails because even though I provide a venv with 2 envs upon construction/loading, for some reason it expects an 8-env venv during execution. I could not figure out why yet; it is really confusing and frustrating.
What command can I run to replicate this failure? In CircleCI logs I see some scripts failing, but nothing that seemed preference comparison specific.
I've seen policies and reward networks sometimes have the number of envs get baked into their expected observation/action shape in the past, although I thought that was no longer an issue with the SB3 & PyTorch version of imitation.
> What command can I run to replicate this failure? In CircleCI logs I see some scripts failing, but nothing that seemed preference comparison specific.
Seems like the CI tests timed out. I can reproduce it locally using:
```
pytest tests/test_scripts.py::test_train_preference_comparisons_main
```
I see, it makes sense to integrate the huggingface loading logic into `serialize.py`, especially if it is OK to make breaking changes.

When loading from huggingface we pull the model zip file to disk and then use the normal `.load()` method of the algorithm. This is quite similar to how we load in `serialize.py`, so with some refactoring I should be able to find a satisfying solution that has no duplicated code.
Concerning demonstrations: I agree we still want to be able to load demonstrations from disk. Right now there is no canonical way to upload RL demonstrations to the huggingface hub, but there is the notion of datasets on the hub, so I hope to integrate that feature into `huggingface_sb3` soon. In the meantime I will keep the option to load demonstrations from disk, or have them created ad hoc from a model that was pulled from the hub.
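For the ad-hoc case, a rough sketch of what I'm picturing (the repo id and filename follow the `huggingface_sb3` naming for the CartPole expert used in the tests; depending on the imitation version, `rollout.rollout` may also want an `rng` argument):

```python
import gym
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper

# A small venv; RolloutInfoWrapper lets rollout() reconstruct full trajectories.
venv = DummyVecEnv([lambda: RolloutInfoWrapper(gym.make("CartPole-v1"))] * 2)

# Pull the expert zip from the hub and load it with the algorithm's normal .load().
zip_path = load_from_hub(
    repo_id="ernestumorga/ppo-CartPole-v1", filename="ppo-CartPole-v1.zip"
)
expert = PPO.load(zip_path, env=venv)

# Generate demonstrations ad hoc instead of reading them from disk.
demos = rollout.rollout(
    expert,
    venv,
    sample_until=rollout.make_sample_until(min_episodes=10),
)
```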
> Concerning demonstrations: I agree we still want to be able to load demonstrations from disk. Right now there is no canonical way to upload RL demonstrations to the huggingface hub, but there is the notion of datasets on the hub, so I hope to integrate that feature into `huggingface_sb3` soon. In the meantime I will keep the option to load demonstrations from disk, or have them created ad hoc from a model that was pulled from the hub.
Being able to download demos from HuggingFace would be cool, but even so I think it's good to keep the ability to load from disk. I don't want to force users to use a third-party vendor, and there are some scenarios where data might be sensitive and people might not want to store it in the cloud.
> I've seen policies and reward networks sometimes have the number of envs get baked into their expected observation/action shape in the past, although I thought that was no longer an issue with the SB3 & PyTorch version of imitation.
I encountered that issue with SB3+pytorch as well a while ago but never got around to debugging it. I just used 8 envs for the test as a workaround, see https://github.com/HumanCompatibleAI/imitation/blob/301e7a5ca27ce069242fe7e4426feffa0089fa12/tests/test_scripts.py#L89-L92
@ernestum Looks like this PR removes that part of the test config; I'd guess that's what causes the error? Though of course it would be nice to fix the underlying bug instead.
> > What command can I run to replicate this failure? In CircleCI logs I see some scripts failing, but nothing that seemed preference comparison specific.
>
> Seems like the CI tests timed out.
Worth looking into what's causing the timeouts as well. Hopefully not an actual slow down from downloading stuff from HF? (If so we could always cache it.)
> I can reproduce it locally using `pytest tests/test_scripts.py::test_train_preference_comparisons_main`.
OK, I can replicate this, and also with `python -m imitation.scripts.train_preference_comparisons with load_expert=True common.num_vec=2`.
The issue appears to be that `self.algorithm.rollout_buffer.n_envs` is 8 even though `self.algorithm.n_envs` and the actual number of environments are 2. I assume `PPO.load` is restoring the rollout buffer. This seems like an issue that needs to be fixed upstream.
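If we don't want to wait for an upstream fix, one possible stop-gap would be to rebuild the buffer after loading. This is just a sketch, assuming the stale buffer is the only thing carrying the old env count, with `zip_path`/`venv` as in the loading snippets above:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.buffers import RolloutBuffer

model = PPO.load(zip_path, env=venv)
if model.rollout_buffer.n_envs != venv.num_envs:
    # Recreate the rollout buffer with the actual number of envs rather than the stale one.
    model.rollout_buffer = RolloutBuffer(
        model.n_steps,
        model.observation_space,
        model.action_space,
        device=model.device,
        gamma=model.gamma,
        gae_lambda=model.gae_lambda,
        n_envs=venv.num_envs,
    )
```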
Separately, I will note that using a pre-trained agent with preference comparisons is a bit of an edge case. Seems nice to have support for it for debugging, or for building things like learning from demonstrations and human preferences (IRL followed by preference comparison), but it's not the vanilla algorithm.
Right now the remaining failing tests are `test_scripts.py::test_analyze_imitation` and `test_scripts.py::test_convert_trajs_in_place`. Before I put time into understanding and debugging those, I wanted to verify that they are still used.

`test_convert_trajs_in_place` seems to fail since the analyze script returns no results, which is again due to the `_gather_sacred_dicts` helper function not returning anything. @AdamGleave do you have a hint for me where to look for the issue before I dive into this on my own?

About the check I removed: what was it initially supposed to do? It seems to abort the script when some log files don't include the term "Result". But what is the consequence? What does it mean if the check fails, and what are the required actions? Is it safe to remove this section?
When our training scripts complete successfully, they output a line in the logs like this (from running `python -m imitation.scripts.train_adversarial gail with fast rl.fast train.fast common.fast`):

```
INFO - train_adversarial - Result: {'imit_stats': {'n_traj': 2, 'monitor_return_len': 2, 'return_min': 3.2304581999778748, 'return_mean': 3.4182530641555786, 'return_std': 0.18779486417770386, 'return_max': 3.6060479283332825, 'len_min': 5, 'len_mean': 5.0, 'len_std': 0.0, 'len_max': 5, 'monitor_return_min': 5.0, 'monitor_return_mean': 5.0, 'monitor_return_std': 0.0, 'monitor_return_max': 5.0}, 'expert_stats': {'n_traj': 40, 'return_min': 500.0, 'return_mean': 500.0, 'return_std': 0.0, 'return_max': 500.0, 'len_min': 500, 'len_mean': 500.0, 'len_std': 0.0, 'len_max': 500}}
```

This is output by Sacred automatically, from the return value of the command executed (in this case, the function `train_adversarial`).
So, the lines you removed in https://github.com/HumanCompatibleAI/imitation/commit/bfa8831aa56c85185bb8e5999b48f1d9906c23ed were going through all the log directories and searching for the results. It's not a check so much as a way of printing out relevant results -- although if it fails, that does indeed suggest something's off.
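For reference, the idea behind those lines was roughly the following. This is illustrative only, not the actual `_gather_sacred_dicts` implementation, and it assumes the default `FileStorageObserver` layout where each run directory contains a `run.json` with a `result` field:

```python
import json
from pathlib import Path


def gather_results(log_dir):
    """Collect the Sacred `result` dict from every run.json under `log_dir`."""
    results = {}
    for run_json in Path(log_dir).glob("**/sacred/run.json"):
        with open(run_json) as f:
            run = json.load(f)
        # Sacred stores the command's return value under "result" for finished runs.
        if run.get("result") is not None:
            results[str(run_json.parent)] = run["result"]
    return results
```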
> Right now the remaining failing tests are `test_scripts.py::test_analyze_imitation` and `test_scripts.py::test_convert_trajs_in_place`. Before I put time into understanding and debugging those, I wanted to verify that they are still used.
`convert_trajs_in_place` seems useful still, although we should update the documentation. We switched to an npz-style format for trajectories in https://github.com/HumanCompatibleAI/imitation/pull/448, so having a script to convert from the old-style pickle format is handy.

`analyze` produces CSV/TeX tables of results from benchmarking. I've never been that happy with how it works, so I'm happy to see it replaced, but it's better than nothing, so I wouldn't want to delete it without a plan to improve on it.
> `test_convert_trajs_in_place` seems to fail since the analyze script returns no results, which is again due to the `_gather_sacred_dicts` helper function not returning anything. @AdamGleave do you have a hint for me where to look for the issue before I dive into this on my own?
I tried to run `experiments/imit_benchmark.sh --fast` on your branch to see what happened. As far as I can tell, the problem is just that the scripts it's running fail? For example, `output/imit_benchmark/2022-07-07T16\:39\:16-07\:00/cartpole_0/n_expert_demos_1/sacred/run.json` (the timestamp will be different on your machine, but otherwise the logs should still exist) shows this traceback:
"fail_trace": [
"Traceback (most recent call last):\n",
" File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/config/captured_function.py\", line 42, in captured_function\n result = wrapped(*args, **kwargs)\n",
" File \"/home/adam/dev/imitation/src/imitation/scripts/train_adversarial.py\", line 155, in gail\n return train_adversarial(algo_cls=gail_algo.GAIL)\n",
" File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/sacred/config/captured_function.py\", line 42, in captured_function\n result = wrapped(*args, **kwargs)\n",
" File \"/home/adam/dev/imitation/src/imitation/scripts/train_adversarial.py\", line 112, in train_adversarial\n expert_policy = load_expert_policy(env_name)\n",
" File \"/home/adam/dev/imitation/src/imitation/scripts/common/experts.py\", line 15, in load_expert_policy\n load_from_hub(\n",
" File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/huggingface_sb3/load_from_hub.py\", line 16, in load_from_hub\n downloaded_model_file = hf_hub_download(\n",
" File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/huggingface_hub/file_download.py\", line 1020, in hf_hub_download\n _raise_for_status(r)\n",
" File \"/home/adam/dev/imitation/venv/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py\", line 78, in _raise_for_status\n raise RepositoryNotFoundError(\n",
"huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error: Repository Not Found for url: https://huggingface.co/ernestumorga/ppo-CartPole-v1/resolve/main/ppo-CartPole-v1.zip. If the repo is private, make sure you are authenticated. (Request ID: N4BPRkJkdBtp8hwuZOKtG)\n"
],
But perhaps this would work if I was logged in as you...? Still, that log directory seems like a good place to start.
I can easily make the `test_convert_trajs_in_place` case pass by adding back the demonstrations (`final.pkl`) in the testdata folder. However, I figured out that the `final.pkl` in the current master branch already seems to be in the new format; at least the diff between that file and the one created in the temp directory during testing indicates so. Therefore the test is pointless unless we get hold of such a trajectory in the old format. Do you know where to get one, @AdamGleave?
> I can easily make the `test_convert_trajs_in_place` case pass by adding back the demonstrations (`final.pkl`) in the testdata folder. However, I figured out that the `final.pkl` in the current master branch already seems to be in the new format; at least the diff between that file and the one created in the temp directory during testing indicates so. Therefore the test is pointless unless we get hold of such a trajectory in the old format. Do you know where to get one, @AdamGleave?
Thanks for flagging this. I think I removed the backwards-compatibility logic in `load` in https://github.com/HumanCompatibleAI/imitation/pull/337 but forgot to remove the script. However, this script, `convert_trajs_in_place.py`, has a new purpose now that https://github.com/HumanCompatibleAI/imitation/pull/448 is merged. On `master` it does produce different output:
```
$ python -m imitation.scripts.convert_trajs_in_place tests/testdata/expert_models/cartpole_0/rollouts/final.pkl
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   tests/testdata/expert_models/cartpole_0/rollouts/final.pkl

$ file tests/testdata/expert_models/cartpole_0/rollouts/final.pkl  # note it was pkl before
tests/testdata/expert_models/cartpole_0/rollouts/final.pkl: Zip archive data, at least v2.0 to extract, compression method=deflate
```
That said, the documentation in the script is out of date, and it should probably rename pkl->zip to avoid confusion. I'll make a PR addressing that.
Note this branch is quite a bit behind `master` at this point; you may want to merge it in so you can see these changes...
Codecov Report

Merging #445 (c9dcd73) into master (dad9a30) will increase coverage by 0.07%. The diff coverage is 94.16%.
```
@@           Coverage Diff            @@
##           master    #445     +/-  ##
========================================
+ Coverage   97.08%   97.16%   +0.07%
========================================
  Files          84       85       +1
  Lines        7658     7646      -12
========================================
- Hits         7435     7429       -6
+ Misses        223      217       -6
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/imitation/data/rollout.py | 100.00% <ø> (ø) | |
| src/imitation/scripts/parallel.py | 59.64% <ø> (-3.85%) | :arrow_down: |
| src/imitation/scripts/config/train_imitation.py | 67.79% <50.00%> (-5.09%) | :arrow_down: |
| src/imitation/scripts/config/train_adversarial.py | 70.88% <60.00%> (+6.78%) | :arrow_up: |
| src/imitation/util/registry.py | 90.69% <66.66%> (+1.80%) | :arrow_up: |
| src/imitation/algorithms/bc.py | 100.00% <100.00%> (ø) | |
| src/imitation/policies/serialize.py | 100.00% <100.00%> (ø) | |
| src/imitation/scripts/common/common.py | 97.22% <100.00%> (ø) | |
| src/imitation/scripts/common/demonstrations.py | 100.00% <100.00%> (ø) | |
| src/imitation/scripts/common/expert.py | 100.00% <100.00%> (ø) | |
| ... and 15 more | | |
The remaining failing test is for `benchmark_and_table.sh`. It fails since it calls `bc_benchmark.sh` with the `--paper` option. This will try to train a `seals_half_cheetah` policy, which fails since mujoco has been removed in #515.
Now I am quite confused: why did this not make the pipeline fail earlier?

Answer: after #515, the Docker image has not been updated even though the Dockerfile has been changed. Therefore mujoco was/is still present in the (now stale) Docker image.

But then: why is mujoco gone now? The test fails with the usual message:

```
Exception:
You appear to be missing MuJoCo. We expected to find the file here: /root/.mujoco/mujoco210
This package only provides python bindings, the library must be installed separately.
Please follow the instructions on the README to install MuJoCo
https://github.com/openai/mujoco-py#install-mujoco
Which can be downloaded from the website
https://www.roboti.us/index.html
```
> The remaining failing test is for `benchmark_and_table.sh`. It fails since it calls `bc_benchmark.sh` with the `--paper` option.
It looks like it was skipped before because it was in the `USES_FULL_ROLLOUTS` tuple in `test_experiments.py`. You removed that check as part of the port, so it's no longer skipped.
I've pushed a commit that resolves this by having `benchmark_and_table.sh` no longer pass the `--paper` flag to `bc_benchmark.sh` and `dagger_benchmark.sh` when `--fast` is specified. I've also taken the liberty of removing `mujoco-py` from the `setup.py` test requirements, as I think it's no longer needed given this change (and will break on CI anyway, as you observed).
I've also updated the Docker image on DockerHub now -- thanks for catching that.
Not sure why the coverage suddenly dropped in a totally unrelated file.
The naive solution, which would be just running a config of preference comparisons with the `active_selection` flag set, fails for some reason. Before I dig deeper into this I am waiting for your thoughts, @AdamGleave.
The coverage seems fine to me -- the only check that's failing is an optional coverage check on the patch. There seems to be a bunch of indirect coverage changes due to you changing the tests -- so some environment named configs that were previously covered no longer are. This seems harmless.