balloon-learning-environment
Error when running distributed_train_acme_qrdqn.py
Hi, I have been trying to get 'distributed_train_acme_qrdqn.py' to run with only a few agents and I'm getting the following error. I think it might be an issue between jax, dm-acme, and dm-launchpad.
I did some digging and traced it to acme/agents/jax/actors.
This is where I get stuck, as I'm not entirely sure how the QR-DQN agent is built with jax and passed to launchpad. I would really appreciate any thoughts on this issue.
Operating System
- Python 3.9.13
- Ubuntu 20.04
Error
/usr/local/lib/python3.9/dist-packages/haiku/_src/data_structures.py:37: FutureWarning: jax.tree_structure is deprecated, and will be removed in a future release. Use jax.tree_util.tree_structure instead.
  PyTreeDef = type(jax.tree_structure(None))
I0908 13:09:34.228399 140062111078208 xla_bridge.py:345] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0908 13:09:34.228528 140062111078208 xla_bridge.py:345] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I0908 13:09:34.228579 140062111078208 xla_bridge.py:345] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I0908 13:09:34.229399 140062111078208 xla_bridge.py:345] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
W0908 13:09:34.229537 140062111078208 xla_bridge.py:352] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0908 13:09:34.483748 140052206171904 courier_utils.py:120] Binding: run
I0908 13:09:34.487003 140052206171904 lp_utils.py:87] StepsLimiter: Starting with max_steps = 9600000 (actor_steps)
I0908 13:09:34.487962 140050360694528 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:34.488504 140052214564608 savers.py:164] Attempting to restore checkpoint: None
I0908 13:09:35.382974 140050352301824 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.442431 140046896195328 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.453237 140046232733440 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.453534 140052214564608 courier_utils.py:120] Binding: get_counts
I0908 13:09:35.463889 140046132836096 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.473653 140046098515712 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.482568 140052214564608 courier_utils.py:120] Binding: get_directory
I0908 13:09:35.483737 140045998618368 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.503851 140045981832960 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.504534 140045923084032 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.504815 140052214564608 courier_utils.py:120] Binding: get_steps_key
I0908 13:09:35.524922 140045914691328 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.525063 140052214564608 courier_utils.py:120] Binding: increment
I0908 13:09:35.525359 140045822371584 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.526567 140052214564608 courier_utils.py:120] Binding: restore
I0908 13:09:35.533543 140052214564608 courier_utils.py:120] Binding: save
I0908 13:09:35.542086 140052214564608 savers.py:155] Saving checkpoint: /root/acme/20220908-130931/checkpoints/counter
I0908 13:09:36.944851 140052206171904 lp_utils.py:95] StepsLimiter: Reached 0 recorded steps
Node ThreadWorker(thread=<Thread(actor, stopped daemon 140045923084032)>, future=<Future at 0x7f61f80a19a0 state=finished raised AttributeError>) crashed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/launchpad/launch/worker_manager.py", line 474, in _check_workers
    worker.future.result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/dist-packages/launchpad/launch/worker_manager.py", line 250, in run_inner
    future.set_result(f())
  File "/usr/local/lib/python3.9/dist-packages/launchpad/nodes/python/node.py", line 75, in _construct_function
    return functools.partial(self._function, *args, **kwargs)()
  File "/usr/local/lib/python3.9/dist-packages/launchpad/nodes/courier/node.py", line 113, in run
    instance = self._construct_instance() # pytype:disable=wrong-arg-types
  File "/usr/local/lib/python3.9/dist-packages/launchpad/nodes/python/node.py", line 180, in _construct_instance
    self._instance = self._constructor(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/acme/jax/experiments/make_distributed_experiment.py", line 169, in build_actor
    actor = experiment.builder.make_actor(actor_key, policy_network,
  File "/usr/local/lib/python3.9/dist-packages/acme/agents/jax/dqn/builder.py", line 99, in make_actor
    return actors.GenericActor(
  File "/usr/local/lib/python3.9/dist-packages/acme/agents/jax/actors.py", line 67, in __init__
    self._init = jax.jit(actor.init, backend=backend)
AttributeError: 'function' object has no attribute 'init'
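The last frame of the traceback reads `actor.init`, and the message says the object it received was a plain function. That pattern can be reproduced without Acme or launchpad. The following is only a minimal sketch: `ToyActorCore`, `ToyGenericActor`, `legacy_policy`, and `wrap_policy` are hypothetical names made up for illustration, not Acme's actual classes or API. It assumes the installed actor expects an object that bundles several callables, while the calling code still hands it a bare policy function.

```python
import dataclasses
from typing import Any, Callable


@dataclasses.dataclass
class ToyActorCore:
    """Hypothetical bundle of callables that the toy actor expects."""
    init: Callable[[int], Any]
    select_action: Callable[[Any, Any, Any], Any]


class ToyGenericActor:
    """Hypothetical actor that reads `actor.init`, as in the traceback."""

    def __init__(self, actor):
        # Raises AttributeError when `actor` is a bare function.
        self._init = actor.init
        self._policy = actor.select_action


def legacy_policy(params, observation, state):
    """Old-style policy: a plain function with no `init` attribute."""
    return 0, state


def wrap_policy(policy):
    """Wrap a plain policy function in the bundle the toy actor expects."""
    return ToyActorCore(init=lambda seed: seed, select_action=policy)


# Passing the bare function reproduces the shape of the reported error.
try:
    ToyGenericActor(legacy_policy)
    error_message = None
except AttributeError as e:
    error_message = str(e)

# Passing the wrapped policy constructs the actor without error.
wrapped_actor = ToyGenericActor(wrap_policy(legacy_policy))
print(error_message)
```

If this is what is happening, the mismatch is between the calling code and the installed actor interface rather than in jax itself, so either the caller or the library version would need to change.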
Python Packages
absl-py 0.15.0
ale-py 0.7.3
astunparse 1.6.3
async-generator 1.10
atari-py 0.2.9
attrs 22.1.0
bsuite 0.3.5
cached-property 1.5.2
cachetools 4.2.4
certifi 2021.10.8
chardet 3.0.4
charset-normalizer 2.0.7
chex 0.1.4
clang 5.0
cloudpickle 2.0.0
colorama 0.4.5
commonmark 0.9.1
cycler 0.11.0
dbus-python 1.2.16
decorator 5.1.0
dill 0.3.5.1
distrax 0.1.2
dm-acme 0.4.1
dm-control 0.0.364896371
dm-env 1.5
dm-haiku 0.0.7
dm-launchpad 0.5.2
dm-reverb 0.7.2
dm-sonnet 2.0.0
dm-tree 0.1.6
docker 6.0.0
dopamine-rl 4.0.0
etils 0.7.1
execnet 1.9.0
flatbuffers 1.12
flax 0.5.3
fonttools 4.37.1
frozendict 2.3.4
future 0.18.2
gast 0.4.0
gin 0.1.6
gin-config 0.5.0
glfw 2.5.4
google-api-core 2.8.2
google-api-python-client 2.58.0
google-auth 1.35.0
google-auth-httplib2 0.1.0
google-auth-oauthlib 0.4.6
google-cloud-aiplatform 1.16.1
google-cloud-bigquery 2.34.4
google-cloud-core 2.3.2
google-cloud-resource-manager 1.6.1
google-cloud-storage 2.5.0
google-crc32c 1.3.0
google-pasta 0.2.0
google-resumable-media 2.3.3
googleapis-common-protos 1.56.4
grpc-google-iam-v1 0.12.4
grpcio 1.47.0
grpcio-status 1.47.0
gym 0.21.0
h5py 3.1.0
httplib2 0.20.4
humanize 4.3.0
idna 3.3
imageio 2.21.2
immutabledict 2.2.1
importlab 0.7
importlib-metadata 4.8.1
importlib-resources 5.4.0
iniconfig 1.1.1
jax 0.3.16
jaxlib 0.3.14
jmp 0.0.2
joblib 1.1.0
keras 2.8.0
Keras-Preprocessing 1.1.2
kiwisolver 1.3.2
kubernetes 24.2.0
labmaze 1.0.5
libclang 12.0.0
libcst 0.4.7
lxml 4.9.1
Markdown 3.3.4
matplotlib 3.5.3
mizani 0.7.4
mock 4.0.3
msgpack 1.0.2
mypy-extensions 0.4.3
networkx 2.8.6
ninja 1.10.2.3
numpy 1.22.4
oauthlib 3.1.1
opencv-python 4.5.4.58
opensimplex 0.3
opt-einsum 3.3.0
optax 0.0.9
packaging 21.3
palettable 3.3.0
pandas 1.4.4
patsy 0.5.2
Pillow 8.4.0
pip 22.2.2
plotnine 0.9.0
pluggy 1.0.0
portpicker 1.5.2
promise 2.3
proto-plus 1.22.1
protobuf 3.19.1
psutil 5.9.1
py 1.11.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pygame 2.1.0
Pygments 2.13.0
PyGObject 3.36.0
PyOpenGL 3.1.6
pyparsing 3.0.4
pytest 7.1.2
pytest-forked 1.4.0
pytest-xdist 2.5.0
python-apt 2.0.0+ubuntu0.20.4.8
python-dateutil 2.8.2
pytype 2021.8.11
pytz 2021.3
PyWavelets 1.3.0
PyYAML 6.0
requests 2.26.0
requests-oauthlib 1.3.0
requests-unixsocket 0.2.0
rich 11.2.0
rlax 0.1.4
rlds 0.1.5
rsa 4.7.2
s2sphere 0.2.5
scikit-image 0.19.3
scikit-learn 1.0.1
scipy 1.7.1
setuptools 45.2.0
six 1.15.0
sklearn 0.0
SQLAlchemy 1.2.19
statsmodels 0.13.2
tabulate 0.8.10
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow 2.8.0
tensorflow-datasets 4.5.2
tensorflow-estimator 2.8.0
tensorflow-io-gcs-filesystem 0.26.0
tensorflow-metadata 1.10.0
tensorflow-probability 0.15.0
tensorstore 0.1.23
termcolor 1.1.0
tf-estimator-nightly 2.8.0.dev2021122109
tf-slim 1.1.0
tfp-nightly 0.15.0.dev20211104
threadpoolctl 3.0.0
tifffile 2022.8.12
toml 0.10.2
tomli 2.0.1
toolz 0.11.1
tqdm 4.64.0
transitions 0.8.10
trfl 1.2.0
typed-ast 1.5.4
typing_extensions 4.3.0
typing-inspect 0.8.0
uritemplate 4.1.1
urllib3 1.26.7
websocket-client 1.4.0
Werkzeug 2.0.2
wheel 0.34.2
wrapt 1.12.1
xmanager 0.2.0
zipp 3.6.0
Hi,
Thanks for reporting this issue. After playing around with it myself, it looks likely that Acme's API has changed. I'll update here once I've made progress.
Which versions of Acme and launchpad did you use? I can downgrade to get it working.
I'm not sure which version would work, but it's probably worth trying version 0.4.0, since it was released just before the BLE and we tested the agents at release: https://pypi.org/project/dm-acme/0.4.0/#history
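Before and after a downgrade, it may help to confirm which versions are actually installed in the environment that runs the script. The sketch below uses only the standard library's `importlib.metadata`. The helper name `installed_versions` and the choice of packages to check are assumptions for illustration; the package names themselves are taken from the listing above, where jax is at 0.3.16 while jaxlib is at 0.3.14.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names copied from the package listing in this issue.
PACKAGES_TO_CHECK = ["dm-acme", "dm-launchpad", "dm-reverb", "jax", "jaxlib"]

if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES_TO_CHECK).items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

Running it inside the same interpreter that launches distributed_train_acme_qrdqn.py avoids checking a different virtual environment by mistake.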
I'm looking at upgrading this today. I'll respond here once I've tested and pushed the changes.
Here's the status update:
- I have code ready to check in to upgrade the Acme agents to Acme's new experiments API, but Acme hasn't been pushed to PyPI in a while. I'm checking in with them to see if they will be pushing in the near future. Once they push their code, I will submit my changes.
- Realizing that Acme hasn't pushed in a while, I double-checked that Acme 0.4.0 still works with the current BLE version on PyPI, and found that it does when not using launchpad.
Were you able to get it working with Acme 0.4.0?
Thank you for updating the repository. I wasn't able to get it working with launchpad 0.4.0 either. I believe they removed the caching node at some point, which might be causing compatibility issues.