reward-learning-rl
Docker Installation
Hi,
The Docker installation instructions seem to be incomplete, or the Docker setup is not working. I would appreciate it if you could post the complete installation steps here.
Thanks,
Could you tell me what steps you followed and what errors you ran into when attempting the docker installation? Do you have nvidia-docker and docker-compose installed and set up on your machine?
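A quick way to answer the question above is to check whether the tools resolve on the PATH. The following is a minimal sketch, not part of the repository; the helper name find_tools is hypothetical, and newer NVIDIA container setups may not ship a separate nvidia-docker binary at all, so a missing entry there is not necessarily a problem.

```python
import shutil

# Tool names taken from the question above; adjust for your own setup.
REQUIRED_TOOLS = ("docker", "docker-compose", "nvidia-docker")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

This only confirms that the executables exist; it does not verify that the Docker daemon is running or that the GPU runtime is configured.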
I have both of them installed. Here are the steps I followed to both install and run the docker:
nasimshafiee@riverlab-asi:~$ export MJKEY="$(cat ~/.mujoco/mjkey.txt)" && sudo docker-compose -f reward-learning-rl/docker/docker-compose.dev.cpu.yml up -d --force-recreate
Recreating softlearning-dev-cpu ... done
sudo docker container list
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ba72506cc7d9 softlearning:latest-cpu "/usr/bin/tini -- /e…" About a minute ago Up About a minute 0.0.0.0:32773->5000/tcp, 0.0.0.0:32772->6006/tcp, 0.0.0.0:32771->8888/tcp softlearning-dev-cpu
nasimshafiee@riverlab-asi:~$ sudo docker exec -it softlearning-dev-cpu bash
(softlearning) root@ba72506cc7d9:~/softlearning#
(softlearning) root@ba72506cc7d9:~/softlearning# softlearning run_example_local examples.classifier_rl \
--n_goal_examples 10 \
--task=Image48SawyerDoorPullHookEnv-v0 \
--algorithm VICERAQ \
--num-samples 5 \
--n_epochs 300 \
--active_query_frequency 10
bash: softlearning: command not found
Could you try pip install -e . and then try running the same command again?
pip install -e .
Obtaining file:///root/softlearning
Installing collected packages: softlearning
Found existing installation: softlearning 0.0.1
Uninstalling softlearning-0.0.1:
Successfully uninstalled softlearning-0.0.1
Running setup.py develop for softlearning
Successfully installed softlearning
(softlearning) root@ba72506cc7d9:~/softlearning# softlearning run_example_local examples.classifier_rl --n_goal_examples 10 --task=Image48SawyerDoorPullHookEnv-v0 --algorithm VICERAQ --num-samples 5 --n_epochs 300 --active_query_frequency 10
/opt/conda/envs/softlearning/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.1) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
- https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
- https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
Traceback (most recent call last):
File "/opt/conda/envs/softlearning/bin/softlearning", line 11, in
Interesting. Looks like dm_control is causing problems, but I don't see these on my end. @hartikainen any idea what might be going on?
Unfortunately, I don't know exactly what's causing the dm_control import to fail. The latest softlearning has a fix that makes the dm_control and robosuite packages optional: https://github.com/rail-berkeley/softlearning/blob/1f6686d765052c874dcf28f8036acde742decd79/softlearning/environments/utils.py#L7.
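The idea of making dm_control and robosuite optional can be sketched with the general optional-import pattern. This is an illustration only and not the code in the linked softlearning file; the helper names optional_import and require are hypothetical. Note that it only guards against ImportError, so a package that is installed but fails during import for another reason would still raise.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package is handled here as well.
        return None


def require(module, name):
    """Raise a clear error only at the point where the package is needed."""
    if module is None:
        raise ImportError(f"{name} is required for this environment but is not installed")
    return module


if __name__ == "__main__":
    # Package names come from the comment above; both are treated as optional.
    for name in ("dm_control", "robosuite"):
        status = "available" if optional_import(name) is not None else "missing"
        print(f"{name}: {status}")
```

With this pattern, importing the environments module succeeds even when an optional backend is absent, and the error is deferred until an environment that needs that backend is actually requested.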
Hi, I ran into a problem with the Docker installation. I would appreciate it if you could help me with it. Here is my log:
(softlearning) root@fdcf41e30e0c:~/softlearning# softlearning run_example_local examples.classifier_rl --n_goal_examples 10 --task=Image48SawyerDoorPullHookEnv-v0 --algorithm VICERAQ --num-samples 5 --n_epochs 300 --active_query_frequency 10
/opt/conda/envs/softlearning/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.1) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
- https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
- https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
WARNING: Logging before flag parsing goes to stderr.
I0524 16:18:57.085572 139928440866624 acceleratesupport.py:13] OpenGL_accelerate module loaded
I0524 16:18:57.091506 139928440866624 arraydatatype.py:270] Using accelerated ArrayDatatype
I0524 16:18:57.219949 139928440866624 __init__.py:34] MuJoCo library version is: 200
I0524 16:18:58.361702 139928440866624 __init__.py:333] Registering multiworld mujoco gym environments
I0524 16:18:58.415774 139928440866624 __init__.py:14] Registering goal example multiworld mujoco gym environments
2019-05-24 16:18:58,483 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-24_16-18-58_657/logs.
2019-05-24 16:18:58,592 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:27938 to respond...
2019-05-24 16:18:58,717 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:27410 to respond...
2019-05-24 16:18:58,720 INFO services.py:804 -- Starting Redis shard with 1.65 GB max memory.
======================================================================
View the dashboard at http://172.18.0.2:8080/?token=94f192e32354ed6911bacd705f2ed3d4f494466516bddbdb
2019-05-24 16:18:58,839 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-24_16-18-58_657/logs.
2019-05-24 16:18:58,839 WARNING services.py:1304 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-05-24 16:18:58,839 INFO services.py:1427 -- Starting the Plasma object store with 2.48 GB memory using /tmp.
2019-05-24 16:18:58,996 INFO tune.py:64 -- Did not find checkpoint file in /root/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-05-24T16-18-58-2019-05-24T16-18-58.
2019-05-24 16:18:58,996 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 5.6/8.3 GB
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 5.6/8.3 GB
Result logdir: /root/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-05-24T16-18-58-2019-05-24T16-18-58
Number of trials: 5 ({'RUNNING': 1, 'PENDING': 4})
PENDING trials:
- 5933374b-algorithm=VICERAQ-seed=9594: PENDING
- 97f1498b-algorithm=VICERAQ-seed=6933: PENDING
- af9f5ddb-algorithm=VICERAQ-seed=2833: PENDING
- b4794683-algorithm=VICERAQ-seed=2191: PENDING
RUNNING trials:
- 79f20978-algorithm=VICERAQ-seed=3699: RUNNING
(pid=729) /opt/conda/envs/softlearning/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.1) or chardet (3.0.4) doesn't match a supported version!
(pid=729) RequestsDependencyWarning)
(pid=729)
(pid=729) WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
(pid=729) For more information, please see:
(pid=729) * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
(pid=729) * https://github.com/tensorflow/addons
(pid=729) If you depend on functionality not listed there, please file an issue.
(pid=729)
(pid=729) 2019-05-24 16:19:02.423031: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
(pid=729) Using seed 3699
(pid=729) 2019-05-24 16:19:02.449192: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3392375000 Hz
(pid=729) 2019-05-24 16:19:02.449746: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x563b744be830 executing computations on platform Host. Devices:
(pid=729) 2019-05-24 16:19:02.449773: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0):
(pid=729) ERROR: GLEW initalization error: Missing GL version
(pid=729)
(pid=729) Press Enter to exit ...
2019-05-24 16:37:11,180 ERROR worker.py:1672 -- A worker died or was killed while executing task 000000009edd923385c29d9598c6c72491af0aac.
2019-05-24 16:37:11,180 ERROR trial_runner.py:494 -- Error processing event.
Traceback (most recent call last):
File "/opt/conda/envs/softlearning/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 443, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/opt/conda/envs/softlearning/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 315, in fetch_result
result = ray.get(trial_future[0])
File "/opt/conda/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 2193, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2019-05-24 16:37:11,181 INFO ray_trial_executor.py:179 -- Destroying actor for trial 79f20978-algorithm=VICERAQ-seed=3699. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 5.9/8.3 GB
Result logdir: /root/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-05-24T16-18-58-2019-05-24T16-18-58
Number of trials: 5 ({'ERROR': 1, 'PENDING': 4})
ERROR trials:
- 79f20978-algorithm=VICERAQ-seed=3699: ERROR, 1 failures: /root/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-05-24T16-18-58-2019-05-24T16-18-58/79f20978-algorithm=VICERAQ-seed=3699_2019-05-24_16-18-593ri44dsa/error_2019-05-24_16-37-11.txt
PENDING trials:
- 5933374b-algorithm=VICERAQ-seed=9594: PENDING
- 97f1498b-algorithm=VICERAQ-seed=6933: PENDING
- af9f5ddb-algorithm=VICERAQ-seed=2833: PENDING
- b4794683-algorithm=VICERAQ-seed=2191: PENDING
(pid=823) /opt/conda/envs/softlearning/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.1) or chardet (3.0.4) doesn't match a supported version!
(pid=823) RequestsDependencyWarning)
(pid=823)
(pid=823) WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
(pid=823) For more information, please see:
(pid=823) * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
(pid=823) * https://github.com/tensorflow/addons
(pid=823) If you depend on functionality not listed there, please file an issue.
(pid=823)
(pid=823) 2019-05-24 16:37:14.102711: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
(pid=823) Using seed 9594
(pid=823) 2019-05-24 16:37:14.125247: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3392375000 Hz
(pid=823) 2019-05-24 16:37:14.125840: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55dff0c2fa80 executing computations on platform Host. Devices:
(pid=823) 2019-05-24 16:37:14.125871: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0):
@Mehrdad-Dorostian This error is a bit hard to interpret because of all the ray messages. Could you try running
softlearning run_example_debug examples.classifier_rl --n_goal_examples 10 --task=Image48SawyerDoorPullHookEnv-v0 --algorithm VICERAQ --n_epochs 300 --active_query_frequency 10
and post the output here?
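When a log is dominated by informational messages, it can also help to filter it down to the lines that look like failures before posting. The following is a rough sketch under the assumption that failure lines contain one of a few marker strings; the function name important_lines and the marker list are hypothetical and not part of softlearning or ray.

```python
def important_lines(log_text, markers=("ERROR", "Traceback", "Error:")):
    """Keep only the log lines that contain one of the marker substrings."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]


if __name__ == "__main__":
    # Sample lines copied from the log earlier in this thread.
    sample = "\n".join([
        "2019-05-24 16:18:58,720 INFO services.py:804 -- Starting Redis shard with 1.65 GB max memory.",
        "(pid=729) ERROR: GLEW initalization error: Missing GL version",
        "ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.",
    ])
    for line in important_lines(sample):
        print(line)
```

A substring filter like this drops the traceback frames themselves, so it is a way to locate the failure rather than a replacement for the full log.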
I have the same issue when running the run_example_debug command:
softlearning run_example_debug examples.classifier_rl --n_goal_examples 10 --task=Image48SawyerDoorPullHookEnv-v0 --algorithm VICERAQ --n_epochs 300 --active_query_frequency 10
/opt/conda/envs/softlearning/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.1) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
- https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
- https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
WARNING: Logging before flag parsing goes to stderr.
I0528 19:15:46.675933 139764420859712 acceleratesupport.py:13] OpenGL_accelerate module loaded
I0528 19:15:46.681612 139764420859712 arraydatatype.py:270] Using accelerated ArrayDatatype
I0528 19:15:46.935050 139764420859712 __init__.py:34] MuJoCo library version is: 200
I0528 19:15:48.273273 139764420859712 __init__.py:333] Registering multiworld mujoco gym environments
I0528 19:15:48.326326 139764420859712 __init__.py:14] Registering goal example multiworld mujoco gym environments
2019-05-28 19:15:48,394 INFO tune.py:64 -- Did not find checkpoint file in /root/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-05-28T19-15-48-2019-05-28T19-15-48.
2019-05-28 19:15:48,394 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/1 GPUs
Memory usage on this node: 10.6/33.6 GB
Using seed 2866
2019-05-28 19:15:48.405032: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-28 19:15:48.428803: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-05-28 19:15:48.429511: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5627fe6bb0c0 executing computations on platform Host. Devices:
2019-05-28 19:15:48.429537: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0):
Press Enter to exit
Thanks for your response! I ran "softlearning run_example_debug examples.classifier_rl --n_goal_examples 10 --task=Image48SawyerDoorPullHookEnv-v0 --algorithm VICERAQ --n_epochs 300 --active_query_frequency 10" and this is my log:
/opt/conda/envs/softlearning/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.1) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
- https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
- https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
WARNING: Logging before flag parsing goes to stderr.
I0528 17:41:52.373401 139931832047424 acceleratesupport.py:13] OpenGL_accelerate module loaded
I0528 17:41:52.378882 139931832047424 arraydatatype.py:270] Using accelerated ArrayDatatype
I0528 17:41:52.512549 139931832047424 __init__.py:34] MuJoCo library version is: 200
I0528 17:41:53.840228 139931832047424 __init__.py:333] Registering multiworld mujoco gym environments
I0528 17:41:53.892477 139931832047424 __init__.py:14] Registering goal example multiworld mujoco gym environments
2019-05-28 17:41:53,958 INFO tune.py:64 -- Did not find checkpoint file in /root/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-05-28T17-41-53-2019-05-28T17-41-53.
2019-05-28 17:41:53,958 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/1 GPUs
Memory usage on this node: 10.4/33.6 GB
Using seed 4148
2019-05-28 17:41:53.969472: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-28 17:41:53.992703: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-05-28 17:41:53.993372: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55a06bd030c0 executing computations on platform Host. Devices:
2019-05-28 17:41:53.993388: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0):
Press Enter to exit ...
@Mehrdad-Dorostian @NasimShafiee Could you tell me what GPU you have on your system? What's the output of nvidia-smi when you run it from within the docker container?
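If it is unclear whether nvidia-smi is even reachable inside the container, a small wrapper can distinguish "command missing" from "command ran". This is a sketch only; the helper name run_if_available is hypothetical, and the keyword arguments are chosen to stay compatible with the Python 3.6 interpreter that appears in the logs above.

```python
import shutil
import subprocess


def run_if_available(command, *args):
    """Run a command and return its combined output, or None if it is not on PATH."""
    executable = shutil.which(command)
    if executable is None:
        return None
    completed = subprocess.run(
        [executable, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the captured output
        universal_newlines=True,   # decode bytes to str (works on Python 3.6)
    )
    return completed.stdout


if __name__ == "__main__":
    output = run_if_available("nvidia-smi")
    print(output if output is not None else "nvidia-smi not found on PATH")
```

A result of None from inside the container would suggest the NVIDIA tools are not exposed to it, which is expected for a CPU-only image.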
Fri Jun 7 18:08:52 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27 Driver Version: 415.27 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 45% 61C P2 117W / 260W | 707MiB / 10988MiB | 43% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1224 G /usr/lib/xorg/Xorg 202MiB |
| 0 2780 G compiz 79MiB |
| 0 4555 C+G python3.6 373MiB |
| 0 4626 G ...uest-channel-token=14554047989433986600 47MiB |
+-----------------------------------------------------------------------------+
I cannot access nvidia-smi inside softlearning-dev-cpu. Also, I cannot create the softlearning-dev-gpu image:
WARNING: The SOFTLEARNING_DEV_GPU_TAG variable is not set. Defaulting to a blank string.
ERROR: no such image: avi-softlearning:: invalid reference format
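The two messages above appear to be related: with the tag variable unset, the image name ends in a bare colon (avi-softlearning:), which Docker rejects as an invalid reference. The sketch below illustrates that failure mode under the assumption that the compose file builds the image name as repository plus colon plus the value of SOFTLEARNING_DEV_GPU_TAG; the compose file itself is not shown in this thread, and the function name image_reference and the tag value example-tag are hypothetical.

```python
import os


def image_reference(repository, tag_var="SOFTLEARNING_DEV_GPU_TAG", environ=None):
    """Build 'repository:tag' from an environment variable, refusing empty tags."""
    environ = os.environ if environ is None else environ
    tag = environ.get(tag_var, "")
    if not tag:
        # An empty tag would produce 'repository:', which is not a valid reference.
        raise ValueError(f"{tag_var} is not set; export it before running docker-compose")
    return f"{repository}:{tag}"


if __name__ == "__main__":
    # 'example-tag' is a placeholder, not a tag known to exist.
    print(image_reference("avi-softlearning", environ={"SOFTLEARNING_DEV_GPU_TAG": "example-tag"}))
```

If this reading is right, exporting SOFTLEARNING_DEV_GPU_TAG in the same shell, in the same way MJKEY is exported earlier in this thread, should at least get past the invalid reference error.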
I have some problems using the Docker installation.
pip subprocess error:
Running command git clone -q https://github.com/deepmind/dm_control.git /tmp/pip-req-build-f94_w0g4
Running command git checkout -q 0260f3effcfe2b0fdb25d9652dc27ba34b52d226
Running command git clone -q https://github.com/avisingh599/multiworld.git /tmp/pip-req-build-3wyolddc
Running command git checkout -q 19bf319422c0016260166bf64e194552bf2a9e68
Running command git clone -q https://github.com/hartikainen/mujoco-py.git /tmp/pip-req-build-5687gubb
Running command git checkout -q 29fcd26290c9417aef0f82d0628d29fa0dbf0fab
fatal: reference is not a tree: 29fcd26290c9417aef0f82d0628d29fa0dbf0fab
ERROR: Command errored out with exit status 128: git checkout -q 29fcd26290c9417aef0f82d0628d29fa0dbf0fab Check the logs for full command output.
CondaEnvException: Pip failed
ERROR: Service 'softlearning-dev-cpu' failed to build: The command '/bin/bash -c echo "${MJKEY}" > ~/.mujoco/mjkey.txt && conda env update -f /tmp/environment.yml && conda clean --all -y && rm ~/.mujoco/mjkey.txt' returned a non-zero code: 1
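The "reference is not a tree" message generally means git could not find the pinned commit in the repository it had just cloned, so the first step is to identify which pin is affected. A minimal sketch for pulling the missing commit hashes out of a build log follows; the function name missing_commits is hypothetical, and the regular expression assumes the message format shown in the log above.

```python
import re

# Matches the git failure line and captures the commit hash it reports.
MISSING_COMMIT = re.compile(r"fatal: reference is not a tree: ([0-9a-f]{7,40})")


def missing_commits(log_text):
    """Return the commit hashes that git reported as missing, in log order."""
    return MISSING_COMMIT.findall(log_text)


if __name__ == "__main__":
    # Lines copied from the build log above.
    log = (
        "Running command git checkout -q 29fcd26290c9417aef0f82d0628d29fa0dbf0fab\n"
        "fatal: reference is not a tree: 29fcd26290c9417aef0f82d0628d29fa0dbf0fab\n"
    )
    print(missing_commits(log))
```

In this log the affected pin is the one checked out right after cloning hartikainen/mujoco-py, so that is the dependency entry whose commit would need to be updated.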