mujoco-py
Unable to fix buildlock RuntimeError
RuntimeError: Unable to acquire lock on b'/home/.../cs2716/spinningup/env/lib/python3.6/site-packages/mujoco_py/generated/mujocopy-buildlock'
due to [Errno 5] Input/output error.
I have previously deleted all the buildlocks generated, and mujocopy-buildlock, in an attempt to fix the issue. I have re-installed mujoco-py and gym and am still getting the runtime error.
Running into the same issue, specifically when I try to build by running import mujoco_py in the command line.
I'm running this in an HPC cluster environment and unfortunately do not have sudo access to install any of the dependencies for Linux (Ubuntu 18.04.5 (Bionic Beaver)). Are the dependencies in the Dockerfile provided in the mujoco_py installation instructions needed?
Package and environment specifications:
- This is being run in a conda env.
- OS: Ubuntu 18.04.5 (Bionic Beaver).
- python==3.8.3
- mujoco_py==2.0.8.2
- gym==0.17.3
- pip==20.2.4
Thank you very much for your help!
@cs2716 I realized this was actually on the server side - the filesystem in which I was trying to run this does not allow for locking. Running it in a file location that allows for locking fixed the problem.
@rmsander I'm getting the same issue and I'm a bit stuck. While I imagine your solution is specific to the server you were using, can you give any further insights to your solution?
Hi @rallen10, for sure! Also, unrelated, but I believe we met at the YuleFest 2019 5K? I work in MIT Distributed Robotics Lab.
I only bring this up because I was wondering if you're running this on MIT Supercloud? This was the server I was referring to in my post above. I found that because the filesystem there doesn't support locking, I am unable to use mujoco_py unless I specifically install it in the /state/partition1/user/ directory (where locking is permitted, at least from what I understand).
If you're using supercloud, below is a bash script I used to get around the issue (you may have to modify some of the commands, such as the conda environment you use):
#!/bin/bash
#SBATCH -c 10
#SBATCH -n 1
#SBATCH --exclusive
#SBATCH --gres=gpu:volta:1
conda init bash
source ~/.bashrc
conda activate interreplay
# Make new folder there
TMPFILE=`mktemp XXXXXX`
mkdir /state/partition1/user/$TMPFILE
# Copy mujoco-py folder to locked part of cluster
cp -r ~/mujoco-py /state/partition1/user/$TMPFILE/
cd /state/partition1/user/$TMPFILE/mujoco-py
# Now install it and import it to build
python3 setup.py install
python3 -c "import mujoco_py"
# Now move code to this folder and mujoco-py into code
cp -r ~/interreplay /state/partition1/user/$TMPFILE/
cp -r mujoco_py ../interreplay/
# Change directory to interreplay
cd ../interreplay
# Run code! (With parameters)
python3 my_script.py <parameters>
# Finally, remove temporary directory
rm -rf /state/partition1/user/$TMPFILE
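For anyone who would rather drive the staging step from Python, here is a minimal sketch of the same pattern as the script above: create a uniquely named directory on storage that supports locking, copy the source tree into it, and always clean up afterwards. The helper name `staged_copy` and its arguments are made up for illustration; only the `/state/partition1/user` path comes from the script.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def staged_copy(src_dir, scratch_root):
    """Copy src_dir into a fresh directory under scratch_root, then clean up.

    Hypothetical helper: mirrors the mkdir / cp -r / remove steps of the bash script.
    """
    # mkdtemp creates a uniquely named directory, so concurrent jobs do not collide
    stage_root = tempfile.mkdtemp(dir=scratch_root)
    try:
        name = os.path.basename(os.path.normpath(src_dir))
        dest = os.path.join(stage_root, name)
        shutil.copytree(src_dir, dest)
        yield dest
    finally:
        # A directory needs a recursive delete; a plain file removal would fail
        shutil.rmtree(stage_root, ignore_errors=True)
```

It could be used as `with staged_copy(os.path.expanduser("~/mujoco-py"), "/state/partition1/user") as path:` with the build and the experiment run inside the `with` block, so the scratch directory is removed even if the run crashes.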
And if you're not running on Supercloud: specifically, I think the issue was that the mujoco-py code was being run from a part of the server that does not support POSIX locking.
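If you want to check a location before installing there, a small probe like the one below may help. It is only a sketch: it creates a throwaway file in the directory and tries the same kind of non-blocking `fcntl.lockf` call that shows up in the lock error. The function name `supports_posix_lock` is my own, and it only works on Unix-like systems because it relies on `fcntl`.

```python
import errno
import fcntl
import os
import tempfile


def supports_posix_lock(directory):
    """Try a non-blocking exclusive lock on a scratch file in `directory`.

    Returns (True, None) on success, or (False, errno_name) if locking fails.
    Hypothetical helper, not part of mujoco-py.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # e.g. ENOSYS ("Function not implemented") or EIO ("Input/output error")
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        os.close(fd)
        os.remove(path)
```

Calling `supports_posix_lock(os.path.expanduser("~"))` and then the same on a candidate scratch directory should show which of the two can host the mujoco-py build.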
Hope this helps, and feel free to follow up!
@rmsander Hahahaha! That is the most fantastically specific response I've gotten on a github issue! Yes we would have met at the 5K, and yes I am trying to run this on supercloud!
Thank you so much for your fix; I'll give it a shot. I wish all of my github issues could get such a tailor-made solution!
@rallen10 Haha this is great! I've gotta say this is the craziest thing that's ever happened to me on GitHub :)
Hope this helps, and please feel free to email me [email protected] if it doesn't! All the best with your project!
I am having the same issue on supercloud. It looks like it is due to the NFS missing the file lock function. My error is the following:
Traceback (most recent call last):
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 99, in _try_acquire
self.trylock()
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 217, in trylock
self._trylock(self.lockfile)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 250, in _trylock
fcntl.lockf(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 38] Function not implemented
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 145, in make
return registry.make(id, **kwargs)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 90, in make
env = spec.make(**kwargs)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 59, in make
cls = load(self.entry_point)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 18, in load
mod = importlib.import_module(mod_name)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/__init__.py", line 1, in <module>
from gym.envs.mujoco.mujoco_env import MujocoEnv
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 12, in <module>
import mujoco_py
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/__init__.py", line 3, in <module>
from mujoco_py.builder import cymj, ignore_mujoco_warnings, functions, MujocoException
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/builder.py", line 510, in <module>
cymj = load_cython_ext(mujoco_path)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/builder.py", line 89, in load_cython_ext
with fasteners.InterProcessLock(lockpath):
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 179, in __enter__
gotten = self.acquire()
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 161, in acquire
gotten = r(self._try_acquire, blocking, watch)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/_utils.py", line 121, in __call__
return fn(*args, **kwargs)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 112, in _try_acquire
'exception': e,
RuntimeError: Unable to acquire lock on `b'/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/generated/mujocopy-buildlock'` due to [Errno 38] Function not implemented
Hi @geyang, thanks for pointing out this issue. I was having the exact same issue on Supercloud, and as you said, I believe the problem with running mujoco-py on supercloud is that the NFS is missing the file lock function. My solution was to copy mujoco-py from my home directory to a directory under /state/partition1/user/ (see the commands below, taken from the bash script above) and to run the setup commands there to build the package.
# Copy mujoco-py folder to locked part of cluster
cp -r ~/mujoco-py /state/partition1/user/$TMPFILE/ # You should first download mujoco-py to ~/ with git clone
cd /state/partition1/user/$TMPFILE/mujoco-py
# Now install it and import it to build
python3 setup.py install # Run install commands inside the locked part of the server
python3 -c "import mujoco_py" # This will set up the mujoco-py configuration
Please let me know if this helps - good luck!
@geyang @rallen10 Another issue I had (specifically for supercloud) is that when I use the script above to run experiments, I cannot reliably run multiple tasks on the same node. To circumvent this, I am using concurrent experiments in tandem with LLMapReduce. If this is of interest to you, the scripts I used for this can be found below.
NOTE: This may have been a little "over-engineered", but I think it works better than trying to rebuild mujoco_py every time I run a new task on supercloud (which I found was the only way to avoid locking).
Python Script
I run concurrent experiments by making concurrent calls to my main python file. Here, I have created cluster_run.py, which parses my arguments from the given inputs of inputs.txt, and then calls my python script that runs mujoco_py.
"""Script to run multiple tune experiments concurrently, i.e. on the same node.
Calls custom_training.py with a given set of arguments."""
import os
import argparse


def parse_inputs():
    """Used as a CLI parser and argument formatter for use in inputting arguments
    to the custom_training.py file.

    Returns:
        args_list (list): A list of arguments. Each element corresponds to the
            CLI parameters for the given call to custom_training.py.
    """
    # Create argument parser, add arguments, and parse them
    parser = argparse.ArgumentParser()
    parser.add_argument("-input_path", "--input_path", type=str,
                        help="File path location for input arguments")
    args = parser.parse_args()

    # Begin adding arguments
    args_list = []
    keys = ["seed", "env", "custom_replay_buffer", "agent_name",
            "round_robin_weights", "local_dir", "trainer"]
    default_args = ["use_delta", "gaussian_process", "gpytorch", "kneighbors 50",
                    "prioritized_replay", "mixup_interpolation", "train_size 1000",
                    "retrain_interval 1000", "kernel matern", "mean_type constant",
                    "matern_nu 1.5", "global_hyperparams", "use_ard"]

    # Parse arguments from input files (the with-block closes the file for us)
    with open(args.input_path, "r") as inputs:
        for i, line in enumerate(inputs):  # First line is additional arguments
            if i == 0:  # Additional arguments
                added_args = line
            else:
                # Creates string to store CLI config for custom_training.py
                args_list.append("")
                line_args = line.split(" ")
                # Adds arguments for parsed args
                for a, key in zip(line_args, keys):
                    # Stripping ensures no new lines are created
                    args_list[-1] += "--{} {} ".format(key, a.strip())
                # Adds arguments for default args
                for d in default_args:
                    args_list[-1] += "--{} ".format(d.strip())
                # Adds final args that are applied to all in call
                args_list[-1] += added_args
    return args_list


def call_experiments(args):
    """Function to stitch together arguments for custom_training.py into a single
    concurrent custom_training.py call.

    Parameters:
        args (list): A list of arguments. Each element corresponds to the
            CLI parameters for the given call to custom_training.py.
    """
    # Create list to store individual calls
    command_list = ["python3 custom_training.py {} & ".format(a.strip()) for a in args]
    command_str = ""  # Initialize command

    # String-concatenate single commands into concurrent command
    for c in command_list:
        command_str += c
    command_str = command_str[:-2]  # Remove final &+space

    # Add final formats
    command_str = "(" + command_str + ")"
    print("Command is: {}".format(command_str))

    # Call command to run concurrent experiments
    os.system(command_str)


def main():
    """Main invoked script. Calls functions above to run concurrent experiments."""
    args_list = parse_inputs()  # Parse arguments and format
    call_experiments(args_list)  # Run experiments with parsed arguments


if __name__ == '__main__':
    main()
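As a possible alternative to joining the commands with `&` and handing one long string to `os.system`, the launches could also go through `subprocess`. This is only a sketch of that idea, not what cluster_run.py does; `run_concurrently` is a name I made up for illustration.

```python
import shlex
import subprocess


def run_concurrently(commands):
    """Start every command at once and wait for all of them to finish.

    `commands` is a list of shell-style command strings. Returns the exit
    codes in the same order as the commands. Hypothetical helper.
    """
    # shlex.split turns "python3 x.py --seed 1" into an argument list,
    # so no shell is involved in launching the process
    procs = [subprocess.Popen(shlex.split(cmd)) for cmd in commands]
    # wait() blocks until each process exits and returns its exit code
    return [p.wait() for p in procs]
```

Two trade-offs to keep in mind: this waits for every process, whereas a `(a & b)` subshell only waits for the last command, and because no shell is involved a path such as `~/ll_tests/...` would need `os.path.expanduser` before being passed in.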
Run Script for LLMapReduce
To run LLMapReduce, you also need to create a run.sh script that takes parameters from an inputs.txt file and runs tasks with the parsed parameterization. This is largely the same as before; just note that we call a different python script here.
#!/bin/bash
# Change to user directory
cd ~
# Activate conda environment
conda init bash
source ~/.bashrc
conda activate interreplay
# Make new folder there
TMPFILE=`mktemp XXXXXXXXXX`
mkdir /state/partition1/user/$TMPFILE
# Copy mujoco-py folder to locked part of cluster
cp -r ~/mujoco-py /state/partition1/user/$TMPFILE/
cd /state/partition1/user/$TMPFILE/mujoco-py
# Now install it and import it to build
python3 setup.py install
python3 -c "import mujoco_py"
# Now move code to this folder and mujoco-py into code
cp -r ~/interreplay /state/partition1/user/$TMPFILE/
cp -r mujoco_py ../interreplay/
# Change directory to interreplay
cd ../interreplay
# Run code! (With parameters)
python3 cluster_run.py --input_path $1
# Remove temporary directory
rm -rf /state/partition1/user/$TMPFILE
Inputs File for LLMapReduce
Finally, as mentioned above, running LLMapReduce requires a file of input parameters to pass to the run.sh file (this is what mapper.sh is in charge of). In this case, I am just passing a set of input filenames, one per task; each of these files holds the "experiment" parameters that will be placed on each node.
inputs_int_only_1.txt
inputs_int_only_2.txt
inputs_int_only_3.txt
inputs_ll_1.txt
inputs_ll_2.txt
inputs_ll_3.txt
inputs_vanilla_only_1.txt
inputs_vanilla_only_2.txt
Each file in turn contains inputs, e.g. for inputs_int_only_1.txt:
232 HalfCheetah-v2 True 232_k50_interp 5 ~/ll_tests/interp_only_updates SAC
243 HalfCheetah-v2 True 243_k50_interp 5 ~/ll_tests/interp_only_updates SAC
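(Note that for the purposes of the script, the blank first line is actually important: cluster_run.py reads the first line of the file as additional arguments that are appended to every experiment.)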
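To make the file format concrete, here is a small sketch of how one of these lines maps onto command-line flags. It mirrors the `keys` list from cluster_run.py but leaves out the default and additional arguments; `line_to_flags` is a name chosen only for this example.

```python
KEYS = ["seed", "env", "custom_replay_buffer", "agent_name",
        "round_robin_weights", "local_dir", "trainer"]


def line_to_flags(line, keys=KEYS):
    """Pair each whitespace-separated value on `line` with its flag name."""
    values = line.split()
    return " ".join("--{} {}".format(k, v) for k, v in zip(keys, values))


example = "232 HalfCheetah-v2 True 232_k50_interp 5 ~/ll_tests/interp_only_updates SAC"
print(line_to_flags(example))
# --seed 232 --env HalfCheetah-v2 --custom_replay_buffer True --agent_name 232_k50_interp --round_robin_weights 5 --local_dir ~/ll_tests/interp_only_updates --trainer SAC
```

So the column order in each inputs file has to match the order of the keys, since values are matched to flags purely by position.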
Putting It All Together
You can run this together with LLMapReduce in supercloud. Just make sure run.sh and inputs.txt are in the same directory, and make sure you have a mapper.sh file and that you have executable privileges for mapper.sh and run.sh (chmod +x mapper.sh run.sh). Then you can run:
LLMapReduce --mapper mapper.sh --input inputs.txt --output ~/<OUTDIR> --slotsPerTask 40 --np [4,1,1] --gpuNameCount=volta:2 --keep=true
Feel free to follow up, and hope this is helpful!
@rmsander For some reason I can run this on the login node:
python -c "import gym;img = gym.make('Reacher-v2').render('rgb_array');print(img.shape)"
But somehow when I run this on a worker node, it gives me an error message:
Traceback (most recent call last):
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/jaynes/entry.py", line 11, in <module>
fn(*args, **kwargs)
File "/Users/ge/mit/dmc_gen/dmc_gen_analysis/__infra/launch_debug.py", line 13, in gym_render
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/core.py", line 233, in render
return self.env.render(mode, **kwargs)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 145, in render
self._get_viewer(mode).render(width, height, camera_id=camera_id)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 172, in _get_viewer
self.viewer = mujoco_py.MjRenderContextOffscreen(self.sim, -1)
File "mjrendercontext.pyx", line 45, in mujoco_py.cymj.MjRenderContext.__init__
File "mjrendercontext.pyx", line 109, in mujoco_py.cymj.MjRenderContext._setup_opengl_context
ValueError: invalid literal for int() with base 10: 'GPU-ab488f30-aabb-f304-ddf1-875fa3ac7df9'
srun: error: d-13-12-1: task 0: Exited with exit code 1
This is after I run the following commands:
LC_CTYPE=en_US.UTF-8 LANG=en_US.UTF-8 LANGUAGE=en_US
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/gridsan/geyang/.mujoco/mujoco200/bin
DMCGEN_DATA=$HOME/mit/dmc_gen/custom_vendor/data
startup: >-
source ~/.bashrc;
module load cuda/11.0;
module load anaconda/2020b;
source activate dmcgen;
cp -r /home/gridsan/$USER/mujoco-py /state/partition1/user/$USER;
sleep 10;
echo "finished copying";
Use the following on the worker node (or in the submission script):
export CUDA_VISIBLE_DEVICES=0,1
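To illustrate why this helps: the traceback suggests that the render context tries to turn the first entry of CUDA_VISIBLE_DEVICES into an integer, and on the worker node the scheduler had set it to a GPU UUID. The sketch below reproduces that parsing step and applies the same override from Python. Both function names are made up, and the parsing is my assumption about what mujoco_py does, based only on the error message.

```python
import os


def first_device_index(visible_devices):
    """Parse the first entry of a CUDA_VISIBLE_DEVICES-style string as an int.

    Illustrative only: assumed to approximate the failing step in the traceback.
    Returns -1 for an empty string.
    """
    if not visible_devices:
        return -1
    return int(visible_devices.split(",")[0])


def force_integer_devices(indices="0,1"):
    """Replace CUDA_VISIBLE_DEVICES with integer indices if it is not parseable."""
    current = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    try:
        first_device_index(current)
    except ValueError:
        # e.g. "GPU-ab488f30-..." cannot be converted with int()
        os.environ["CUDA_VISIBLE_DEVICES"] = indices
    return os.environ.get("CUDA_VISIBLE_DEVICES", "")
```

Calling `force_integer_devices()` before the first `env.render(...)` would have the same effect as the export above. The default of "0,1" is simply the value from the export; it should match the GPUs actually allocated to the job.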