Unable to fix buildlock RuntimeError

Open cs2716 opened this issue 5 years ago • 11 comments

RuntimeError: Unable to acquire lock on b'/home/.../cs2716/spinningup/env/lib/python3.6/site-packages/mujoco_py/generated/mujocopy-buildlock' due to [Errno 5] Input/output error.

I have previously deleted all the buildlocks generated, and mujocopy-buildlock, in an attempt to fix the issue. I have re-installed mujoco-py and gym and am still getting the runtime error.

cs2716 avatar Nov 25 '19 16:11 cs2716

Running into the same issue, specifically when the build is triggered by running import mujoco_py from the command line.

I'm running this in an HPC cluster environment and unfortunately do not have sudo access to install any of the Linux dependencies (Ubuntu 18.04.5 (Bionic Beaver)). Are the dependencies in the Dockerfile provided with the mujoco_py installation instructions needed?

Package and environment specifications:

  1. This is being run in a conda env.
  2. OS: Ubuntu 18.04.5 (Bionic Beaver).
  3. python==3.8.3
  4. mujoco_py==2.0.8.2
  5. gym==0.17.3
  6. pip==20.2.4

Thank you very much for your help!

rmsander avatar Oct 25 '20 03:10 rmsander

@cs2716 I realized this was actually on the server side - the filesystem in which I was trying to run this does not allow for locking. Running it in a file location that allows for locking fixed the problem.
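
For anyone who wants to check a directory before installing into it, here is a minimal sketch of a lock probe. It assumes the failure comes from the same non-blocking fcntl.lockf call that appears in the fasteners traceback further down this thread; the function name supports_posix_lock is hypothetical and not part of mujoco_py or fasteners.

```python
import fcntl
import os
import tempfile


def supports_posix_lock(directory):
    """Return True if an exclusive, non-blocking lock can be taken on a
    temporary file created inside `directory` (hypothetical helper)."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Same flags as the lockf call shown in the fasteners traceback.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # e.g. [Errno 5] or [Errno 38] on filesystems without lock support
        return False
    finally:
        os.close(fd)
        os.remove(path)
```

Calling supports_posix_lock on the site-packages directory and on a node-local scratch directory should show which of the two can host the build lock.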

rmsander avatar Oct 27 '20 19:10 rmsander

@rmsander I'm getting the same issue and I'm a bit stuck. While I imagine your solution is specific to the server you were using, can you give any further insights to your solution?

rallen10 avatar Feb 10 '21 01:02 rallen10

Hi @rallen10, for sure! Also, unrelated, but I believe we met at the YuleFest 2019 5K? I work in MIT Distributed Robotics Lab.

I only bring this up because I was wondering if you're running this on MIT Supercloud? This was the server I was referring to in my post above. I found that, because file locking isn't available there by default, I am unable to use mujoco_py unless I install it in the /state/partition1/user/ directory (where locking is permitted, at least from what I understand).

If you're using supercloud, below is a bash script I used to get around the issue (you may have to modify some of the commands, such as the conda environment you use):

#!/bin/bash

#SBATCH -c 10
#SBATCH -n 1
#SBATCH --exclusive
#SBATCH --gres=gpu:volta:1

conda init bash
source ~/.bashrc
conda activate interreplay

# Make new folder there
TMPFILE=$(basename "$(mktemp -d /state/partition1/user/XXXXXX)")  # mktemp -d creates the directory itself; keep only its name

# Copy mujoco-py folder to locked part of cluster
cp -r ~/mujoco-py /state/partition1/user/$TMPFILE/
cd /state/partition1/user/$TMPFILE/mujoco-py

# Now install it and import it to build
python3 setup.py install
python3 -c "import mujoco_py"

# Now move code to this folder and mujoco-py into code
cp -r ~/interreplay /state/partition1/user/$TMPFILE/
cp -r mujoco_py ../interreplay/

# Change directory to interreplay
cd ../interreplay

# Run code!  (With parameters)
python3 my_script.py <parameters>

# Finally, remove temporary directory (-rf is needed because it is a non-empty directory)
rm -rf /state/partition1/user/$TMPFILE
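
The copy-build-clean-up pattern in the script above can also be sketched in Python, which makes the cleanup run even if the job fails partway. This is only an illustration under assumptions: scratch_copy is a hypothetical helper, and scratch_root stands in for a directory where locking works (such as /state/partition1/user on Supercloud).

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def scratch_copy(source_dir, scratch_root):
    """Copy `source_dir` into a fresh directory under `scratch_root`,
    yield the path of the copy, then delete the scratch directory."""
    scratch = Path(tempfile.mkdtemp(dir=scratch_root))
    try:
        destination = scratch / Path(source_dir).name
        shutil.copytree(source_dir, destination)
        yield destination
    finally:
        # Runs on success and on error, so no scratch directories pile up.
        shutil.rmtree(scratch, ignore_errors=True)
```

Inside the with block you would run the install and import steps from the copied directory, for example with subprocess.run(..., cwd=destination).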

And if you're not running on Supercloud: I think the underlying issue was that the mujoco-py code was being run from a part of the server whose filesystem does not support POSIX file locking.

Hope this helps, and feel free to follow up!

rmsander avatar Feb 10 '21 01:02 rmsander

@rmsander Hahahaha! That is the most fantastically specific response I've gotten on a github issue! Yes we would have met at the 5K, and yes I am trying to run this on supercloud!

Thank you so much for your fix; I'll give it a shot. I wish all of my github issues could get such a tailor-made solution!

rallen10 avatar Feb 10 '21 02:02 rallen10

@rallen10 Haha this is great! I've gotta say this is the craziest thing that's ever happened to me on GitHub :)

Hope this helps, and please feel free to email me [email protected] if it doesn't! All the best with your project!

rmsander avatar Feb 10 '21 02:02 rmsander

I am having the same issue on supercloud. It looks like it is due to the NFS missing the file lock function. My error is the following:

Traceback (most recent call last):
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 99, in _try_acquire
    self.trylock()
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 217, in trylock
    self._trylock(self.lockfile)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 250, in _trylock
    fcntl.lockf(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 38] Function not implemented

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 145, in make
    return registry.make(id, **kwargs)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 90, in make
    env = spec.make(**kwargs)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 59, in make
    cls = load(self.entry_point)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 18, in load
    mod = importlib.import_module(mod_name)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/__init__.py", line 1, in <module>
    from gym.envs.mujoco.mujoco_env import MujocoEnv
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 12, in <module>
    import mujoco_py
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/__init__.py", line 3, in <module>
    from mujoco_py.builder import cymj, ignore_mujoco_warnings, functions, MujocoException
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/builder.py", line 510, in <module>
    cymj = load_cython_ext(mujoco_path)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/builder.py", line 89, in load_cython_ext
    with fasteners.InterProcessLock(lockpath):
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 179, in __enter__
    gotten = self.acquire()
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 161, in acquire
    gotten = r(self._try_acquire, blocking, watch)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/_utils.py", line 121, in __call__
    return fn(*args, **kwargs)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/fasteners/process_lock.py", line 112, in _try_acquire
    'exception': e,
RuntimeError: Unable to acquire lock on `b'/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/mujoco_py/generated/mujocopy-buildlock'` due to [Errno 38] Function not implemented

geyang avatar Feb 22 '21 04:02 geyang

Hi @geyang, thanks for pointing out this issue. I was having the exact same issue on my end on Supercloud, and as you said, I believe the problem with running mujoco-py there is that the NFS is missing the file lock function. My solution was to copy mujoco-py from my home directory to a folder under /state/partition1/user/ (see the commands below, taken from the bash script above) and to run the setup commands there to build the package.

# Copy mujoco-py folder to locked part of cluster
cp -r ~/mujoco-py /state/partition1/user/$TMPFILE/    # You should first download mujoco-py to ~/ with git clone
cd /state/partition1/user/$TMPFILE/mujoco-py

# Now install it and import it to build
python3 setup.py install   # Run install commands inside the locked part of the server
python3 -c "import mujoco_py"  # This will set up the mujoco-py configuration

Please let me know if this helps - good luck!

rmsander avatar Feb 22 '21 13:02 rmsander

@geyang @rallen10 Another issue I had (specifically for supercloud) is that when I use the script above to run experiments, I cannot reliably run multiple tasks on the same node. To circumvent this, I am using concurrent experiments in tandem with LLMapReduce. If this is of interest to you, the scripts I used for this can be found below.

NOTE: This may have been a little "over-engineered", but I think it works better than trying to rebuild mujoco_py every time I run a new task on supercloud (which I found was the only way to avoid locking).

Python Script

I run concurrent experiments by making concurrent calls to my main python file. Here, I have created cluster_run.py, which parses my arguments from the given inputs of inputs.txt, and then calls my python script that runs mujoco_py.

"""Script to run multiple tune experiments concurrently, i.e. on the same node.
Calls custom_training.py with a given set of arguments."""

import os
import argparse

def parse_inputs():
    """Used as a CLI parser and argument formatter for use in inputting arguments
    to the custom_training.py file.

    Returns:
        args_list (list):  A list of arguments.  Each element corresponds to the
            CLI parameters for the given call to custom_training.py.
    """
    # Create argument parser, add arguments, and parse them
    parser = argparse.ArgumentParser()
    parser.add_argument("-input_path", "--input_path", type=str,
                        help="File path location for input arguments")
    args = parser.parse_args()

    # Begin adding arguments
    args_list = []
    keys = ["seed", "env", "custom_replay_buffer", "agent_name",
            "round_robin_weights", "local_dir", "trainer"]
    default_args = ["use_delta", "gaussian_process", "gpytorch", "kneighbors 50",
                  "prioritized_replay", "mixup_interpolation", "train_size 1000",
                  "retrain_interval 1000", "kernel matern", "mean_type constant",
                  "matern_nu 1.5", "global_hyperparams", "use_ard"]

    # Parse arguments from input files
    with open(args.input_path, "r") as inputs:
        for i, line in enumerate(inputs):  # First line is additional arguments
            if i == 0:  # Additional arguments
                added_args = line
            else:

                # Creates string to store CLI config for custom_training.py
                args_list.append("")
                line_args = line.split(" ")

                # Adds arguments for parsed args
                for a, key in zip(line_args, keys):
                    args_list[-1] += "--{} {} ".format(key, a.strip())  # Stripping ensures no new lines are created

                # Adds arguments for default args
                for d in default_args:
                    args_list[-1] += "--{} ".format(d.strip())  # Stripping ensures no new lines are created

                # Adds final args that are applied to all in call
                args_list[-1] += added_args

    return args_list


def call_experiments(args):
    """Function to stitch together arguments for custom_training.py into a single
    concurrent custom_training.py call.

    Parameters:
        args (list):  A list of arguments.  Each element corresponds to the
            CLI parameters for the given call to custom_training.py.
    """
    # Create list to store individual calls
    command_list = ["python3 custom_training.py {} & ".format(a.strip()) for a in args]
    command_str = ""  # Initialize command

    # String-concatenate single commands into concurrent command
    for c in command_list:
        command_str += c
    command_str = command_str[:-2]  # Remove final &+space

    # Add final formats
    command_str = "(" + command_str + ")"
    print("Command is: {}".format(command_str))


    # Call command to run concurrent experiments
    os.system(command_str)


def main():
    """Main invoked script.  Calls functions above to run concurrent experiments."""
    args_list = parse_inputs()  # Parse arguments and format
    call_experiments(args_list)  # Run experiments with parsed arguments


if __name__ == '__main__':
    main()
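
One caveat with the "(a & b & c)" command string, as far as I can tell: os.system returns once the last, foreground command exits, while the backgrounded ones may still be running. A hedged alternative is sketched below. It swaps os.system for subprocess.Popen and waits for every process; run_concurrently is a hypothetical name, and each command is given as an argument list rather than a single string.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, then block until all have finished.

    `commands` is a list of argument lists, e.g.
    [["python3", "custom_training.py", "--seed", "232"], ...].
    Returns the exit codes in the same order.
    """
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in processes]


if __name__ == "__main__":
    # Tiny demonstration with two throwaway Python processes.
    codes = run_concurrently([
        [sys.executable, "-c", "print('task A')"],
        [sys.executable, "-c", "print('task B')"],
    ])
    print("Exit codes:", codes)
```

Waiting on all processes matters if the calling shell script deletes its working directory right after the Python script returns.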

Run Script for LLMapReduce

To run LLMapReduce, you also need to create a run.sh script that takes parameters from an inputs.txt file and runs tasks with the parsed parameterization. This is largely the same as before; just note that we call a different python script here.

#!/bin/bash

# Change to user directory
cd ~

# Activate conda environment
conda init bash
source ~/.bashrc
conda activate interreplay

# Make new folder there
TMPFILE=$(basename "$(mktemp -d /state/partition1/user/XXXXXXXXXX)")  # mktemp -d creates the directory itself; keep only its name

# Copy mujoco-py folder to locked part of cluster
cp -r ~/mujoco-py /state/partition1/user/$TMPFILE/
cd /state/partition1/user/$TMPFILE/mujoco-py

# Now install it and import it to build
python3 setup.py install
python3 -c "import mujoco_py"

# Now move code to this folder and mujoco-py into code
cp -r ~/interreplay /state/partition1/user/$TMPFILE/
cp -r mujoco_py ../interreplay/

# Change directory to interreplay
cd ../interreplay

# Run code!  (With parameters)
python3 cluster_run.py --input_path $1

# Remove temporary directory
rm -rf /state/partition1/user/$TMPFILE

Inputs File for LLMapReduce

Finally, as mentioned above, to run LLMapReduce we need a file of input parameters to provide to the run.sh file (this is what mapper.sh is in charge of). In this case, I am just passing a set of input filenames, one per task. Each of these files holds the "experiment" parameters that will be placed on each node.

inputs_int_only_1.txt
inputs_int_only_2.txt
inputs_int_only_3.txt
inputs_ll_1.txt
inputs_ll_2.txt
inputs_ll_3.txt
inputs_vanilla_only_1.txt
inputs_vanilla_only_2.txt

Where each file in turn contains inputs, e.g. for inputs_int_only_1.txt:

232 HalfCheetah-v2 True 232_k50_interp 5 ~/ll_tests/interp_only_updates SAC
243 HalfCheetah-v2 True 243_k50_interp 5 ~/ll_tests/interp_only_updates SAC

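(Note that for the purposes of the script, the first line of each file is actually important, even when it is blank: cluster_run.py reads line 0 as additional arguments appended to every call, and only the following lines as experiments.)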
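
To make the file format concrete, here is a small sketch of how one experiment line maps onto command-line flags. It mirrors the zip over keys in cluster_run.py but leaves out the default and additional arguments; line_to_cli is a hypothetical helper, and it uses split() rather than split(" ") so stray whitespace is ignored.

```python
# Same key order as the `keys` list in cluster_run.py.
KEYS = ["seed", "env", "custom_replay_buffer", "agent_name",
        "round_robin_weights", "local_dir", "trainer"]


def line_to_cli(line, keys=KEYS):
    """Pair each whitespace-separated value on `line` with its flag name."""
    values = line.split()
    return " ".join("--{} {}".format(k, v) for k, v in zip(keys, values))


example = "232 HalfCheetah-v2 True 232_k50_interp 5 ~/ll_tests/interp_only_updates SAC"
print(line_to_cli(example))
# --seed 232 --env HalfCheetah-v2 --custom_replay_buffer True --agent_name 232_k50_interp --round_robin_weights 5 --local_dir ~/ll_tests/interp_only_updates --trainer SAC
```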

Putting It All Together

You can run this together with LLMapReduce in supercloud. Just make sure run.sh and inputs.txt are in the same directory, and make sure you have a mapper.sh file and that you have executable privileges for mapper.sh and run.sh (chmod +x mapper.sh run.sh). Then you can run:

LLMapReduce --mapper mapper.sh --input inputs.txt --output ~/<OUTDIR> --slotsPerTask 40 --np [4,1,1] --gpuNameCount=volta:2 --keep=true

Feel free to follow up, and hope this is helpful!

rmsander avatar Feb 22 '21 22:02 rmsander

For some reason I can run this on the login node @rmsander

python -c "import gym;img = gym.make('Reacher-v2').render('rgb_array');print(img.shape)"

But somehow when I run this on a worker node, it gives me an error message:

Traceback (most recent call last):
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/jaynes/entry.py", line 11, in <module>
    fn(*args, **kwargs)
  File "/Users/ge/mit/dmc_gen/dmc_gen_analysis/__infra/launch_debug.py", line 13, in gym_render
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/core.py", line 233, in render
    return self.env.render(mode, **kwargs)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 145, in render
    self._get_viewer(mode).render(width, height, camera_id=camera_id)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 172, in _get_viewer
    self.viewer = mujoco_py.MjRenderContextOffscreen(self.sim, -1)
  File "mjrendercontext.pyx", line 45, in mujoco_py.cymj.MjRenderContext.__init__
  File "mjrendercontext.pyx", line 109, in mujoco_py.cymj.MjRenderContext._setup_opengl_context
ValueError: invalid literal for int() with base 10: 'GPU-ab488f30-aabb-f304-ddf1-875fa3ac7df9'
srun: error: d-13-12-1: task 0: Exited with exit code 1

This is after I run the following commands:

    LC_CTYPE=en_US.UTF-8 LANG=en_US.UTF-8 LANGUAGE=en_US
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/gridsan/geyang/.mujoco/mujoco200/bin
    DMCGEN_DATA=$HOME/mit/dmc_gen/custom_vendor/data
  startup: >-
    source ~/.bashrc;
    module load cuda/11.0;
    module load anaconda/2020b;
    source activate dmcgen;
    cp -r /home/gridsan/$USER/mujoco-py /state/partition1/user/$USER;
    sleep 10;
    echo "finished copying";

geyang avatar Mar 04 '21 04:03 geyang

use the following on the worker node (or in submission script)

export CUDA_VISIBLE_DEVICES=0,1
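
If you would rather handle this inside Python, here is a minimal sketch. It assumes, as the ValueError in the traceback above suggests, that the scheduler set CUDA_VISIBLE_DEVICES to a GPU UUID while the offscreen render context expects integer indices. ensure_integer_cuda_devices is a hypothetical helper, and the fallback index "0" may not be the GPU your job was actually allocated.

```python
import os


def ensure_integer_cuda_devices(fallback="0"):
    """Replace CUDA_VISIBLE_DEVICES with `fallback` if any entry is not a
    plain integer index (for example a 'GPU-<uuid>' string)."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    entries = [e.strip() for e in value.split(",") if e.strip()]
    if entries and not all(e.isdigit() for e in entries):
        os.environ["CUDA_VISIBLE_DEVICES"] = fallback
    return os.environ.get("CUDA_VISIBLE_DEVICES", "")
```

Call it before gym or mujoco_py creates the environment, so the variable is already rewritten when the render context is set up.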

bilkitty avatar Aug 18 '23 17:08 bilkitty