Running MLBoxes on Windows machines.

Open sergey-serebryakov opened this issue 5 years ago • 7 comments

Docker and other MLCommons-Box runners assume they run in a Linux environment. Several updates are required to support Windows machines as well. Let's use this thread to track what is required and to document the process of running boxes on Windows.

How to run docker-based MLBoxes on Windows machines?

  • Do this ...
  • Do that ...

Fixed:

  • docker run command #134.

To be fixed:

  • docker inspect command that uses /dev/null. Error:
    Could not find a part of the path 'C:\dev\null'
    
    It seems that the /dev/null redirect should either be removed on Windows, or the docker runner needs to detect which shell it runs in (cmd or PowerShell). Depending on the environment, either NUL or $null is used.
  • The function that creates mount points needs to be updated. Currently, for file names the following is generated:
    mounts:
        C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters: '/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters'
    
  • Paths on a command line need to be quoted.
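
One possible direction for the /dev/null and quoting items above, offered as a sketch rather than the runner's actual code: if the runner is written in Python, it could pass the command to `subprocess.run` as an argument list and discard output with `subprocess.DEVNULL`. That avoids shell redirection entirely, so neither `/dev/null`, `NUL`, nor `$null` is needed, and because no shell parses the command, paths with spaces do not need quoting. The helper names `build_inspect_cmd` and `run_quietly` are hypothetical and do not come from the MLCommons-Box code base.

```python
import subprocess
import sys
from typing import List


def build_inspect_cmd(image: str) -> List[str]:
    """Hypothetical helper: the inspect command as an argument list, with no shell redirect."""
    return ["docker", "inspect", "--type=image", image]


def run_quietly(cmd: List[str]) -> int:
    """Run cmd without a shell, discard its output, and return the exit code."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # portable stand-in for "> /dev/null"
        stderr=subprocess.DEVNULL,  # portable stand-in for "2>&1"
        check=False,                # the caller decides what a non-zero code means
    )
    return completed.returncode


if __name__ == "__main__":
    # Demo with the Python interpreter so the sketch runs without Docker installed.
    print(run_quietly([sys.executable, "-c", "raise SystemExit(3)"]))
```

With these helpers, the image check would reduce to something like `run_quietly(build_inspect_cmd(image)) == 0`, assuming `docker` is on the PATH.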

sergey-serebryakov avatar Nov 10 '20 18:11 sergey-serebryakov

Another error:

command issued for mnist example:

C:\mlperf\mlbox_11062020\box_examples\mnist> docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/download_logs:/mlbox_io1/download_logs serebrya/mlbox_mnist:0.0.2 download --data_dir=/mlbox_io0/data --log_dir=/mlbox_io1/download_logs

here is the error:

2020-11-10 16:58:42.772479: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-11-10 16:58:42.772697: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-11-10 16:58:42.772714: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

hshaikusa avatar Nov 11 '20 22:11 hshaikusa

@hshaikusa These messages are warnings, not errors, and can be ignored. When no GPUs are available, TF should fall back to the CPU compute backend. I see these messages on Linux machines as well.

sergey-serebryakov avatar Nov 12 '20 05:11 sergey-serebryakov

@sergey-serebryakov, OK, here is another error I am facing for mnist:

command: C:\mlperf\mlbox_11062020\box_examples\mnist> mlcommons_box_docker run --mlbox=. --platform=platforms/docker.yaml --task=run/train.yaml

outcome:

MLBox(root=C:\mlperf\mlbox_11062020\box_examples\mnist, name=mnist, version=0.1.0, task=MLBoxTask(inputs={'data_dir': 'directory', 'parameters_file': 'file'}, outputs={'log_dir': 'directory', 'model_dir': 'directory'}), invoke=MLBoxInvoke(task_name=train, input_binding={'data_dir': '$WORKSPACE/data', 'parameters_file': '$WORKSPACE/parameters/default.parameters.yaml'}, output_binding={'log_dir': '$WORKSPACE/train_logs', 'model_dir': '$WORKSPACE/model'}), platform=<mlcommons_box.common.objects.platform_config.PlatformConfig object at 0x0000015A78854F48>)
docker inspect --type=image serebrya/mlbox_mnist:0.0.2 > /dev/null 2>&1
The system cannot find the path specified.
Docker image (serebrya/mlbox_mnist:0.0.2) does not exist. Running 'configure' phase.
docker pull serebrya/mlbox_mnist:0.0.2
0.0.2: Pulling from serebrya/mlbox_mnist
Digest: sha256:75667646473cda957bd23b52b6f660fb462986d7776d323a654ae59269ce02b9
Status: Image is up to date for serebrya/mlbox_mnist:0.0.2
docker.io/serebrya/mlbox_mnist:0.0.2
mounts={'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data': '/mlbox_io0/data', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters': '/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs': '/mlbox_io2/train_logs', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model': '/mlbox_io3/model'}, args=['train', '--data_dir=/mlbox_io0/data', '--parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml', '--log_dir=/mlbox_io2/train_logs', '--model_dir=/mlbox_io3/model']
docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters:/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs:/mlbox_io2/train_logs --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model:/mlbox_io3/model serebrya/mlbox_mnist:0.0.2 train --data_dir=/mlbox_io0/data --parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml --log_dir=/mlbox_io2/train_logs --model_dir=/mlbox_io3/model

docker: Error response from daemon: invalid mode: \mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters.
See 'docker run --help'.
Traceback (most recent call last):
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\mlbox_11062020\Scripts\mlcommons_box_docker.exe\__main__.py", line 7, in <module>
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 1259, in invoke
    return process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\mlcommons_box_docker\__main__.py", line 45, in run
    runner.run()
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\mlcommons_box_docker\docker_run.py", line 72, in run
    self._run_or_die(cmd)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\mlcommons_box_docker\docker_run.py", line 117, in _run_or_die
    raise RuntimeError('Command failed: {}'.format(cmd))
RuntimeError: Command failed: docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters:/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs:/mlbox_io2/train_logs --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model:/mlbox_io3/model serebrya/mlbox_mnist:0.0.2 train --data_dir=/mlbox_io0/data --parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml --log_dir=/mlbox_io2/train_logs --model_dir=/mlbox_io3/model

hshaikusa avatar Nov 12 '20 21:11 hshaikusa

@hshaikusa Thanks, there's one more issue to be fixed associated with how mount points are constructed. I updated the first message in this thread.

I cannot run Docker on my Windows laptop (probably due to McAfee). I asked our admins to allocate a Windows virtual instance that I can use for testing.

sergey-serebryakov avatar Nov 13 '20 15:11 sergey-serebryakov

I think we might need to support Windows-specific file path construction. A workaround for now, while we're working to stabilize the code, might be to use WSL and add instructions for that.
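
A minimal sketch of what Windows-aware mount construction could look like, assuming the runner is free to parse host paths with `pathlib.PureWindowsPath` (which accepts both `\` and `/` separators) and to build container paths with `pathlib.PurePosixPath`. The function name `mount_for` is hypothetical and not the runner's actual API. The `/mlbox_ioN/<name>` layout is taken from the log output earlier in this thread, and mounting the parent directory for file inputs is an assumption inferred from the same log. The point is that only the last component of the host directory ends up on the container side, so no drive-letter colon can leak into the `--volume` value.

```python
from pathlib import PurePosixPath, PureWindowsPath
from typing import Tuple


def mount_for(host_path: str, index: int, is_file: bool) -> Tuple[str, str, str]:
    """Hypothetical helper: return (host_dir, container_dir, container_path).

    For a file input, the parent directory is mounted and the file name is
    appended on the container side; for a directory input, it is mounted as-is.
    """
    host = PureWindowsPath(host_path)            # parses 'C:\\a\\b/c' style paths
    host_dir = host.parent if is_file else host
    # Only the last component of the host directory is used inside the container.
    container_dir = PurePosixPath(f"/mlbox_io{index}") / host_dir.name
    container_path = container_dir / host.name if is_file else container_dir
    return str(host_dir), str(container_dir), str(container_path)


if __name__ == "__main__":
    workspace = r"C:\mlperf\mlbox_11062020\box_examples\mnist\workspace"
    print(mount_for(workspace + "/data", 0, is_file=False))
    print(mount_for(workspace + "/parameters/default.parameters.yaml", 1, is_file=True))
```

The returned `host_dir` and `container_dir` would form one `--volume host_dir:container_dir` pair, and `container_path` would be passed to the task as the argument value.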

swiftdiaries avatar Nov 16 '20 07:11 swiftdiaries

Update: I got access to a Windows server and was able to install Docker. I should be able to provide a fix for Windows systems (local Docker runner) next week.

sergey-serebryakov avatar Nov 23 '20 07:11 sergey-serebryakov

@sergey-serebryakov Cool, looking forward to the fixes. Please plan to push them to PyPI once you are done with your own validation. I would like to validate them as an outsider who downloads the package per the instructions and tries it out.

hshaikusa avatar Nov 23 '20 18:11 hshaikusa