Running MLBoxes on Windows machines.
Docker and other MLCommons-Box runners assume they run in a Linux environment. Several updates are required to support Windows machines as well. Let's use this thread to track what is required and also to document the process of running boxes on Windows.
How to run Docker-based MLBoxes on Windows machines?
- Do this ...
- Do that ...
Fixed:
- `docker run` command #134.
To be fixed:
- `docker inspect` command that uses `/dev/null`. Error: `Could not find a part of the path 'C:\dev\null'`. Either the `/dev/null` redirect should be removed on the Windows platform, or the Docker runner needs to be able to figure out where it runs (cmd, PowerShell). Depending on the environment, either `NUL` or `$null` is used.
- The function that creates mount points needs to be updated. Currently, for file names the following is generated:
  mounts: C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters: '/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters'
- Paths on a command line need to be quoted.
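One way to avoid the `/dev/null` problem without detecting the shell is to let Python discard the output itself. The following is a minimal sketch, not the runner's actual code: the function names `inspect_command` and `image_exists`, and the injectable `run` parameter, are hypothetical and exist only for illustration. It relies on `subprocess.DEVNULL`, which the standard library maps to the right null device on each platform.

```python
import subprocess
from typing import Callable, List


def inspect_command(image: str) -> List[str]:
    """Build the `docker inspect` call as an argument list (no shell redirect)."""
    return ["docker", "inspect", "--type=image", image]


def image_exists(image: str, run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> bool:
    """Return True if `docker inspect` exits with code 0 for the image.

    Output is discarded through subprocess.DEVNULL instead of `> /dev/null 2>&1`,
    so the same code works under cmd, PowerShell and Linux shells.
    The `run` parameter is a hypothetical hook that makes the function testable
    without a Docker installation.
    """
    result = run(
        inspect_command(image),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

Because the command is passed as a list rather than a single string, no shell is involved and paths with spaces do not need manual quoting, which may also help with the quoting item above.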
Another error:
command issued for mnist example:
C:\mlperf\mlbox_11062020\box_examples\mnist> docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/download_logs:/mlbox_io1/download_logs serebrya/mlbox_mnist:0.0.2 download --data_dir=/mlbox_io0/data --log_dir=/mlbox_io1/download_logs
here is the error:
2020-11-10 16:58:42.772479: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-11-10 16:58:42.772697: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-11-10 16:58:42.772714: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
@hshaikusa These messages are OK; they are warnings, not errors. When no GPUs are available, TF should fall back to the CPU compute backend. I see these messages on Linux machines as well.
@sergey-serebryakov, OK, here is another error I am facing for mnist:
command: C:\mlperf\mlbox_11062020\box_examples\mnist> mlcommons_box_docker run --mlbox=. --platform=platforms/docker.yaml --task=run/train.yaml
outcome:
MLBox(root=C:\mlperf\mlbox_11062020\box_examples\mnist, name=mnist, version=0.1.0, task=MLBoxTask(inputs={'data_dir': 'directory', 'parameters_file': 'file'}, outputs={'log_dir': 'directory', 'model_dir': 'directory'}), invoke=MLBoxInvoke(task_name=train, input_binding={'data_dir': '$WORKSPACE/data', 'parameters_file': '$WORKSPACE/parameters/default.parameters.yaml'}, output_binding={'log_dir': '$WORKSPACE/train_logs', 'model_dir': '$WORKSPACE/model'}), platform=<mlcommons_box.common.objects.platform_config.PlatformConfig object at 0x0000015A78854F48>)
docker inspect --type=image serebrya/mlbox_mnist:0.0.2 > /dev/null 2>&1
The system cannot find the path specified.
Docker image (serebrya/mlbox_mnist:0.0.2) does not exist. Running 'configure' phase.
docker pull serebrya/mlbox_mnist:0.0.2
0.0.2: Pulling from serebrya/mlbox_mnist
Digest: sha256:75667646473cda957bd23b52b6f660fb462986d7776d323a654ae59269ce02b9
Status: Image is up to date for serebrya/mlbox_mnist:0.0.2
docker.io/serebrya/mlbox_mnist:0.0.2
mounts={'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data': '/mlbox_io0/data', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters': '/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs': '/mlbox_io2/train_logs', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model': '/mlbox_io3/model'}, args=['train', '--data_dir=/mlbox_io0/data', '--parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml', '--log_dir=/mlbox_io2/train_logs', '--model_dir=/mlbox_io3/model']
docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters:/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs:/mlbox_io2/train_logs --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model:/mlbox_io3/model serebrya/mlbox_mnist:0.0.2 train --data_dir=/mlbox_io0/data --parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml --log_dir=/mlbox_io2/train_logs --model_dir=/mlbox_io3/model
docker: Error response from daemon: invalid mode: \mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters.
See 'docker run --help'.
Traceback (most recent call last):
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\mlbox_11062020\Scripts\mlcommons_box_docker.exe\__main__.py", line 7, in <module>
C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters:/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs:/mlbox_io2/train_logs --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model:/mlbox_io3/model serebrya/mlbox_mnist:0.0.2 train --data_dir=/mlbox_io0/data --parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml --log_dir=/mlbox_io2/train_logs --model_dir=/mlbox_io3/model
@hshaikusa Thanks, there's one more issue to be fixed associated with how mount points are constructed. I updated the first message in this thread.
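For reference, the `mounts` output above shows the container path for the `parameters_file` input being built by appending the full Windows host path to `/mlbox_io1/`. Docker reads each `--volume` value as colon-separated fields (host:container:mode), so the extra `C:` presumably pushes the rest of the path into the mode field, which would explain the `invalid mode` error. Below is a minimal sketch of one possible fix, not the runner's actual code: the function name `build_mounts` and the shape of its `bindings` argument are hypothetical. It uses `ntpath` for the host side (which accepts both `\` and `/` as separators on any platform) and `posixpath` for the container side.

```python
import ntpath
import posixpath
from typing import Dict, List, Tuple


def build_mounts(bindings: Dict[str, Tuple[str, str]]) -> Tuple[Dict[str, str], List[str]]:
    """Map host paths to container mount points and build task arguments.

    `bindings` (hypothetical shape) maps a parameter name to a
    (host_path, kind) pair, where kind is "file" or "directory".
    For files, the parent directory is mounted and the file name is
    appended to the container path in the task argument.
    """
    mounts: Dict[str, str] = {}
    args: List[str] = []
    for idx, (name, (host_path, kind)) in enumerate(bindings.items()):
        host_path = host_path.rstrip("\\/")
        if kind == "file":
            # Split with Windows rules so mixed separators are handled.
            host_dir, file_name = ntpath.split(host_path)
        else:
            host_dir, file_name = host_path, ""
        # Only the last component of the host directory goes into the
        # container path, never the drive letter or the full host path.
        container_dir = posixpath.join(f"/mlbox_io{idx}", ntpath.basename(host_dir))
        mounts[host_dir] = container_dir
        container_path = posixpath.join(container_dir, file_name) if file_name else container_dir
        args.append(f"--{name}={container_path}")
    return mounts, args
```

With this approach the `parameters_file` binding from the mnist example would mount the `workspace/parameters` directory at `/mlbox_io1/parameters` and pass `/mlbox_io1/parameters/default.parameters.yaml` to the task. Two inputs that live in the same host directory would share one dictionary key in this sketch, so a real implementation would need to handle that case.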
I cannot run Docker on my Windows laptop (probably due to McAfee). I asked our admins to allocate a Windows virtual instance that I can use for testing.
I think we might need to support Windows-specific file path construction. A workaround for now (while we're working to stabilize the code) is probably to use WSL and add instructions for that.
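If the WSL route is taken, Windows paths such as the workspace directory would need to be translated into paths visible inside WSL. The sketch below assumes the default WSL automount layout, where drive `C:` appears under `/mnt/c`; the function name `windows_to_wsl` and the `mount_root` parameter are hypothetical, and a customized automount root would need a different value.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path: str, mount_root: str = "/mnt") -> str:
    """Translate an absolute drive-letter Windows path to a WSL-style path.

    Assumes the default WSL automount root (/mnt/<drive letter>).
    Relative paths and UNC paths are rejected in this sketch.
    """
    win = PureWindowsPath(path)
    if not win.is_absolute() or len(win.drive) != 2:
        raise ValueError(f"Expected an absolute drive-letter path, got: {path!r}")
    drive_letter = win.drive[0].lower()
    # parts[0] is the anchor (for example 'C:\\'); the rest are path components.
    return str(PurePosixPath(mount_root, drive_letter, *win.parts[1:]))
```

`PureWindowsPath` parses both `\` and `/` separators regardless of the platform the code runs on, so mixed paths like the ones in the logs above are handled the same way on Windows and Linux.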
Update: I got access to a Windows server and was able to install Docker. I should be able to provide a fix for Windows systems (local Docker runner) next week.
@sergey-serebryakov Cool, looking forward to the fixes. Please plan to push them to PyPI once you are done with your level of validation. I would like to validate them as an outsider who can download them as per the instructions and play with them.