[HELP WANTED][BUG] Can't find Docker for multistep projects

Open Zethson opened this issue 4 years ago • 10 comments

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • [x] No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux latest, but using Docker here
  • MLflow installed from (source or binary): Docker
  • MLflow version (run mlflow --version): latest
  • Python version: 3.7 something
  • Exact command to reproduce: mlflow run .

Describe the problem

A multistep workflow run with Docker fails with mlflow.exceptions.ExecutionException: Could not find Docker executable. Docker is clearly installed and available on the host, since the run launched successfully and even reused the cached load_raw_data step. However, subsequent entry points raise the exception.
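
For anyone trying to confirm where the check fails, here is a minimal sketch that approximates the probe visible in the traceback (running docker --help and treating a missing executable as an error). It is not MLflow's code: it uses only the standard library, and the function name docker_available is hypothetical. Run on the host it should report that Docker is available; run inside an image that ships no Docker client, it would be expected to return False.

```python
import shutil
import subprocess


def docker_available(executable: str = "docker") -> bool:
    """Return True if `executable` is on PATH and can be started."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(executable) is None:
        return False
    try:
        # Same kind of probe as in the traceback: run "<docker> --help".
        subprocess.run(
            [executable, "--help"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # FileNotFoundError (a subclass of OSError) is the error in the traceback.
        return False
    return True


if __name__ == "__main__":
    print("docker available:", docker_available())
```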

Code to reproduce issue

https://github.com/Zethson/mlflow_custom_ms_example

This is a very slightly adapted version of the custom multistep example. Please build the Docker image as custom_ms_example and then run the project with the usual mlflow run . command. Be aware that you may run into subsequent errors, such as a missing JAVA_HOME, since the container may not be complete yet. At this point, however, execution does not get that far.

Other info / logs

zeth@master ~/P/custom_multistep [1]> mlflow run .                                                                                                                                     (base) 
2020/05/18 16:50:38 INFO mlflow.projects: === Building docker image multistep_example ===
2020/05/18 16:50:54 INFO mlflow.projects: === Created directory /tmp/tmpxwoywhi1 for downloading remote URIs passed to arguments of type 'path' ===
2020/05/18 16:50:54 INFO mlflow.projects: === Running command 'docker run --rm -v /home/zeth/PycharmProjects/custom_multistep/mlruns:/mlflow/tmp/mlruns -v /home/zeth/PycharmProjects/mlflow/examples/multistep_workflow/mlruns/0/d588d7bc4a174c8bb066748faeb88c5e/artifacts:/home/zeth/PycharmProjects/mlflow/examples/multistep_workflow/mlruns/0/d588d7bc4a174c8bb066748faeb88c5e/artifacts -e MLFLOW_RUN_ID=d588d7bc4a174c8bb066748faeb88c5e -e MLFLOW_TRACKING_URI=file:///mlflow/tmp/mlruns -e MLFLOW_EXPERIMENT_ID=0 multistep_example:latest python main.py --als-max-iter 10 --keras-hidden-units 20 --max-row-limit 100000' in run with ID 'd588d7bc4a174c8bb066748faeb88c5e' === 
Run matched, but has a different source version, so skipping (found=142abbbd6dbc3a9879854f8356f2d7e7d3270729, expected=None)
No matching run has been found.
Found existing run for entrypoint=load_raw_data and parameters={}
Launching new run for entrypoint=etl_data and parameters={'ratings_csv': 'file:///home/zeth/PycharmProjects/mlflow/examples/multistep_workflow/mlruns/0/ed8ba88063bc4ac8acd41a6ddf5bf8b7/artifacts/ratings-csv-dir', 'max_row_limit': 100000}
Traceback (most recent call last):
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/mlflow/projects/__init__.py", line 700, in _validate_docker_installation
    process.exec_cmd([docker_path, "--help"], throw_on_error=False)
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/mlflow/utils/process.py", line 43, in exec_cmd
    cwd=cwd, universal_newlines=True, **kwargs)
  File "/opt/conda/envs/multistep/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/conda/envs/multistep/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'docker': 'docker'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 105, in <module>
    workflow()
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 86, in workflow
    git_commit)
  File "main.py", line 67, in _get_or_run
    submitted_run = mlflow.run(".", entrypoint, parameters=parameters)
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/mlflow/projects/__init__.py", line 291, in run
    synchronous=synchronous, run_id=run_id)
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/mlflow/projects/__init__.py", line 150, in _run
    _validate_docker_installation()
  File "/opt/conda/envs/multistep/lib/python3.7/site-packages/mlflow/projects/__init__.py", line 702, in _validate_docker_installation
    raise ExecutionException("Could not find Docker executable. "
mlflow.exceptions.ExecutionException: Could not find Docker executable. Ensure Docker is installed as per the instructions at https://docs.docker.com/install/overview/.
2020/05/18 16:50:56 ERROR mlflow.cli: === Run (ID 'd588d7bc4a174c8bb066748faeb88c5e') failed ===

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [x] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [x] area/projects: MLproject format, project running backends
  • [ ] area/scoring: Local serving, model deployment tools, spark UDFs
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • [ ] area/uiux: Front-end, user experience, JavaScript, plotting
  • [x] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

Language

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients

Integrations

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations

Zethson, May 18 '20 15:05

I would like to add that the container works in a non-multistep setting.

Zethson, May 18 '20 15:05

@Zethson thanks for filing this. Just to confirm: it looks like the "docker not found" exception is raised from within your Docker container (i.e. within multistep_example:latest). If you start a shell in the container via docker run -it multistep_example:latest bash, is the docker executable present in the resulting container?

My suspicion is that docker is not installed within multistep_example:latest. In general, invoking docker commands from within a Docker container is a bit tricky, so I'm happy to brainstorm on how to make this easier if that suspicion turns out to be correct.

smurching, May 18 '20 23:05

Dear @smurching,

thank you for your swift response. Docker is not installed inside the Docker container and it's absolutely not supposed to be. My expectation is that every step of the multistep workflow is executed inside the (single) Docker container. I am not using some weird custom code, but am solely running your mlflow multistep example with a single Docker container.

Do you have a multistep mlflow example, which runs via a Docker container?

(I am well aware that the current multistep solution is temporary and will be replaced at the end of the year with a reasonable DAG solution, but for now I would like to get this to work well at least).

Zethson, May 19 '20 08:05

@Zethson I see two potential solutions to the problem:

  1. Introduce a --no-docker option, which will allow for running entry points without trying to create a new docker container. You could use this option to run individual steps in your multistep workflow without trying to create a nested docker container

  2. Attempt to mount the host's docker socket when running docker containers for MLflow project execution as described in the StackOverflow post.

I think both of these would unblock your use case, but both require code changes to MLflow. It might be possible to achieve 2) without code changes; I'll investigate. In general, I prefer solution 2, as running multistep Docker projects would "just work", but it would require some investigation (e.g. is it always possible to mount the Docker socket and to identify where it is on the host machine in a platform-independent way?).

smurching, May 22 '20 23:05
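
To make solution 2 concrete, the following is a rough sketch of a docker run command line that bind-mounts the host's Docker socket into the container, so that containers started from inside it become siblings on the host rather than nested containers. It is an illustration only, not MLflow's implementation: the helper name docker_run_with_socket is hypothetical, and /var/run/docker.sock is assumed as the socket location (the usual default on Linux, which is exactly the platform-dependence question raised above).

```python
from typing import List

# Assumption: the default Docker socket location on most Linux hosts.
DEFAULT_DOCKER_SOCKET = "/var/run/docker.sock"


def docker_run_with_socket(
    image: str,
    command: List[str],
    socket_path: str = DEFAULT_DOCKER_SOCKET,
) -> List[str]:
    """Build a `docker run` argv that bind-mounts the host's Docker socket."""
    return [
        "docker", "run", "--rm",
        # Mount the socket at the same path inside the container.
        "-v", f"{socket_path}:{socket_path}",
        image,
        *command,
    ]


if __name__ == "__main__":
    argv = docker_run_with_socket("multistep_example:latest", ["python", "main.py"])
    print(" ".join(argv))
```

Mounting the socket only gives the container a way to reach the host's Docker daemon. The container still needs a Docker client on its PATH for the docker executable check to pass.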

Thanks!

I would also like to suggest solution 2, since it would play far more nicely with proposal https://github.com/mlflow/mlflow/issues/2850 .

Zethson, May 23 '20 08:05

Hi folks, I've added the help wanted label to this issue. It would be great to put together a PR that leverages Docker's -v flag to create sibling containers for multi-step Docker project workflows.

dbczumar, Jul 06 '20 23:07

I should say that I am running into a similar issue, and a docker run -v option would be great

LarsDu, Sep 17 '20 22:09

Hi all, I have created a docker multistep project example based on the multistep_workflow one. It would be great if you could replicate it and validate my approach before doing a PR for that. You can find my example here.

In the example, volumes are set to execute docker within the container and to have the artifacts available for every new container created. You can find instructions to replicate the example in the README.

symeneses, Oct 18 '20 09:10

@dbczumar I can create a PR if someone from the community could check the example I created ☝🏽, which is functional, and give me some guidance. There, I use volumes set in the MLproject file to run Docker from inside the container and to share the mlruns folder.

symeneses, Feb 23 '21 11:02

Thank you @symeneses for finding this workaround. I needed to add a /usr/bin/docker:/usr/bin/docker volume mount to get it working.

prouast, Sep 21 '22 10:09
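
For readers who want to try the volume-based workaround, here is a hedged sketch of what the docker_env section of an MLproject file might look like with both mounts mentioned in this thread. It does not reproduce the linked example: the helper names are hypothetical, the image name is taken from the log above, and /var/run/docker.sock and /usr/bin/docker are assumed host locations that may differ on other systems.

```python
from typing import Iterable, List

# Assumptions: default Linux locations for the Docker socket and client binary.
DOCKER_SOCKET = "/var/run/docker.sock"
DOCKER_BINARY = "/usr/bin/docker"


def sibling_container_volumes(extra: Iterable[str] = ()) -> List[str]:
    """Return host:container volume specs for launching sibling containers."""
    return [
        f"{DOCKER_SOCKET}:{DOCKER_SOCKET}",
        f"{DOCKER_BINARY}:{DOCKER_BINARY}",
        *extra,
    ]


def render_docker_env(image: str, volumes: Iterable[str]) -> str:
    """Render a docker_env section as YAML text for an MLproject file."""
    lines = ["docker_env:", f"  image: {image}", "  volumes:"]
    lines.extend(f'    - "{volume}"' for volume in volumes)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_docker_env("multistep_example", sibling_container_volumes()))
```

Additional mounts, such as a shared mlruns folder, could be passed through the extra argument. The exact paths depend on the project layout and are not shown here.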