
[BUG] mlflow run with --backend kubernetes fails when local docker image with same IMAGE ID exists

Open AdemFr opened this issue 3 years ago • 2 comments

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • [x] No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Catalina 10.15.4
  • MLflow installed from (source or binary): source
  • MLflow version (run mlflow --version): 1.10.0
  • Python version: 3.7.6
  • npm version, if running the dev UI: -
  • Exact command to reproduce:

Describe the problem

I would like to use the Kubernetes deployment functionality for running training jobs. For this I have an MLproject file in my Python package that looks something like this (I omitted the details needed to make MLflow tracking run properly; that part works for now):

name: my-mlflow-project

docker_env:
  image: eu.gcr.io/my-gcloud-project/project_container_gpu

entry_points:
  main:
    command: "python trainer.py"

I would like to be able to run training locally as well as on Kubernetes, just by changing the mlflow command arguments (an example backend config is sketched below):

  1. `mlflow run .`
  2. `mlflow run . --backend kubernetes --backend-config ...`
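
For the Kubernetes run, the backend config is a small JSON file; mine looks roughly like this (the kube-context and job template values are placeholders, the repository URI matches the image from the MLproject file):

{
    "kube-context": "my-kube-context",
    "kube-job-template-path": "kubernetes_job_template.yaml",
    "repository-uri": "eu.gcr.io/my-gcloud-project/project_container_gpu"
}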

Every time I run one of these commands, a Docker image containing the current code is built on top of the base image provided in the MLproject file. The only difference is the resulting image repository name:

» docker images                                                                                                                               
REPOSITORY                                                  TAG                             IMAGE ID            CREATED             SIZE
my-mlflow-project                                    9d6d7a3                         e535a8039eec        2 hours ago         5.85GB
eu.gcr.io/my-gcloud-project/project_container_gpu    9d6d7a3                         e535a8039eec        2 hours ago         5.85GB

Note that the TAG and IMAGE ID are exactly the same in both cases (because it is the same commit and I did not change any code locally).
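
Both lines refer to one image object; Docker simply keeps both names as tags on the same image. A quick way to verify this (the command is standard Docker CLI, the image ID is the one from the listing above):

» docker inspect --format '{{json .RepoTags}}' e535a8039eec

which should print both repository:tag names for that single image ID.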

Once I have run the command locally and then try to run it with the Kubernetes backend, pushing the image to the container registry fails with this error:

2020/09/11 16:04:33 INFO mlflow.projects: === Building docker image eu.gcr.io/mxlabs-adem-pytorch-test/project_container_gpu:9d6d7a3 ===
2020/09/11 16:04:38 INFO mlflow.projects.kubernetes: === Pushing docker image mxlabs-adem-pytorch-test:9d6d7a3 ===
2020/09/11 16:04:41 ERROR mlflow.cli: === Error while pushing to docker registry: denied: requested access to the resource is denied ===

I can work around this by deleting the image with the REPOSITORY name my_package_name, so that only one image with that IMAGE ID exists.
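
Concretely, removing the extra local tag is enough (image name and tag taken from the docker images output above):

» docker rmi my-mlflow-project:9d6d7a3

Because the image still carries the eu.gcr.io/... tag, this only untags the local name; the underlying layers are not deleted.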

I looked into it a bit, and I am pretty sure the reason is that the first tag found on the image is used: in this line the image repository is chosen with image.tags[0].

This leads to MLflow trying to push the image repository my_package_name and failing to "get access" to the non-existent repository.
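
To illustrate, this is roughly what the Docker Python SDK reports for that image (a minimal sketch, assuming the docker package and the image ID from above; the tag order is exactly what makes tags[0] unreliable):

import docker

# Connect to the local Docker daemon and look the image up by its ID.
client = docker.from_env()
image = client.images.get("e535a8039eec")

# Both repository names appear as tags on the same image object, e.g.
# ['my-mlflow-project:9d6d7a3',
#  'eu.gcr.io/my-gcloud-project/project_container_gpu:9d6d7a3']
# so blindly taking tags[0] can pick the local name instead of the registry one.
print(image.tags)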

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [x] area/projects: MLproject format, project running backends
  • [ ] area/scoring: Local serving, model deployment tools, spark UDFs
  • [ ] area/server-infra: MLflow server, JavaScript dev server
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • [ ] area/uiux: Front-end, user experience, JavaScript, plotting
  • [x] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

Language

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

Integrations

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

AdemFr avatar Sep 11 '20 14:09 AdemFr

Having the same issue.

2021/06/22 10:33:38 INFO mlflow.projects.docker: === Building docker image my.registry/train-something:093853c ===
2021/06/22 10:33:38 INFO mlflow.projects.kubernetes: === Pushing docker image train-something:093853c ===

The problem lies at https://github.com/mlflow/mlflow/blob/c88d6967470bd63138396e16a4c2617c018b6ced/mlflow/projects/__init__.py#L151, where the tag at index 0 is not necessarily the tag corresponding to the registry defined by the kube job config. Instead, there should be a search for the right tag, probably using the backend_config 'repository-uri' as a prefix indicator.
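
A sketch of what that search could look like (not the actual MLflow code, just the idea, based on the Docker SDK's tags list):

def select_tag_for_registry(image, repository_uri):
    """Return the tag that belongs to the configured registry, instead of blindly using tags[0]."""
    for tag in image.tags:
        if tag.startswith(repository_uri):
            return tag
    # Fall back to the current behaviour if no tag matches the configured repository.
    return image.tags[0]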

tnte avatar Jun 22 '21 08:06 tnte

still having this issue

fschlz avatar Apr 10 '24 19:04 fschlz