
[BUG] mlflow run with --backend kubernetes fails when local docker image with same IMAGE ID exists

Open AdemFr opened this issue 3 years ago • 2 comments

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • [x] No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Catalina 10.15.4
  • MLflow installed from (source or binary): source
  • MLflow version (run mlflow --version): 1.10.0
  • Python version: 3.7.6
  • npm version, if running the dev UI: -
  • Exact command to reproduce:

Describe the problem

I would like to use the Kubernetes deployment functionality for running training jobs. For this I have an MLproject file in my Python package that looks something like this (I omitted the details needed to make MLflow tracking run properly; that part works for now):

name: my-mlflow-project

docker_env:
  image: eu.gcr.io/my-gcloud-project/project_container_gpu

entry_points:
  main:
    command: "python trainer.py"

I would like to be able to run training locally as well as on Kubernetes, just by changing the mlflow command arguments (an example backend config is sketched below):

  1. `mlflow run .`
  2. `mlflow run . --backend kubernetes --backend-config ...`
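
For the Kubernetes run, the backend config is a small JSON file; mine looks roughly like this (the kube-context and job template values are placeholders, the repository URI matches the image from the MLproject file):

{
    "kube-context": "my-kube-context",
    "kube-job-template-path": "kubernetes_job_template.yaml",
    "repository-uri": "eu.gcr.io/my-gcloud-project/project_container_gpu"
}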

Every time I run one of these commands, a Docker image containing the current code is built on top of the base image provided in the MLproject file. The only difference is the resulting image repository name:

» docker images                                                                                                                               
REPOSITORY                                                  TAG                             IMAGE ID            CREATED             SIZE
my-mlflow-project                                    9d6d7a3                         e535a8039eec        2 hours ago         5.85GB
eu.gcr.io/my-gcloud-project/project_container_gpu    9d6d7a3                         e535a8039eec        2 hours ago         5.85GB

Note that the TAG and IMAGE ID are exactly the same in both cases (because it is the same commit and I did not change any code locally).
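
Both lines refer to one image object; Docker simply keeps both names as tags on the same image. A quick way to verify this (the command is standard Docker CLI, the image ID is the one from the listing above):

» docker inspect --format '{{json .RepoTags}}' e535a8039eec

which should print both repository:tag names for that single image ID.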

Once I have run the command locally and then try to run it with the Kubernetes backend, pushing the image to the container registry fails with this error:

2020/09/11 16:04:33 INFO mlflow.projects: === Building docker image eu.gcr.io/mxlabs-adem-pytorch-test/project_container_gpu:9d6d7a3 ===
2020/09/11 16:04:38 INFO mlflow.projects.kubernetes: === Pushing docker image mxlabs-adem-pytorch-test:9d6d7a3 ===
2020/09/11 16:04:41 ERROR mlflow.cli: === Error while pushing to docker registry: denied: requested access to the resource is denied ===

I can work around this by deleting the image with the REPOSITORY name my_package_name, so that only one image with that IMAGE ID exists.
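
Concretely, removing the extra local tag is enough (image name and tag taken from the docker images output above):

» docker rmi my-mlflow-project:9d6d7a3

Because the image still carries the eu.gcr.io/... tag, this only untags the local name; the underlying layers are not deleted.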

I looked into it a bit, and I am pretty sure the reason is that the first tag found on the image is used: in this line the image repository is chosen with image.tags[0].

This leads to MLflow trying to push the image repository my_package_name and failing to "get access" to the non-existent repository.
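
To illustrate, this is roughly what the Docker Python SDK reports for that image (a minimal sketch, assuming the docker package and the image ID from above; the tag order is exactly what makes tags[0] unreliable):

import docker

# Connect to the local Docker daemon and look the image up by its ID.
client = docker.from_env()
image = client.images.get("e535a8039eec")

# Both repository names appear as tags on the same image object, e.g.
# ['my-mlflow-project:9d6d7a3',
#  'eu.gcr.io/my-gcloud-project/project_container_gpu:9d6d7a3']
# so blindly taking tags[0] can pick the local name instead of the registry one.
print(image.tags)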

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [x] area/projects: MLproject format, project running backends
  • [ ] area/scoring: Local serving, model deployment tools, spark UDFs
  • [ ] area/server-infra: MLflow server, JavaScript dev server
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • [ ] area/uiux: Front-end, user experience, JavaScript, plotting
  • [x] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

Language

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

Integrations

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

AdemFr avatar Sep 11 '20 14:09 AdemFr

Having the same issue.

2021/06/22 10:33:38 INFO mlflow.projects.docker: === Building docker image my.registry/train-something:093853c ===
2021/06/22 10:33:38 INFO mlflow.projects.kubernetes: === Pushing docker image train-something:093853c ===

The problem lies at https://github.com/mlflow/mlflow/blob/c88d6967470bd63138396e16a4c2617c018b6ced/mlflow/projects/__init__.py#L151, where the tag at index 0 is not necessarily the tag corresponding to the registry defined by the kube job config. Instead, there should be a search for the right tag, probably using the backend_config 'repository-uri' as a prefix indicator.
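
A sketch of what that search could look like (not the actual MLflow code, just the idea, based on the Docker SDK's tags list):

def select_tag_for_registry(image, repository_uri):
    """Return the tag that belongs to the configured registry, instead of blindly using tags[0]."""
    for tag in image.tags:
        if tag.startswith(repository_uri):
            return tag
    # Fall back to the current behaviour if no tag matches the configured repository.
    return image.tags[0]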

tnte avatar Jun 22 '21 08:06 tnte

still having this issue

fschlz avatar Apr 10 '24 19:04 fschlz