[BUG] RESOURCE_DOES_NOT_EXIST when mlflow call start_run()

Open Jakubelo opened this issue 3 years ago • 18 comments

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • [X] No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 - AWS EC2
  • MLflow installed from (source or binary): conda
  • MLflow version (run mlflow --version): mlflow, version 1.20.2
  • Python version: 3.6.9
  • npm version, if running the dev UI:
  • Exact command to reproduce: mlflow.start_run()

Describe the problem

I have a remote tracking server (I believe the access policies from the EC2 instance to the server are set correctly, but I'm not 100% sure). I have a main (parent) run, and under that parent I also have a few child runs. The issue concerns the first start_run() call (the parent run): when the script reaches with mlflow.start_run(), it crashes.

The response from the server is RESOURCE_DOES_NOT_EXIST when it looks up the run_id.

Code to reproduce issue

import os

import mlflow


def go(config):
    remote_server_uri = "http://x.x.x.x:xxxx"  # set to your server URI
    mlflow.set_tracking_uri(remote_server_uri)
    mlflow.set_experiment('/cargo_movement')
    # You can get the path at the root of the MLflow project with this:
    root_path = os.path.abspath('.')

    # Check which steps we need to execute
    if isinstance(config["main"]["execute_steps"], str):
        # This was passed on the command line as a comma-separated list of steps
        steps_to_execute = config["main"]["execute_steps"].split(",")
    else:
        steps_to_execute = list(config["main"]["execute_steps"])

    with mlflow.start_run() as parent_run:
        # Download step
        if "1_download" in steps_to_execute:
            _ = mlflow.run(
                os.path.join(root_path, "1_download"),
                "main",
                parameters={
                    "parent_run_id": parent_run.info.run_id,
                },
            )
        ...

Other info / logs

$ mlflow run .
2021/09/18 13:30:47 INFO mlflow.projects.utils: === Created directory /tmp/tmpy661fhzb for downloading remote URIs passed to arguments of type 'path' ===
2021/09/18 13:30:47 INFO mlflow.projects.backend.local: === Running command 'source /home/ubuntu/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-167823303a9c0913bc4240ea63b3cb92329b0538 1>&2 && python main.py' in run with ID 'f7b8bafb58404dcb8e27ae1b901b2524' === 
ENV VAR: f7b8bafb58404dcb8e27ae1b901b2524
Traceback (most recent call last):
  File "/home/ubuntu/fchardnet/main.py", line 109, in <module>
    go(config)
  File "/home/ubuntu/fchardnet/main.py", line 25, in go
    with mlflow.start_run() as parent_run:
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 204, in start_run
    active_run_obj = client.get_run(existing_run_id)
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/tracking/client.py", line 150, in get_run
    return self._tracking_client.get_run(run_id)
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 65, in get_run
    return self.store.get_run(run_id)
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 132, in get_run
    response_proto = self._call_endpoint(GetRun, req_body)
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 217, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 169, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: Run with id=f7b8bafb58404dcb8e27ae1b901b2524 not found
2021/09/18 13:30:48 ERROR mlflow.cli: === Run (ID 'f7b8bafb58404dcb8e27ae1b901b2524') failed ===

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [X] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [X] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [X] area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

Language

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

Integrations

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

Sep 18 '21 14:09 Jakubelo

When I call the script without mlflow.set_tracking_uri(remote_server_uri), I get:

mlflow run .
2021/09/18 14:06:55 INFO mlflow.projects.utils: === Created directory /tmp/tmphpullqjs for downloading remote URIs passed to arguments of type 'path' ===
2021/09/18 14:06:55 INFO mlflow.projects.backend.local: === Running command 'source /home/ubuntu/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-167823303a9c0913bc4240ea63b3cb92329b0538 1>&2 && python main.py' in run with ID 'ac0582aec6a44f19899f5dfcba02cc39' === 
INFO: 'cargo_movement' does not exist. Creating a new experiment
ENV VAR: ac0582aec6a44f19899f5dfcba02cc39
Traceback (most recent call last):
  File "/home/ubuntu/fchardnet/main.py", line 109, in <module>
    go(config)
  File "/home/ubuntu/fchardnet/main.py", line 25, in go
    with mlflow.start_run() as parent_run:
  File "/home/ubuntu/anaconda3/envs/mlflow-167823303a9c0913bc4240ea63b3cb92329b0538/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 210, in start_run
    raise MlflowException(
mlflow.exceptions.MlflowException: Cannot start run with ID ac0582aec6a44f19899f5dfcba02cc39 because active run ID does not match environment run ID. Make sure --experiment-name or --experiment-id matches experiment set with set_experiment(), or just use command-line arguments
2021/09/18 14:06:56 ERROR mlflow.cli: === Run (ID 'ac0582aec6a44f19899f5dfcba02cc39') failed ===

Sep 18 '21 14:09 Jakubelo

Is there any update? I've encountered the same issue.

Nov 19 '21 21:11 fpaupier

Same issue. However, it does create the run if I do not explicitly use start_run and just start fitting my model.

Nov 21 '21 16:11 metalglove

I figured out that running mlflow run <dir> from the terminal already creates the run ID, so you don't have to create a parent run yourself (and, given this exception, you shouldn't). The tracking URI has a similar issue: when you call mlflow run <dir>, you have to set the tracking URI as an environment variable beforehand, because setting it from Python with mlflow.set_tracking_uri(foo_uri) (or something like it) is too late. It is too late because the run has already been created locally (you will see an mlruns directory in the working dir).

So the best option is either to rely on environment variables and start the run with mlflow run from the terminal, or to run the script directly with python and create the run and set the tracking URI in Python.

Nov 21 '21 16:11 Jakubelo
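A minimal sketch of how a script could tell the two launch modes apart before deciding whether to configure anything itself. It assumes that mlflow run passes the run ID to the script through the MLFLOW_RUN_ID environment variable and that the tracking URI comes from MLFLOW_TRACKING_URI; the function name launch_context is hypothetical and not part of MLflow.

```python
import os


def launch_context(env=None):
    """Summarise how the script was started, using environment variables only.

    Assumption: `mlflow run` exports MLFLOW_RUN_ID for the run it created.
    """
    env = os.environ if env is None else env
    run_id = env.get("MLFLOW_RUN_ID")
    return {
        # True when a run was already created for us by the launcher
        "launched_by_mlflow_run": run_id is not None,
        "run_id": run_id,
        # None means the client would fall back to its default (local) store
        "tracking_uri": env.get("MLFLOW_TRACKING_URI"),
    }
```

If launched_by_mlflow_run is true and tracking_uri is empty, the run most likely lives in the local mlruns directory, which would match the RESOURCE_DOES_NOT_EXIST response when the script later points at a remote server.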

Transforming the start_run into this fixed it for me:

client = MlflowClient()
run = client.create_run(experiment.experiment_id)
run = mlflow.start_run(run_id = run.info.run_id)
...
mlflow.end_run()

Dec 21 '21 12:12 metalglove

Unfortunately, the solution I provided above does not provide all the other automatically tracked parameters such as git version, source file, executing user, etc.
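One possible way to recover part of that metadata is to pass tags explicitly when creating the run. This is only a sketch: the tag keys below are the string values I believe MLflow uses for its user and source-name tags, the helper name basic_run_tags is hypothetical, and the git commit is not covered.

```python
import getpass
import sys

# Assumed tag keys (string values of the constants in mlflow.utils.mlflow_tags)
USER_TAG = "mlflow.user"
SOURCE_NAME_TAG = "mlflow.source.name"


def basic_run_tags(source_name=None, user=None):
    """Build a tags dict with the executing user and the source file."""
    return {
        USER_TAG: user or getpass.getuser(),
        SOURCE_NAME_TAG: source_name or sys.argv[0],
    }


# Hypothetical usage with the workaround above:
# run = client.create_run(experiment.experiment_id, tags=basic_run_tags())
```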

Jan 25 '22 12:01 metalglove

Is there any work around for this? Does this mean that there is no way to have a remote mlflow tracking server?

Feb 28 '22 10:02 tdnguyen6

The error messages are just a little unhelpful. This type of initialization works for me:

mlflow.set_experiment('hello')
with mlflow.start_run(run_name='tnet_e3nn_reg'):
    log_param("param1", randint(0, 100))

Mar 02 '22 18:03 jparkhilltx

Hi, is there any work around this?

Mar 22 '22 04:03 mithrandir184

Hi, is there any work around this?

I have the same problem. I think the current solution is to remove the Python API call mlflow.start_run() and pass the experiment name manually when you run this command.

mlflow run . --experiment_name="some-experiment-name" --tracking_uri="some-tracking-uri" 

or you can set the environment variable for experiment_name and tracking_uri.

Apr 01 '22 06:04 Clayrisee

mlflow run . --experiment_name="some-experiment-name" --tracking_uri="some-tracking-uri"

There is no parameter called --tracking_uri. The parameter --experiment_name should be --experiment-name. Unfortunately this does not work for me; I tried removing with mlflow.start_run() and keeping mlflow.set_tracking_uri() in the code.

Apr 29 '22 08:04 ArtificialTruth

Well, you can try these steps.

  1. Export the MLflow tracking server variables as shown below.
export MLFLOW_TRACKING_URI=your_tracking_uri
export MLFLOW_EXPERIMENT_NAME="your_experiment_name"
  2. Run your MLflow Project with this command line.
mlflow run [your/where/MLproject Folder] --no-conda # if you don't want to use a conda env

Notes:

  • You must remove mlflow.start_run() from your Python code; if you don't remove this line, it will create 2 runs and cause errors.
  • You don't have to call mlflow.set_tracking_uri(), because the tracking URI is already set in your environment variables.
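The same steps can be sketched from Python by building the environment and command and handing them to subprocess. The function names are hypothetical; the environment variable names and the --no-conda flag are taken from the steps above, and newer MLflow versions may expect a different flag for skipping conda.

```python
import os
import subprocess


def build_mlflow_run(project_dir, tracking_uri, experiment_name, base_env=None):
    """Return the command and environment for launching an MLflow Project."""
    env = dict(os.environ if base_env is None else base_env)
    # Set before the process starts, so the run is created on the remote server
    env["MLFLOW_TRACKING_URI"] = tracking_uri
    env["MLFLOW_EXPERIMENT_NAME"] = experiment_name
    cmd = ["mlflow", "run", project_dir, "--no-conda"]
    return cmd, env


def run_project(project_dir, tracking_uri, experiment_name):
    """Launch the project; raises CalledProcessError if the run fails."""
    cmd, env = build_mlflow_run(project_dir, tracking_uri, experiment_name)
    return subprocess.run(cmd, env=env, check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to print or inspect what would be executed before actually launching anything.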

Hope it will work for you!

May 09 '22 01:05 Clayrisee

I'm using a remote tracking server and I initially had the same problems running a project using mlflow run . from the CLI. When I tried to use with mlflow.start_run() with my normal script setup, I also got a RESOURCE_DOES_NOT_EXIST error.

When I used run = client.create_run(experiment_id) run = mlflow.start_run(run_id = run.info.run_id) as suggested by @metalglove, I ended up kicking off 2 runs, which is why the git commit ID doesn't show up: it is only present for the first run. When I tried the solution proposed by @Clayrisee after removing mlflow.start_run(), everything was logged to one remotely tracked run (as desired).

To get the same result, you can use the SDK and run the project from a Python script with the mlflow.projects.run() command. If you're working from a Windows machine, you'll want to make sure you're using the latest version of MLflow (1.27.0) or you might run into this error. As an edit to that recommendation, I would pass the experiment name to the mlflow run command or the SDK method rather than declare it as an environment variable.

I played around with it some more and found that the main issue was that I was also trying to create/set an experiment at the top of my script (as one often sees outside of MLflow Projects). When I removed that and added back with mlflow.start_run(), everything worked with both the CLI command and the SDK, and only one run seems to have been created.

It is buggy and was a little confusing at first. The main issue that I have is that you have to change how and where you set your experiment depending on whether you're using Projects or running a stand-alone Python script.

I also don't know yet how to specify a specific artifact location. Right now I'm logging everything to S3 and it's automatically creating a directory named with the experiment_id. When creating my experiment in a stand-alone script, I was able to give this folder a meaningful name. But I might just not have found how to do this yet with Projects.

Aug 01 '22 23:08 nfarley-soaren
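A rough sketch of the SDK route mentioned above, passing the experiment name to mlflow.projects.run() instead of calling set_experiment() inside the script. The helper names, the "main" entry point, and the example experiment name are assumptions for illustration; the tracking URI is still expected to come from the environment.

```python
def project_run_kwargs(project_dir, experiment_name, parameters=None):
    """Collect the keyword arguments passed on to mlflow.projects.run()."""
    return {
        "uri": project_dir,
        "entry_point": "main",  # assumed entry point name
        "experiment_name": experiment_name,
        "parameters": parameters,
    }


def launch(project_dir, experiment_name, parameters=None):
    """Run the project through the SDK rather than the CLI."""
    # Imported lazily so the helper above can be used without MLflow installed
    import mlflow.projects

    return mlflow.projects.run(
        **project_run_kwargs(project_dir, experiment_name, parameters)
    )


# Hypothetical usage:
# launch(".", "cargo_movement")
```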

Run the script with python /path/to/file.py and the MLflow Python API will work. mlflow run and the Python API do not work nicely together.

Aug 19 '22 20:08 ghost

Avoid calling mlflow.get_artifact_uri() twice

Mar 14 '23 15:03 danielAdama

@ghost That isn't working for me within one of my scripts. It is with another :( I don't know what the issue is.

Jun 29 '23 16:06 joann-alvarez

Hi all! I have this error when I run "copy-model-version" in a bash script:

mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: Run '<ID>' not found.

I'm able to copy the model, but I have this error, have we got a workaround to solve this issue?

Apr 04 '24 22:04 Sennar19