
[BUG] mlflow gc does not remove deleted experiments when runs were tracked with datasets

Open Mlokos opened this issue 1 year ago • 3 comments

Issues Policy acknowledgement

  • [X] I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

  • Client: 2.9.2
  • Tracking server: 2.9.2

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.3 LTS
  • Python version: Python 3.10.12
  • yarn version, if running the dev UI: -

Describe the problem

"mlflow gc" command do not remove deleted experiments when runs were tracked with datasets

Tracking information

I used the default setup from the "Remote Experiment Tracking with MLflow Tracking Server" tutorial: https://mlflow.org/docs/latest/tracking/tutorials/remote-server.html

  1. Once the setup was running, I ran the sample code (attached below under "Code to reproduce issue") to create a "test" experiment.
  2. Then I deleted the experiment in the web UI.
  3. After that, I ran the following commands:
export MLFLOW_TRACKING_URI="http://localhost:5000"
mlflow gc --backend-store-uri="postgresql://user:password@localhost:5432/mlflowdb"

Code to reproduce issue

import mlflow

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("test")
mlflow.autolog()  # with default settings, sklearn autologging also logs the training dataset

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train the model.
rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)

Stack trace

mlflow gc --backend-store-uri="postgresql://user:password@localhost:5432/mlflowdb"
Run with ID 6fb6bb680a2249388ec8346febc6540c has been permanently deleted.
Traceback (most recent call last):
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.ForeignKeyViolation: update or delete on table "experiments" violates foreign key constraint "datasets_experiment_id_fkey" on table "datasets"
DETAIL:  Key (experiment_id)=(1) is still referenced from table "datasets".


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/store/db/utils.py", line 142, in make_managed_session
    session.commit()
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1969, in commit
    trans.commit(_to_root=True)
  File "<string>", line 2, in commit
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
    ret_value = fn(self, *arg, **kw)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1256, in commit
    self._prepare_impl()
  File "<string>", line 2, in _prepare_impl
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
    ret_value = fn(self, *arg, **kw)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1231, in _prepare_impl
    self.session.flush()
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 4312, in flush
    self._flush(objects)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 4447, in _flush
    with util.safe_reraise():
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 4408, in _flush
    flush_context.execute()
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 466, in execute
    rec.execute(self)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 679, in execute
    util.preloaded.orm_persistence.delete_obj(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 191, in delete_obj
    _emit_delete_statements(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1456, in _emit_delete_statements
    c = connection.execute(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
    return meth(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 516, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
    return self._exec_single_context(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
    self._handle_dbapi_exception(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2343, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.ForeignKeyViolation) update or delete on table "experiments" violates foreign key constraint "datasets_experiment_id_fkey" on table "datasets"
DETAIL:  Key (experiment_id)=(1) is still referenced from table "datasets".

[SQL: DELETE FROM experiments WHERE experiments.experiment_id = %(experiment_id)s]
[parameters: {'experiment_id': 1}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mlokos/dev/mlflow/venv/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/cli.py", line 630, in gc
    backend_store._hard_delete_experiment(experiment_id)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/store/tracking/sqlalchemy_store.py", line 421, in _hard_delete_experiment
    with self.ManagedSessionMaker() as session:
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/store/db/utils.py", line 155, in make_managed_session
    raise MlflowException(message=e, error_code=BAD_REQUEST)
mlflow.exceptions.MlflowException: (psycopg2.errors.ForeignKeyViolation) update or delete on table "experiments" violates foreign key constraint "datasets_experiment_id_fkey" on table "datasets"
DETAIL:  Key (experiment_id)=(1) is still referenced from table "datasets".

[SQL: DELETE FROM experiments WHERE experiments.experiment_id = %(experiment_id)s]
[parameters: {'experiment_id': 1}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

Other info / logs

The issue is caused solely by how the "datasets" table is handled. I was able to make the "mlflow gc" command work, but only after manually removing all rows from that table:

psql -d mlflowdb -U user --password -h localhost -p 5432
TRUNCATE TABLE datasets;  -- run inside the psql CLI

After that, the "mlflow gc" command removed the data from PostgreSQL and the artifacts from MinIO:

export MLFLOW_TRACKING_URI="http://localhost:5000"
mlflow gc --backend-store-uri="postgresql://user:password@localhost:5432/mlflowdb"
Experiment with ID 1 has been permanently deleted.
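To make the failure mode concrete, here is a minimal sketch that uses Python's built-in sqlite3 module instead of PostgreSQL. The two-table schema is a simplified, hypothetical stand-in for MLflow's "experiments" and "datasets" tables, and naive_hard_delete / hard_delete_with_datasets are hypothetical helpers, not MLflow code. Deleting only the experiment row fails with a foreign-key error, like the DELETE in the stack trace, while deleting the referencing datasets rows first, in the same transaction, succeeds.

```python
import sqlite3

# Simplified, hypothetical schema: only the columns needed to show the
# foreign key from "datasets" to "experiments" (not MLflow's real schema).
SCHEMA = """
CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    lifecycle_stage TEXT NOT NULL
);
CREATE TABLE datasets (
    dataset_uuid TEXT PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments (experiment_id),
    name TEXT NOT NULL
);
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory DB with one soft-deleted experiment that owns a dataset."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    with conn:
        conn.execute("INSERT INTO experiments VALUES (1, 'test', 'deleted')")
        conn.execute("INSERT INTO datasets VALUES ('ds-1', 1, 'diabetes')")
    return conn


def naive_hard_delete(conn: sqlite3.Connection, experiment_id: int) -> None:
    """Delete only the experiment row, like the failing statement in the trace."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM experiments WHERE experiment_id = ?", (experiment_id,))


def hard_delete_with_datasets(conn: sqlite3.Connection, experiment_id: int) -> None:
    """Hypothetical fix: delete referencing datasets first, in one transaction."""
    with conn:
        conn.execute("DELETE FROM datasets WHERE experiment_id = ?", (experiment_id,))
        conn.execute("DELETE FROM experiments WHERE experiment_id = ?", (experiment_id,))


conn = make_db()
try:
    naive_hard_delete(conn, 1)
except sqlite3.IntegrityError as exc:
    print("naive delete failed:", exc)
hard_delete_with_datasets(conn, 1)
print(conn.execute("SELECT COUNT(*) FROM experiments").fetchone()[0])  # 0
```

Running both deletes inside one transaction means a failure part-way through leaves neither table modified.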

What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [X] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • [ ] area/docs: MLflow documentation pages
  • [X] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [X] area/server-infra: MLflow Tracking server backend
  • [X] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [X] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

Mlokos, Dec 14 '23

Thanks for the report! I can reproduce this error, will investigate and get back to you!

daniellok-db, Dec 15 '23

Hello folks. I reproduced the same bug today on the same remote tracking server setup, but with a MySQL database and GCS bucket storage.

A possible workaround, which I applied on my system this morning, is to partially flush the "datasets" table before calling mlflow gc, using the following SQL command (MySQL syntax):

DELETE datasets FROM datasets JOIN experiments ON datasets.experiment_id = experiments.experiment_id WHERE experiments.lifecycle_stage = 'deleted';
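The command above uses MySQL's multi-table DELETE ... JOIN syntax, which PostgreSQL (the backend in the original report) does not accept. The same cleanup written with a subquery is standard SQL and should be accepted by PostgreSQL, MySQL and SQLite alike. The sketch below runs it against an in-memory SQLite database through Python's sqlite3 module; the schema and the purge_deleted_experiment_datasets helper are simplified and hypothetical, not part of MLflow.

```python
import sqlite3

# Standard-SQL form of the workaround: a subquery instead of MySQL's
# multi-table "DELETE ... JOIN" syntax.
CLEANUP_SQL = """
DELETE FROM datasets
WHERE experiment_id IN (
    SELECT experiment_id FROM experiments WHERE lifecycle_stage = 'deleted'
)
"""


def purge_deleted_experiment_datasets(conn: sqlite3.Connection) -> int:
    """Hypothetical helper: drop datasets owned by soft-deleted experiments only.

    Returns the number of datasets rows removed.
    """
    with conn:  # single transaction; rolled back if the DELETE fails
        return conn.execute(CLEANUP_SQL).rowcount


# Simplified, hypothetical schema: one deleted and one active experiment.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY,
    lifecycle_stage TEXT NOT NULL
);
CREATE TABLE datasets (
    dataset_uuid TEXT PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments (experiment_id)
);
INSERT INTO experiments VALUES (1, 'deleted'), (2, 'active');
INSERT INTO datasets VALUES ('ds-1', 1), ('ds-2', 2);
""")

removed = purge_deleted_experiment_datasets(conn)
remaining = [row[0] for row in conn.execute("SELECT dataset_uuid FROM datasets")]
print(removed, remaining)  # 1 ['ds-2']
```

Unlike truncating the whole table, this keeps the datasets of experiments that are still active.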

vbousson, Dec 15 '23

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

github-actions[bot], Dec 22 '23

Can we reopen the issue? The MR has not been merged yet.

efcy, Jul 03 '24

@BenWilson2 seems like this wasn't merged yet. Could you please reopen the issue?

cile98, Jul 10 '24

Please reopen this.

namirinz, Jul 11 '24

Any updates?

wj-c, Aug 22 '24