
Data Engine: Autolog to MLflow on datasource queries

Open kbolashev opened this issue 10 months ago • 2 comments

Implemented in this PR:

  • Added a link to open the query in the gallery view
  • Added autologging of queries to MLflow, triggered only on ds.all()
  • The logging happens in a parallel thread so as not to interfere with the querying
  • Querying works with multiple datasources, correctly saving the query for each datasource
  • A new run is created in the last-modified experiment of each queried datasource; all of that datasource's queries are then logged into that run
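A rough sketch of what the background logging described above could look like. The function name and signature are hypothetical, not the actual client code; the point is only that serialization and the MLflow call happen off the query thread:

```python
import json
import threading

def autolog_query(log_artifact_fn, datasource_name, query_dict):
    """Serialize the query and log it via the given callback in a
    background daemon thread, so logging never blocks the query itself.
    log_artifact_fn is a stand-in for the actual MLflow logging call."""
    def _log():
        payload = json.dumps(query_dict)
        log_artifact_fn(f"autolog_{datasource_name}.dagshub.json", payload)

    thread = threading.Thread(target=_log, daemon=True)
    thread.start()
    return thread
```

Returning the thread lets the caller join it when a run is being closed, which is relevant to the caveat discussed below.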

Left to implement:

  • Right now all the artifacts get logged with the same name, so each one overwrites the previous. I need to add a per-datasource counter for how many logs have been saved.
  • It would probably be a good idea to make the autologging toggleable, because the created runs are never closed and will pollute the environment; advanced users should be able to turn this off.
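One possible shape for the opt-out toggle mentioned above. The environment variable name and the helper functions here are purely illustrative assumptions; the real mechanism might equally be a client setting or a context manager:

```python
import os

def autolog_enabled():
    """Hypothetical switch: autologging is on unless the user opts out
    via an environment variable (name assumed, not the real one)."""
    return os.environ.get("DAGSHUB_DISABLE_QUERY_AUTOLOG", "").lower() not in ("1", "true")

def maybe_autolog(log_fn):
    """Run the logging callback only when autologging is enabled."""
    if autolog_enabled():
        log_fn()
```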

Caveats/bugs:

  • Closing a run doesn't work if a logging session to another repo's MLflow is in progress. Possible solution: if the current datasource is not in the active run's repo, log in the foreground instead of the background.
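The proposed workaround could be sketched as a simple branch on repo ownership. Everything here (the `repo` attribute, the function shape) is an assumption for illustration, not the client's actual API:

```python
import threading

def log_query_safely(active_run_repo, datasource, log_fn):
    """If the datasource belongs to the same repo as the active MLflow run,
    log in a background thread; otherwise log synchronously so an in-flight
    background session can't block closing the run."""
    if datasource.repo == active_run_repo:
        thread = threading.Thread(target=log_fn, daemon=True)
        thread.start()
        return thread
    # Different repo: foreground logging avoids the hang on run close.
    log_fn()
    return None
```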

kbolashev avatar Apr 14 '24 08:04 kbolashev

Changes to be done:

  • Log ONLY to the active run; as it turns out, there is no point in creating a run on the datasource's repo
  • Come up with better naming for these cases (probably repo + datasource + index); it would be good if the index could also be retrieved from MLflow

kbolashev avatar Apr 14 '24 15:04 kbolashev

Changed it so it logs to the current active run. The artifact name format is autolog_<datasource_name>_<#>.dagshub.json. Decided that adding the repo name would be too verbose.
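A minimal sketch of building names in that format with a per-datasource index, so repeated queries no longer overwrite each other. The starting index (0 here) and the helper name are assumptions:

```python
import itertools
from collections import defaultdict

# One independent counter per datasource name.
_counters = defaultdict(itertools.count)

def autolog_artifact_name(datasource_name):
    """Return the next artifact name in the
    autolog_<datasource_name>_<#>.dagshub.json format."""
    index = next(_counters[datasource_name])
    return f"autolog_{datasource_name}_{index}.dagshub.json"
```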

Example: (screenshot of the logged artifact)

kbolashev avatar Apr 30 '24 12:04 kbolashev