
Data Engine: Autolog to MLflow on datasource queries

Open kbolashev opened this issue 10 months ago • 2 comments

Implemented in this PR:

  • Added a link to open the query in the gallery view
  • Added autologging of queries to MLflow, triggered only on ds.all()
  • The logging happens in a parallel thread so as not to interfere with the querying
  • Querying works with multiple datasources, correctly saving the query for each datasource
  • A new run is created in the last-modified experiment of each queried datasource; all of that datasource's queries are then logged into that run
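A rough sketch of what the background logging described above could look like. The function name and signature are hypothetical, not the actual client code; the point is only that serialization and the MLflow call happen off the query thread:

```python
import json
import threading

def autolog_query(log_artifact_fn, datasource_name, query_dict):
    """Serialize the query and log it via the given callback in a
    background daemon thread, so logging never blocks the query itself.
    log_artifact_fn is a stand-in for the actual MLflow logging call."""
    def _log():
        payload = json.dumps(query_dict)
        log_artifact_fn(f"autolog_{datasource_name}.dagshub.json", payload)

    thread = threading.Thread(target=_log, daemon=True)
    thread.start()
    return thread
```

Returning the thread lets the caller join it when a run is being closed, which is relevant to the caveat discussed below.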

Left to implement:

  • Right now all the artifacts get logged with the same name, so each one overwrites the previous. I need to add a per-datasource counter for how many logs have been saved.
  • It would probably be a good idea to make the autologging toggleable, because the created runs are never closed and will pollute the environment; advanced users should be able to turn this off.
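One possible shape for the opt-out toggle mentioned above. The environment variable name and the helper functions here are purely illustrative assumptions; the real mechanism might equally be a client setting or a context manager:

```python
import os

def autolog_enabled():
    """Hypothetical switch: autologging is on unless the user opts out
    via an environment variable (name assumed, not the real one)."""
    return os.environ.get("DAGSHUB_DISABLE_QUERY_AUTOLOG", "").lower() not in ("1", "true")

def maybe_autolog(log_fn):
    """Run the logging callback only when autologging is enabled."""
    if autolog_enabled():
        log_fn()
```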

Caveats/bugs:

  • Closing a run doesn't work if a logging session to another repo's MLflow is in progress. Possible solution: if the current datasource is not in the active run's repo, log in the foreground instead of the background.
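The proposed workaround could be sketched as a simple branch on repo ownership. Everything here (the `repo` attribute, the function shape) is an assumption for illustration, not the client's actual API:

```python
import threading

def log_query_safely(active_run_repo, datasource, log_fn):
    """If the datasource belongs to the same repo as the active MLflow run,
    log in a background thread; otherwise log synchronously so an in-flight
    background session can't block closing the run."""
    if datasource.repo == active_run_repo:
        thread = threading.Thread(target=log_fn, daemon=True)
        thread.start()
        return thread
    # Different repo: foreground logging avoids the hang on run close.
    log_fn()
    return None
```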

kbolashev avatar Apr 14 '24 08:04 kbolashev

Changes to be done:

  • Log ONLY to the active run; as it turns out, there is no point in creating a run on the datasource's repo
  • Come up with better naming for these cases (probably repo + datasource + index); it would be good if the index could also be retrieved from MLflow

kbolashev avatar Apr 14 '24 15:04 kbolashev

Changed it so it logs to the current active run. The artifact name format is autolog_<datasource_name>_<#>.dagshub.json. Decided that adding the repo name would be too verbose.
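A minimal sketch of building names in that format with a per-datasource index, so repeated queries no longer overwrite each other. The starting index (0 here) and the helper name are assumptions:

```python
import itertools
from collections import defaultdict

# One independent counter per datasource name.
_counters = defaultdict(itertools.count)

def autolog_artifact_name(datasource_name):
    """Return the next artifact name in the
    autolog_<datasource_name>_<#>.dagshub.json format."""
    index = next(_counters[datasource_name])
    return f"autolog_{datasource_name}_{index}.dagshub.json"
```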

Example: (screenshot of the logged artifact)

kbolashev avatar Apr 30 '24 12:04 kbolashev