
[air] Deprecate MlflowTrainableMixin, move to setup_mlflow() function

Open krfricke opened this issue 2 years ago • 3 comments

Why are these changes needed?

Following #29828, this PR introduces a setup_mlflow() function to replace the previous mlflow_mixin decorator.

The current mlflow mixin can't be used with Ray AIR trainers. By adding a setup_mlflow() function that can just be called in (data parallel) function trainers, we follow the recent changes to wandb and provide a similar interface.

setup_mlflow also follows the wandb integration by acting on the configuration dict.
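The intended call pattern might look roughly like the following sketch. The `setup_mlflow` body here is a stand-in that only illustrates the "acting on the configuration dict" behavior; the real function lives in Ray's MLflow integration and talks to an actual MLflow tracking server.

```python
# Sketch only: `setup_mlflow` is a stand-in for the real Ray integration,
# showing just the config-dict behavior described in this PR.

def setup_mlflow(config):
    # Pull MLflow-specific settings out of the trainer config dict,
    # mirroring how the wandb integration acts on the config.
    return (config or {}).pop("mlflow", {}).copy()

def train_func(config):
    # Called once per (data parallel) training worker.
    mlflow_settings = setup_mlflow(config)
    # ... the training loop would then log metrics via MLflow ...
    return mlflow_settings

result = train_func({"lr": 0.01, "mlflow": {"experiment_name": "my_exp"}})
```

The point is that the per-worker train function, not a mixin on the trainable class, performs the MLflow setup, which is what makes this usable from AIR trainers.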

Lastly, this PR moves AIR integration tests to the correct subdirectory and combines the old test_mlflow.py into the new test_integration_mlflow.py.

Related issue number

Closes #27966

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

krfricke avatar Dec 22 '22 11:12 krfricke

Should we raise a warning if both the callback and setup_mlflow are used?

Yard1 avatar Jan 03 '23 10:01 Yard1

It's hard to detect (as we are in different processes). Maybe it works even? :-D

krfricke avatar Jan 03 '23 10:01 krfricke

It probably will, but I imagine it could be an easy mistake to make (I doubt you'd want to log the same metrics twice). Still, saying that in the docs should be enough.

Yard1 avatar Jan 03 '23 10:01 Yard1

Working with MLflow Projects in combination with Ray is quite buggy.

When running an MLflow project, an MLflow session is started. Ray then puts all MLflow logs into that same run, so the losses, parameters, ... are simply overwritten. Same as in #19909.

There are also, imho, some errors in the setup. When trying to put everything into separate runs by calling setup_mlflow(config, rank_zero_only=False), an error is raised that the experiment with the Tune trial name does not exist. This is a bug in my opinion: the experiment_name and run_name are mixed up.

My suggestion is:

  • If no experiment name is provided, use the default experiment (id=0)
  • If an experiment name is provided, actually use it and do not overwrite it with the default_trial_name
  • Rework get_world_rank, because this led to a RuntimeError even though we were in a train session.
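The suggested defaults could be sketched as follows. The function and parameter names are illustrative only, not the actual Ray API:

```python
# Illustrative sketch of the suggested default resolution
# (names are hypothetical, not the actual Ray API).

DEFAULT_EXPERIMENT_ID = "0"  # MLflow's built-in default experiment

def resolve_mlflow_targets(experiment_name=None, run_name=None,
                           default_trial_name=None):
    if experiment_name is None:
        # No experiment name given: fall back to MLflow's default
        # experiment (id=0) instead of inventing one from the trial name.
        experiment = ("id", DEFAULT_EXPERIMENT_ID)
    else:
        # An explicit experiment name must never be overwritten.
        experiment = ("name", experiment_name)
    # Defaulting the *run* name to the trial name is fine.
    run_name = run_name or default_trial_name
    return experiment, run_name
```

This keeps the trial name on the run, where it belongs, while leaving the experiment under the user's control.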

SebastianBodza avatar Mar 02 '23 11:03 SebastianBodza

OK, this seems to come down to the following question: should we use a config API, setup_mlflow(config), or an explicit API, setup_mlflow(experiment_name, experiment_id, tracking_uri, …)?

Currently the config dict has the lowest priority. This means that when any defaults are generated in the code itself, the config dict is ignored. For example:

    # Trial defaults are applied here, BEFORE the config dict is consulted:
    experiment_id = experiment_id or default_trial_id
    experiment_name = experiment_name or default_trial_name

    # Setup mlflow
    mlflow_util = _MLflowLoggerUtil()
    mlflow_util.setup_mlflow(
        tracking_uri=tracking_uri or mlflow_config.get("tracking_uri", None),
        registry_uri=registry_uri or mlflow_config.get("registry_uri", None),
        # ... so these config lookups never fire once the defaults above are set:
        experiment_id=experiment_id or mlflow_config.get("experiment_id", None),
        experiment_name=experiment_name or mlflow_config.get("experiment_name", None),
        tracking_token=tracking_token or mlflow_config.get("tracking_token", None),
        create_experiment_if_not_exists=create_experiment_if_not_exists,
    )
    mlflow_util.start_run(
        # Note: the run name is filled from the *experiment* name here
        run_name=experiment_name,
        tags=tags or mlflow_config.get("tags", None),
        set_active=True,
    )
    mlflow_util.log_params(_config)

Here the experiment_id and experiment_name are filled in with the trial_id and trial_name. Imho this mixes up experiment_name and run_name: defaulting the run_name to the trial_name makes sense to me, but for the experiment_name it does not. We should stick to experiment_id=0 (MLflow's default experiment) there.

Setting the experiment_id to the default_trial_id makes no sense either, because Ray's naming convention is not the same as MLflow's.

We should also switch the order of the or statements so that the config dict rises in priority.
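Concretely, the priority flip could look like this sketch (variable names are illustrative: `explicit` is the keyword argument, `config_value` comes from the config dict, `default` is the code-generated fallback such as the trial name):

```python
# Sketch of the priority flip; names are illustrative.

def resolve_current(explicit, config_value, default):
    # Current order: the default is applied first, so a value from the
    # config dict can never win once a default exists.
    explicit = explicit or default
    return explicit or config_value

def resolve_proposed(explicit, config_value, default):
    # Proposed order: explicit argument, then config dict, then default.
    return explicit or config_value or default
```

With the current order a config-dict `experiment_name` is silently replaced by the trial name; with the proposed order it is only the last-resort default.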

Lastly, since Tune is configured with dicts, I would stick to the call with a dictionary. Additionally, as with wandb, the config is then logged as a dict.

SebastianBodza avatar Mar 02 '23 11:03 SebastianBodza