
Develop experiment management module

Open Nic-Ma opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe. To record and track training experiments clearly, an experiment management module is necessary.

  1. Identify the typical user stories
  2. Identify the features we should support
  3. ~Design the module and APIs, which can easily support different backends, like MLFlow, AIM, etc.~
  4. Try to apply MLFlow in the Auto3DSeg application.

Nic-Ma avatar Aug 12 '22 16:08 Nic-Ma

I'm starting to write bundles which choose a new output directory every time the training script is invoked, so that runs get placed in unique locations. I want to direct the loggers to a log file in that directory, but it would also be good to write out the current configuration the bundle is using, so that one can see what changed from one run to the next. This won't include any auxiliary code the bundle uses, but it would go most of the way toward keeping track of the environment that generated the data in that directory. This is also lighter weight than tools like mlflow and would suit environments where those can't be used.
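A minimal sketch of that idea (a hypothetical helper, not an existing MONAI API): create a unique run directory, dump the in-use configuration next to the outputs, and attach a log-file handler so loggers write into the same place.

```python
import json
import logging
from datetime import datetime
from pathlib import Path


def prepare_run_dir(base_dir: str, config: dict) -> Path:
    """Create a unique run directory, snapshot the config, and attach a log file."""
    run_dir = Path(base_dir) / datetime.now().strftime("run_%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    # Snapshot the configuration the bundle is using for this run,
    # so later runs can be diffed against it.
    with open(run_dir / "config.json", "w") as f:
        json.dump(config, f, indent=2)
    # Route Python logging output to a file inside the run directory.
    logging.getLogger().addHandler(logging.FileHandler(run_dir / "train.log"))
    return run_dir
```

This captures the configuration but not any auxiliary code, matching the trade-off described above.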

ericspod avatar Sep 02 '22 15:09 ericspod

Hi @mingxin-zheng @dongyang0122 ,

Let's start to think about this feature request of Auto3DSeg for the next release.

Thanks in advance.

Nic-Ma avatar Sep 19 '22 03:09 Nic-Ma

Hi @Nic-Ma @binliunls @dongyang0122 @ericspod @wyli

Here are some thoughts of mine about MLFlow for Auto3DSeg, from two perspectives: user experience and implementation, targeting the MONAI v1.1 release. Thanks!

  • User Experience:
  1. User can log experiments on localhost and remote tracking server.
  2. User can enable MLFlow in Auto3DSeg modules AlgoGen/BundleGen.
  3. There are two new user arguments, train_local and tracking_url. If the user wants to run all trainings locally, train_local should be True and tracking_url should be set to 'localhost'; the MLFlow server will then start locally right after BundleGen/AlgoGen. If the server is meant to be remote, a message will be printed telling the user to start the service remotely; it is the user's job to run the server on the remote machine.
  4. When the jobs are run locally, the user can continue to use algo.train() to start trainings with experiment management ON or OFF.
  5. When the jobs are dispatched remotely, the user needs to override the training command with experiment management arguments, including but not limited to enable_mlflow, tracking_url, experiment_name, params, metrics, and so on. Optionally, they can use algo._create_cmd() to see the command to run. Below are some drafts of MLFlow-related arguments for the training to take:
    • enable_mlflow: use mlflow as backend
    • tracking_url: use localhost or remote ip address for the mlflow server
    • experiment_name: the experiment name, required by mlflow
    • params: a set of keys to log in training (before the iterations)
    • metrics: a set of keys to log in training (during the iterations)
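A sketch of how such a command override could be assembled; the argument names (enable_mlflow, tracking_url, experiment_name, params, metrics) are the drafts from this proposal, not a released API:

```python
def build_train_cmd(base_cmd: str, enable_mlflow: bool, tracking_url: str,
                    experiment_name: str, params: list, metrics: list) -> str:
    """Append the draft experiment-management flags to a training command."""
    if not enable_mlflow:
        return base_cmd
    return (
        f"{base_cmd} --enable_mlflow True"
        f" --tracking_url {tracking_url}"
        f" --experiment_name {experiment_name}"
        f" --params {','.join(params)}"      # keys to log once per run
        f" --metrics {','.join(metrics)}"    # keys to log during iterations
    )
```

For example, `build_train_cmd("python train.py", True, "localhost", "exp1", ["max_epochs"], ["val_mean_dice"])` would produce a command carrying all five draft flags.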
  • Implementation
  1. A new base class ExperimentManager, with MLFlowExperimentManager as its only subclass in MONAI 1.1.
  2. The MLFlowExperimentManager can start the server locally and record where it keeps the database. It can print a helper message if the server is to be started remotely. (Should the local server use SQLite as the backend?)
  3. The MLFlowExperimentManager manages experiment_name and run_name.
  4. The MLFlowExperimentManager manages a list of param names to log. About log_params in mlflow:
log_param and log_params, for logging anything that is "one-time" for each experiment run, including model parameters and other hyperparameters. An error will be thrown if the same parameter name is logged more than once in the same run.
  5. The MLFlowExperimentManager manages another list of metric names to log. About log_metrics in mlflow:
log_metric and log_metrics, for logging numerical values during training. Epoch numbers need to be specified; otherwise, MLFlow will report a conflict error.
  6. Support for pictures/text files (artifacts) is excluded from 1.1 unless we make good progress with the other items.
  7. Finally, the names of params and metrics need to match the variable names exactly. For example, if we're tracking max_epochs in the param buffer, the variable in train.py has to be max_epochs. It can't be total_epochs or num_epochs.
  8. With the assumption in 7, we can iterate over all buffer items to find the params and metrics during the run of train.py. If a key is the name of a variable, it triggers the mlflow.log_metrics or mlflow.log_params call wrapped inside the MLFlowExperimentManager.

mingxin-zheng avatar Oct 19 '22 07:10 mingxin-zheng

Hi @mingxin-zheng ,

Thanks for the proposal. I agree we should treat picture/text logging as a P1 task. And to unify the naming, we may need to predefine the names in the MLFlowExperimentManager.

Thanks.

Nic-Ma avatar Oct 19 '22 14:10 Nic-Ma

Hi @Nic-Ma , another reason for making it a P1 task is this MLFlow issue. I am doubtful about the support for logging pictures and text on a remote server.

mingxin-zheng avatar Oct 20 '22 03:10 mingxin-zheng