Develop experiment management module
Is your feature request related to a problem? Please describe.
To record and track training experiments clearly, an experiment management module is necessary.
- Identify the typical user stories
- Identify the features we should support
- ~~Design the module and APIs, which can easily support different backends, like MLFlow, AIM, etc.~~
- Try to apply MLFlow in the Auto3DSeg application.
I'm starting to write bundles which choose a new output directory every time the training script is invoked, so that runs get placed in unique locations. I want to record the loggers to a log file in that directory, but it would also be good to write out the current configuration the bundle is using, so that one can see what was changed from one run to the next. This won't include any auxiliary code the bundle uses, but it would go most of the way towards keeping track of the environment that generated the data in that directory. It is also lighter weight than tools like mlflow and would suit environments where mlflow can't be used.
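A minimal sketch of that lighter-weight approach, assuming the resolved bundle configuration is available as a plain dict; the function name, directory naming scheme, and file names below are illustrative, not part of any MONAI API:

```python
import json
import logging
from datetime import datetime
from pathlib import Path


def setup_run_dir(base_dir: str, config: dict) -> Path:
    """Create a unique output directory, send logging there, and snapshot the config."""
    run_dir = Path(base_dir) / datetime.now().strftime("run_%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # route the root logger's records to a per-run log file
    handler = logging.FileHandler(run_dir / "training.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
    logging.getLogger().addHandler(handler)

    # write the configuration that produced this run, so runs can be diffed later
    with open(run_dir / "config_snapshot.json", "w") as f:
        json.dump(config, f, indent=2, default=str)
    return run_dir
```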
Hi @mingxin-zheng @dongyang0122 ,
Let's start to think about this feature request of Auto3DSeg for the next release.
Thanks in advance.
Hi @Nic-Ma @binliunls @dongyang0122 @ericspod @wyli
Here are some thoughts of mine about MLFlow for Auto3DSeg, from two perspectives: user experience and implementation for the release in MONAI v1.1. Thanks!
- User Experience:
  - User can log experiments on localhost and on a remote tracking server.
  - User can enable MLFlow in the Auto3DSeg modules AlgoGen/BundleGen.
  - There are two new user arguments, `train_local` and `tracking_url`. If the user wants to run all trainings locally, `train_local` should be True and `tracking_url` should be set to 'localhost'; the MLFlow server will then start locally, immediately after BundleGen/AlgoGen. If the server is meant to be remote, a message will be printed asking the user to start the service remotely; it is the user's job to run the server on a remote machine.
  - When the jobs are run locally, the user can continue to use `algo.train()` to start trainings with experiment management ON or OFF.
  - When the jobs are dispatched remotely, the user needs to override the training command with experiment management arguments, including but not limited to `enable_mlflow`, `tracking_url`, `experiment_name`, `params`, `metrics` and so on. Optionally, they can use `algo._create_cmd()` to see the command to run. Below are some drafts of the MLFlow-related arguments for the training to take (see the sketch after this list):
    - `enable_mlflow`: use MLFlow as the backend
    - `tracking_url`: use localhost or a remote IP address for the MLFlow server
    - `experiment_name`: required by MLFlow
    - `params`: a set of keys to log in training (before the iterations)
    - `metrics`: a set of keys to log in training (during the iterations)
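A rough sketch of how the generated `train.py` could expose the proposed arguments. All names come from the draft above and are not an existing MONAI or MLFlow API; the defaults are assumptions for illustration:

```python
import argparse


def parse_experiment_args():
    """Parse the proposed experiment-management arguments (draft interface only)."""
    parser = argparse.ArgumentParser(description="Auto3DSeg training with optional MLFlow tracking")
    parser.add_argument("--enable_mlflow", action="store_true", help="use MLFlow as the tracking backend")
    parser.add_argument("--tracking_url", default="localhost", help="localhost or remote address of the MLFlow server")
    parser.add_argument("--experiment_name", default="auto3dseg", help="experiment name required by MLFlow")
    parser.add_argument("--params", nargs="*", default=[], help="keys logged once, before the iterations")
    parser.add_argument("--metrics", nargs="*", default=[], help="keys logged repeatedly, during the iterations")
    return parser.parse_args()
```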
- Implementation
  1. A new base class `ExperimentManager`, with `MLFlowExperimentManager` as the only subclass in MONAI 1.1.
  2. The `MLFlowExperimentManager` can initiate the server locally and record where it keeps the database. It can print a helper message if the server will be started remotely. (Should the local server use SQLite as the backend?)
  3. The `MLFlowExperimentManager` manages `experiment_name` and `run_name`.
  4. The `MLFlowExperimentManager` manages a list of `params` names to log. About `log_params` in mlflow: `log_param` and `log_params` are for logging anything that is "one-time" for each experiment run, including model parameters and other hyperparameters. An error will be thrown if the same parameter name is logged more than once in the same run.
  5. The `MLFlowExperimentManager` manages another list of `metrics` names to log. About `log_metrics` in mlflow: `log_metric` and `log_metrics` are for logging numerical values during training. Epoch numbers need to be specified; otherwise, MLFlow will report a conflict error.
  6. Support for pictures/text files (artifacts) is excluded from 1.1 unless we make good progress on the other items.
  7. Finally, the names of the params and metrics need to match the training script exactly. For example, if we're tracking `max_epochs` in the param buffer, the variable in `train.py` has to be `max_epochs`; it can't be `total_epochs` or `num_epochs`.
  8. With the assumption in 7, we can iterate over all the buffer items by finding the `params` and `metrics` during the running of `train.py`. If a key matches the name of a variable, it triggers the `mlflow.log_metrics` or `mlflow.log_params` call wrapped inside the `MLFlowExperimentManager` (see the sketch after this list).
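A hedged sketch of how the proposed classes could map onto the existing `mlflow` client API. `ExperimentManager` and `MLFlowExperimentManager` are the names proposed above; the method names, constructor signature, and key-filtering behaviour are assumptions, not a committed design:

```python
import mlflow


class ExperimentManager:
    """Proposed backend-agnostic base class (per the draft above)."""

    def __init__(self, experiment_name: str, params: list, metrics: list):
        self.experiment_name = experiment_name
        self.param_keys = params    # keys logged once per run, before the iterations
        self.metric_keys = metrics  # keys logged repeatedly, with an explicit step

    def start(self, run_name: str = None):
        raise NotImplementedError

    def log_params(self, values: dict):
        raise NotImplementedError

    def log_metrics(self, values: dict, step: int):
        raise NotImplementedError


class MLFlowExperimentManager(ExperimentManager):
    """Sketch of the MLFlow backend; assumes a tracking server is already reachable."""

    def __init__(self, tracking_url: str, experiment_name: str, params: list, metrics: list):
        super().__init__(experiment_name, params, metrics)
        mlflow.set_tracking_uri(tracking_url)
        mlflow.set_experiment(experiment_name)

    def start(self, run_name: str = None):
        # one MLFlow run per training; run_name corresponds to the managed run_name above
        return mlflow.start_run(run_name=run_name)

    def log_params(self, values: dict):
        # one-time values; MLFlow errors if the same key is logged twice in a run
        mlflow.log_params({k: v for k, v in values.items() if k in self.param_keys})

    def log_metrics(self, values: dict, step: int):
        # passing an explicit step (e.g. the epoch number) avoids the conflict error noted above
        mlflow.log_metrics({k: v for k, v in values.items() if k in self.metric_keys}, step=step)
```

With this shape, `train.py` would call `log_params` once before the training loop and `log_metrics(..., step=epoch)` inside it, matching items 4, 5, 7 and 8 above.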
Hi @mingxin-zheng ,
Thanks for the proposal, I agree we should put picture / text logging as a P1 task.
And to unify the naming, we may need to predefine the names in the `MLFlowExperimentManager`.
Thanks.
Hi @Nic-Ma , another reason for making it a P1 task is this MLFlow issue. I am doubtful about the support for logging pictures and text on a remote server.