Pause and Resume AutoML experiments
I wonder if there are any plans to provide a way to pause and resume ML.NET AutoML experiments, including across an exit of the ML.NET process.
One can use AutoML to run experiments and set a duration limit in seconds. However, this requires the process to keep running for the entire duration of the experiment.
Has anyone tried running an experiment and implementing some sort of pause/resume functionality?
Use Case: Pause an experiment and resume after reboot
- Developer starts an AutoML Regression experiment that should run for 20 hours
- 8 hours into the experiment developer pauses the experiment
- ML.NET persists the current state and results of the experiment
- Developer shuts down or reboots the OS
- Developer resumes the experiment
- ML.NET picks up where it left off and continues the experiment until it has run the whole 20 hours
The real use case for me is running long experiments at night. I'm not using my computer between 02:00 and 07:00. How can I run an experiment for those 5 hours, pause it, and continue the next night?
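To make the use case above concrete, here is a minimal, language-agnostic sketch in Python of a trial loop that persists its state after every trial and can be resumed by a later process. It is not ML.NET API: `run_session`, `load_state`, `save_state`, the JSON layout, and the `run_trial` callback are all hypothetical names chosen for illustration. It also only pauses between trials, and a real tuner would additionally have to persist its own search state.

```python
import json
import os
import time

def load_state(path):
    """Return the saved experiment state, or a fresh one if no file exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"elapsed_s": 0.0, "trials": [], "best": None}

def save_state(path, state):
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def run_session(path, total_budget_s, session_budget_s, run_trial, clock=time.monotonic):
    """Run trials until this session's window or the overall budget is used up.

    run_trial(trial_id) is a hypothetical callback returning a metric
    where higher is better. State is saved after every completed trial.
    """
    state = load_state(path)
    session_start = clock()
    while True:
        if clock() - session_start >= session_budget_s:
            break  # tonight's window is over; resume in a later session
        if state["elapsed_s"] >= total_budget_s:
            break  # the whole experiment budget has been spent
        trial_id = len(state["trials"])
        started = clock()
        metric = run_trial(trial_id)
        duration = clock() - started
        state["elapsed_s"] += duration
        state["trials"].append({"id": trial_id, "metric": metric, "duration_s": duration})
        if state["best"] is None or metric > state["best"]["metric"]:
            state["best"] = {"id": trial_id, "metric": metric}
        save_state(path, state)
    return state
```

For the nightly scenario, a scheduler would call `run_session("experiment.json", 20 * 3600, 5 * 3600, run_trial)` at 02:00 each night until the 20-hour budget is spent. Note that a trial already in progress when the window ends is allowed to finish, so a session can overrun by up to one trial.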
This is something we are working on in tooling right now as we continue to improve our AutoML experience!
@LittleLittleCloud can provide more details if you have additional questions.
I would like to add that it would be great to have "skip current trial" and "stop training" functionality alongside the pause/resume mentioned above, ideally accessible through the experiment's IMonitor interface.
Sometimes certain trials in an experiment run significantly longer than the previous ones. In those cases I would like to implement logic that skips that particular trial, because it is probably not worth the time, in the hope that the next trial runs within the time limits I expect.
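The kind of skip logic I have in mind could look roughly like the following Python sketch. It is only an illustration of cooperative cancellation under a per-trial time limit, not how ML.NET or its IMonitor works: `TrialSkipped`, `run_trial_with_limit`, `run_experiment`, and the `check` callback are hypothetical names, and it assumes the trainer calls `check()` periodically (for example once per boosting iteration).

```python
import time

class TrialSkipped(Exception):
    """Raised inside a trial when its time limit has passed."""

def run_trial_with_limit(trial, limit_s, clock=time.monotonic):
    """Run one trial; return its metric, or None if it exceeded limit_s.

    trial(check) is a hypothetical callable that must call check()
    regularly while training so it can be interrupted.
    """
    deadline = clock() + limit_s

    def check():
        if clock() > deadline:
            raise TrialSkipped()

    try:
        return trial(check)
    except TrialSkipped:
        return None  # give up on this trial and move on to the next one

def run_experiment(trials, limit_s, clock=time.monotonic):
    """Run every trial in order, recording None for the skipped ones."""
    results = []
    for trial_id, trial in enumerate(trials):
        results.append((trial_id, run_trial_with_limit(trial, limit_s, clock)))
    return results
```

The limitation of this approach is that it only works if the trainer cooperates. A trainer that never calls `check()` could only be stopped by running it in a separate process and killing that process.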
I'll include my recent logs to illustrate my point:
14:38:16.0 info: BBD.BodyMonitor.API[0] Bio Balance Detector Body Monitor v0.9.0.0
14:38:16.0 info: BBD.BodyMonitor.API[0] (The current UTC time is 2022-09-11 12:38:16)
14:38:16.6 info: Microsoft.Hosting.Lifetime[14] Now listening on: https://localhost:7061
14:38:16.6 info: Microsoft.Hosting.Lifetime[14] Now listening on: http://localhost:5061
14:38:16.6 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
14:38:16.6 info: Microsoft.Hosting.Lifetime[0] Hosting environment: Development
14:38:16.6 info: Microsoft.Hosting.Lifetime[0] Content root path: C:\Work\BioBalanceDetector\Software\Source\BBDProto08\BBD.BodyMonitor\BBD.BodyMonitor.API\
14:38:29.7 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Starting model training for 1800 seconds, using 'BBD_20220829_TrainingData_MLP09_5p0Hz-150000Hz_IsSubject_AndrasFuchs_3500rows.csv' as data source with the 'MLP09' profile.
14:39:05.9 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 0 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,03 - Duration: 36 seconds
14:39:05.9 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] New Best Trial # 0 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,03
14:39:54.1 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 1 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,71 - Duration: 48 seconds
14:40:41.7 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 2 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,92 - Duration: 48 seconds
14:40:41.7 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] New Best Trial # 2 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,92
14:41:17.5 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 3 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,66 - Duration: 36 seconds
14:42:25.2 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 4 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,91 - Duration: 68 seconds
14:43:16.6 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 5 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,81 - Duration: 51 seconds
14:44:08.7 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 6 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,95 - Duration: 52 seconds
14:44:08.8 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] New Best Trial # 6 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,95
14:45:04.6 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 7 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,90 - Duration: 56 seconds
14:46:17.7 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 8 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,89 - Duration: 73 seconds
14:47:05.3 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 9 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,93 - Duration: 48 seconds
14:48:08.6 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 10 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,95 - Duration: 63 seconds
14:49:18.9 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 11 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,55 - Duration: 70 seconds
14:50:11.2 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 12 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,85 - Duration: 52 seconds
14:51:03.6 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 13 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,92 - Duration: 52 seconds
14:52:11.3 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 14 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,97 - Duration: 68 seconds
14:52:11.3 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] New Best Trial # 14 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,97
15:01:54.2 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 15 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: 0,89 - Duration: 583 seconds
15:03:03.9 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 16 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,97 - Duration: 70 seconds
15:08:05.1 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] Completed Trial # 17 - Pipeline: ReplaceMissingValues=>Concatenate=>FastTreeRegression - Metric: -0,98 - Duration: 301 seconds
15:08:29.9 info: BBD.BodyMonitor.API.Controllers.MachineLearningController[0] AutoML result: 0,9660835989610341. Saving model as 'o:\Work\BBD.BodyMonitor\BBD_20220829_TrainingData_MLP09_5p0Hz-150000Hz_IsSubject_AndrasFuchs_3500rows.zip'
You can see here that trials # 15 and # 17 ran much longer (583 and 301 seconds) than the others, which all finished in 36 to 73 seconds.
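As a rough way to quantify "much longer", the durations from the log above can be compared against their median. The sketch below is Python and purely illustrative: `find_slow_trials` is a hypothetical helper, and the threshold of 3 times the median is an assumption picked for this example, not something AutoML uses.

```python
from statistics import median

# Trial durations in seconds, copied from the log above (trials # 0 to # 17).
durations = [36, 48, 48, 36, 68, 51, 52, 56, 73, 48,
             63, 70, 52, 52, 68, 583, 70, 301]

def find_slow_trials(durations, factor=3.0):
    """Return indices of trials that took more than factor * median duration."""
    threshold = factor * median(durations)
    return [i for i, d in enumerate(durations) if d > threshold]

print(median(durations))            # 54.0
print(find_slow_trials(durations))  # [15, 17]
```

In a live experiment the median is only known for the trials completed so far, so a rule like this would need a few finished trials before it can be applied.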
We are now exploring the option of continuing training in AutoML. As for excessively long trials, they most likely occur in tree-based trainers when NumberOfTree or NumberOfIteration is large. I'll push a PR to mitigate this.