
[CodeCamp2023-325] Find the proper learning rate

yhna940 opened this issue 2 years ago · 7 comments

Thanks for your contribution; we appreciate it a lot. The following instructions will help make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and seek help from the maintainers.

Motivation

The primary aim of this pull request is to introduce a tuning methodology for automatically determining the optimal learning rate for model training. Hyperparameter tuning, especially finding the optimal learning rate, is crucial for effective model training. The optimal learning rate serves as a starting point that can significantly reduce the time and resources required for broader hyperparameter-space exploration. Given the inherently expensive nature of such experiments, adopting a black-box optimization formulation, where the input is a hyperparameter setting and the output is the corresponding model performance, is a strategic choice.

Modification

In this PR, we've integrated a tuning concept that focuses on black-box optimization strategies, such as evolutionary algorithms and Bayesian optimization, to discover the best learning rates. Recognizing the intricate nature of these strategies, rather than implementing them from scratch we've incorporated external libraries like Nevergrad (developed by Meta), ensuring robustness and efficiency in the search process.
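
For context, Nevergrad drives this kind of black-box search through an ask/tell loop. Below is a minimal sketch of that loop, independent of this PR's actual wrapper; train_and_evaluate is a hypothetical stand-in for running a short training trial and returning the monitored metric:

import nevergrad as ng

def train_and_evaluate(lr: float) -> float:
    """Hypothetical objective: train briefly with `lr`, return a loss."""
    ...

# Search the learning rate on a log scale between 1e-5 and 1e-3.
param = ng.p.Log(lower=1e-5, upper=1e-3)
optimizer = ng.optimizers.NGOpt(parametrization=param, budget=16)

for _ in range(optimizer.budget):
    candidate = optimizer.ask()                 # next lr to try
    loss = train_and_evaluate(candidate.value)  # black-box evaluation
    optimizer.tell(candidate, loss)             # report the score back

best_lr = optimizer.provide_recommendation().value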

Structure & Roles

Tuner

The Tuner serves as the main orchestrator of the hyperparameter tuning process; a sketch of its main loop follows the list below.

  • Responsibilities:
    • Injects hyperparameters into the runner configuration.
    • Initiates the training/evaluation process with the given set of hyperparameters.
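
Put together, the Tuner's main loop might look like the following sketch. Helper names such as inject_hparams and run_trial are illustrative, not this PR's exact API:

class Tuner:
    """Sketch of the tuning orchestrator; names are illustrative."""

    def __init__(self, runner_cfg, searcher, monitor, rule, num_trials):
        self._runner_cfg = runner_cfg
        self._searcher = searcher
        self._monitor = monitor      # metric to optimize, e.g. 'loss'
        self._rule = rule            # 'less' or 'greater'
        self._num_trials = num_trials
        self._history = []

    def tune(self):
        for _ in range(self._num_trials):
            # Ask the searcher for the next point in the hyperparameter space.
            hparam = self._searcher.suggest()
            # Hypothetical helpers: write the values into the runner config,
            # then run a short training/evaluation trial and read its score.
            cfg = inject_hparams(self._runner_cfg, hparam)
            score = run_trial(cfg, self._monitor)
            self._searcher.record(hparam, score)
            self._history.append((hparam, score))
        pick = min if self._rule == 'less' else max
        return pick(self._history, key=lambda item: item[1])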

Report Hook

This component acts as an intermediary, gathering results from the training and evaluation phases and formatting them for further analysis; a sketch follows the list below.

  • Responsibilities:
    • Monitors the training process up to a specified number of tuning iterations or epochs.
    • Extracts key performance metrics and scores from the Runner's outputs.
    • Reports these results in a standardized format, making them ready for analysis and further decision-making.
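
A sketch of such a hook built on MMEngine's Hook interface might look like this; the class name and reduction logic are illustrative and may differ from the PR's implementation:

from mmengine.hooks import Hook

class ReportingHook(Hook):
    """Sketch: collect a monitored metric and enforce the tuning budget."""

    def __init__(self, monitor='train/loss', tuning_epoch=2):
        self.monitor = monitor
        self.tuning_epoch = tuning_epoch
        self.scores = []

    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        # Record the latest value of the monitored metric, if present.
        if self.monitor in runner.message_hub.log_scalars:
            self.scores.append(
                runner.message_hub.get_scalar(self.monitor).current())

    def after_train_epoch(self, runner):
        # Stop the trial once the tuning budget (in epochs) is exhausted.
        if runner.epoch + 1 >= self.tuning_epoch:
            runner.train_loop.stop_training = True

    def report_score(self):
        # Reduce the collected values to a single score for the searcher.
        return sum(self.scores) / len(self.scores) if self.scores else None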

Searcher

The Searcher operates in the hyperparameter space. Using historical data and optimization techniques, it suggests the next set of hyperparameters to evaluate; see the sketch after the list.

  • Responsibilities:
    • Analyzes the history of hyperparameters and their corresponding performance metrics.
    • Suggests a suitable candidate point in the hyperparameter space for the next round of training/evaluation.
    • Can integrate with external optimization libraries/tools such as Hyperopt, Scikit-optimize, or Microsoft's CFO to make informed recommendations.
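
For instance, a searcher wrapping Nevergrad might expose a suggest/record pair along these lines (a sketch; the PR's actual interface and class names may differ):

import nevergrad as ng

class NevergradSearcher:
    """Sketch of a Nevergrad-backed searcher; names are illustrative."""

    def __init__(self, rule, hparam_spec, num_trials, solver_type='NGOpt'):
        assert rule in ('less', 'greater')
        self._rule = rule
        # Map each continuous hyperparameter onto a Nevergrad scalar.
        params = {
            name: ng.p.Scalar(lower=spec['lower'], upper=spec['upper'])
            for name, spec in hparam_spec.items()
        }
        self._solver = ng.optimizers.registry[solver_type](
            parametrization=ng.p.Dict(**params), budget=num_trials)
        self._pending = {}

    def suggest(self):
        candidate = self._solver.ask()
        self._pending[tuple(sorted(candidate.value.items()))] = candidate
        return dict(candidate.value)

    def record(self, hparam, score):
        # Nevergrad minimizes, so negate the score when larger is better.
        candidate = self._pending.pop(tuple(sorted(hparam.items())))
        self._solver.tell(candidate, score if self._rule == 'less' else -score)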

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repos? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

torchrun --nproc_per_node 2 examples/tune/find_lr.py --launcher pytorch

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMCls.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

References

  1. mim grid search: https://github.com/open-mmlab/mim/blob/main/mim/commands/gridsearch.py
  2. pytorch lightning tuning: https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/tuner/tuning.html#Tuner
  3. ray tune searcher: https://docs.ray.io/en/latest/tune/api/suggestion.html

TODO

  • [x] unit tests
  • [x] docstring

yhna940 · Aug 23 '23 09:08

Thank you for your contribution. Your PR message and docstring helped me understand your design. Before we delve into implementation details, let's consider the relationship between the runner and the tuner. Could we have the runner manage the ReportHook and Searcher, or perhaps make the Tuner an attribute of the runner? This would provide users with a friendlier experience, enabling automatic hyperparameter discovery in end-to-end training with the Runner.

HAOCHENYE · Aug 30 '23 04:08

@HAOCHENYE Thank you for your feedback. Your suggestion resonated with me, and I believe you've raised a valid point.

In line with your suggestion, I've added a class method to the Runner class named from_tuning. The overarching idea is to position the Tuner as an auxiliary tool when instantiating the Runner, thus confining the lifespan of the Tuner and enabling the Runner to orchestrate the tuning process.

For example, users can now employ the following streamlined approach:

runner = Runner.from_tuning(
    runner_cfg=runner_cfg,
    hparam_spec={
        'optim_wrapper.optimizer.lr': {
            'type': 'continuous',
            'lower': 1e-5,
            'upper': 1e-3
        }
    },
    monitor='loss',
    rule='less',
    num_trials=16,
    tuning_epoch=2,
    searcher_cfg=dict(type='NevergradSearcher'),
)
runner.train()

This design not only enhances code readability but also provides users with a seamless experience of automatic hyperparameter discovery integrated with end-to-end training using the Runner.

I appreciate your valuable insights and would love to hear any further thoughts or feedback you might have.

yhna940 · Aug 31 '23 08:08

Thank you for your contribution! I believe the current solution is reasonable. However, I'm curious whether it's possible to integrate the call to from_tuning into Runner.train and have the Runner control whether or not to use the Tuner through a parameter. What do you think are the potential risks associated with this approach?

Hello, @HAOCHENYE

Thank you very much for your thoughtful suggestion. Integrating the from_tuning process directly into the Runner.train method and controlling the tuner through a parameter could indeed streamline the workflow and make tuning more seamless.

However, I think there is a potential risk in combining the two methods. Firstly, by the time tuning is invoked within the train method, the runner instance has already been instantiated. This means we would have two co-existing instances: the caller runner and the callee runner inside the tuner. Both instances maintain their own models, optimizers, and data loaders, potentially increasing memory usage considerably.

Furthermore, replacing the caller runner's attributes with the tuned values after tuning could introduce considerable complexity. We might find ourselves defining intricate logic to replace attributes safely, without adverse side effects. For instance, tuning the learning rate would require altering the optimizer's state dict; modifying the number of data samples would require rebuilding the data loader; and so on. Pre-defining rules for attribute replacement is challenging given the many scenarios and combinations that would need to be accounted for.

If we do want to integrate tuning within runner.train, one approach might be to implement lazy initialization for the runner. That way, instantiation of the runner is deferred until the train method is invoked, allowing tuning to complete and the hyperparameters to be decided before the runner instance is created. This could mitigate the concerns above, but it would entail a significant change to how the runner operates and seems too extensive a task to undertake in this PR.

I am eager to hear whether you find these concerns valid, and any other feedback on this matter.

yhna940 · Sep 13 '23 08:09

I'm very sorry for the delayed response. I think your considerations are very reasonable. MMEngine introduced FlexibleRunner in v0.8.0, which can fully lazy-initialize its components, and that should address the situation where the Tuner and Runner each hold a model simultaneously during the training phase. However, this is somewhat unrelated; let's keep the focus on Runner for this PR.

Currently, the Runner instantiates components like the model, visualizer, and logger during initialization. If you want to find the best learning rate during train, you can still do it much as you do now: build a new Runner inside train, find the best parameters, and then inject them into the relevant components of the original Runner, rather than using the current Runner directly to search for the best learning rate. Even so, this approach does not resolve the issue of holding two models simultaneously while searching for the learning rate.

For me, both searching for the learning rate during the training phase and searching for it through the from_tuning interface are acceptable. You can implement it according to your preference 😄.

HAOCHENYE · Sep 20 '23 18:09

Hello @HAOCHENYE,

First and foremost, I'd like to express my gratitude for your comprehensive feedback on the proposal. Your insights and the clarity with which you've approached the problem have been immensely beneficial.

Thank you for suggesting both approaches: integrating tuning into the runner.train method and using the from_tuning interface. After weighing the two alternatives, I find myself gravitating toward the latter, the from_tuning method, for several reasons:

  1. Separation of Concerns: The from_tuning method inherently separates the tuning phase from the training phase. This explicit demarcation gives each phase its own focus, leading to more structured and understandable code.
  2. Avoidance of Co-existing Instances: As discussed above, keeping the caller runner and the callee runner alive simultaneously raises memory and complexity concerns. The from_tuning approach avoids this co-existence, leading to a more memory-efficient and streamlined workflow.

I sincerely hope my approach aligns well with the vision of MMEngine. I'd appreciate further comments, reviews, or feedback on this direction or any other aspect of the PR to refine and improve it.

Thank you once again for your guidance.

yhna940 · Sep 24 '23 11:09