[REQUEST] Muon Optimizer - Different LR for Different Groups
Is your feature request related to a problem? Please describe. When using the Muon optimizer, the current method is to enable it in the config, and parameters are then grouped into Muon and Adam groups automatically per #7555. However, a single LR is passed to both groups, which is not the standard pattern: the Muon group and the Adam group should be able to use different LRs (usually Muon should have a larger LR). Is there an easy way to enable this behavior on my side? If not, we probably need to add support for different LRs.
Describe the solution you'd like The easiest way to achieve this is probably to have "muon_lr" and "adam_lr" keys instead of a single "lr" key, and then fetch the appropriate value for each group.
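For illustration, a config with the proposed keys might look like the sketch below. The "muon_lr" and "adam_lr" keys are the proposed additions, not existing options, and the other parameter names are only placeholders:

```python
# Hypothetical config sketch: "muon_lr" and "adam_lr" are the proposed new keys.
# The surrounding structure and values are illustrative only.
ds_config = {
    "optimizer": {
        "type": "muon",
        "params": {
            "lr": 2e-4,       # default LR, used by a group when no override is given
            "muon_lr": 2e-2,  # proposed override for the Muon param group
            "adam_lr": 2e-4,  # proposed override for the Adam param group
        },
    },
}
```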
The problem statement: when the optimizer is marked as "muon", both the Adam and Muon optimizers are actually used -- Muon for the >2D weights of hidden layers, Adam for the rest. By default "lr" is the learning rate for both optimizers, which is not sufficient for advanced users.
The solution could be to use "lr" as the default learning rate, with "muon_lr" and "adam_lr" as overrides. If either is specified, the specified value is used instead of the default learning rate; this keeps both config items optional.
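A minimal sketch of the proposed fallback, assuming the param groups have already been split into a Muon group and an Adam group per #7555. The helper name and config keys are illustrative, not existing DeepSpeed API:

```python
# Resolve per-group LRs from the optimizer params dict (hypothetical helper).
def resolve_group_lrs(optimizer_params: dict) -> tuple[float, float]:
    default_lr = optimizer_params["lr"]
    # Each override is optional; fall back to the shared "lr" when absent.
    muon_lr = optimizer_params.get("muon_lr", default_lr)
    adam_lr = optimizer_params.get("adam_lr", default_lr)
    return muon_lr, adam_lr

# Example: only the Muon LR is overridden, the Adam group keeps the default.
muon_lr, adam_lr = resolve_group_lrs({"lr": 2e-4, "muon_lr": 2e-2})
assert (muon_lr, adam_lr) == (2e-2, 2e-4)
```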
@PKUWZP @sfc-gh-truwase FYI.