orthogonal lora layer init
see: https://datta0.github.io/posts/rethink-lora-init/
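For readers who don't want to follow the link, here is a minimal sketch of the general idea, assuming a standard lora_A/lora_B pair and torch.nn.init.orthogonal_; the exact scaling, and how (or whether) the pair is kept a no-op at initialization, may differ in the PR:

import torch.nn as nn

def orthogonal_lora_init(lora_A: nn.Linear, lora_B: nn.Linear) -> None:
    # Fill both low-rank factors with random (semi-)orthogonal matrices,
    # instead of deriving them from the pretrained weight as OLoRA/PiSSA do.
    nn.init.orthogonal_(lora_A.weight)
    nn.init.orthogonal_(lora_B.weight)

# Example: rank-8 adapter for a 4096 -> 4096 projection
lora_A = nn.Linear(4096, 8, bias=False)
lora_B = nn.Linear(8, 4096, bias=False)
orthogonal_lora_init(lora_A, lora_B)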
This is just OLoRA but starting from random weights. How can starting from random weights, rather than extracting that information from the pretrained weights, converge faster? Did you actually run tests? In our research, and in every other subsequent study, OLoRA and other derivatives such as PiSSA performed better than any random initialization. For a list of studies, see.
@tokenizer-decode Thanks for commenting. It would indeed be nice to see a comparison with OLoRA or PiSSA, which the linked blog post didn't test. I could see an argument for the proposed initialization method being easier to use, since the base weights are unchanged, so even if it's not quite as good, there could still be some value. WDYT?
I honestly don't see the performance benefit. But if you think there is an ease-of-use benefit, there could be some value.
This goes for every other decomposition method as well, e.g. SVD. If the value lies in not updating the base weights, we could let the user pass a parameter like no_update that turns off the part where we update the base weights (see the sketch below).
But I should add, for future readers who are confused: updating the base weights is generally where the performance gain comes from.
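To make the no_update idea concrete, here is a hypothetical sketch of an SVD-style init where folding the extracted component out of the base weight is optional (the function and parameter names are illustrative, not an existing PEFT API):

import torch

def svd_style_init(base_weight: torch.Tensor, r: int, no_update: bool = False):
    # Build the adapter from the top-r singular components of the pretrained weight
    U, S, Vh = torch.linalg.svd(base_weight, full_matrices=False)
    lora_B = U[:, :r] * S[:r].sqrt()              # (out_features, r)
    lora_A = S[:r].sqrt().unsqueeze(1) * Vh[:r]   # (r, in_features)
    if not no_update:
        # PiSSA-style: subtract the extracted component from the base weight
        # so the overall model output is unchanged at initialization
        base_weight = base_weight - lora_B @ lora_A
    return lora_A, lora_B, base_weight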
If it's easier, I can convert this so that init_lora accepts a callable and users can provide their own initialization function.
EDIT: something like
from typing import Protocol

class InitLoraWeights(Protocol):
    def __call__(self, layer, adapter_name) -> None:
        ...
and the Config typing would look something like:
bool | Literal[...] | InitLoraWeights
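Under that proposal, a user-supplied initializer might look like this (purely illustrative; the callable-based API is only being proposed here, and the adapter attribute layout is assumed):

import torch.nn as nn

def my_orthogonal_init(layer, adapter_name) -> None:
    # Orthogonal A, zero B, so the adapter starts as a no-op
    nn.init.orthogonal_(layer.lora_A[adapter_name].weight)
    nn.init.zeros_(layer.lora_B[adapter_name].weight)

# config = LoraConfig(..., init_lora_weights=my_orthogonal_init)  # hypothetical usage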
Here's GRPO + PEFT. The olora initialization goes straight to 0.0 rewards after the first step.
Thanks for running the tests :tada: Is the script open so that we can check what might be going on with OLoRA?
If it's easier, I can convert this so that init_lora accepts a callable and users can provide their own initialization function.
In general, we would like to avoid this, even though it could be practical. The reason is that we wouldn't be able to serialize the LoraConfig to JSON if it contains values that are Python code.
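For illustration, the serialization problem in a nutshell:

import json

def my_init(layer, adapter_name) -> None: ...

json.dumps({"init_lora_weights": True})     # works
json.dumps({"init_lora_weights": "olora"})  # works
json.dumps({"init_lora_weights": my_init})  # TypeError: Object of type function is not JSON serializable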
In sum, I think we can still proceed with the orthogonal weight initialization method. As mentioned, even if it does not outperform OLoRA or similar methods, it could still be valuable as a more user-friendly option.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
@winglian Do you have time to finish the PR? If not, let us know so that one of us can take over.
@winglian I finished up the PR in #2498, would be grateful if you could take a look. Of course, I would add you as a co-author (we could add @datta0 as well).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
@winglian I merged #2498, which supersedes this PR, so I'm closing it now. I added you as co-author.