
Deterministic LoRA initialization

dxqb opened this issue 8 months ago · 2 comments

I noticed during multi-GPU experiments that the model parameters weren't the same on all GPUs. This is because the LoRA initialization used the system seed and was therefore not deterministic. This PR makes the initialization deterministic, which is also useful for single-GPU training: we have wondered before why repeating the same training with the same parameters doesn't produce the same outcome.
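
As a rough illustration of the idea (a sketch only, not OneTrainer's actual code; the helper name, the `seed` parameter, and the layer names are made up for this example), the LoRA matrices can be initialized from a dedicated, fixed-seed generator instead of the global RNG:

```python
import torch
import torch.nn as nn

def init_lora_weights(lora_down: nn.Linear, lora_up: nn.Linear, seed: int = 42) -> None:
    # Hypothetical helper: draws the LoRA weights from a dedicated generator
    # so the result does not depend on the global (system) seed.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    with torch.no_grad():
        # Common LoRA scheme: random init for the down projection,
        # zeros for the up projection so the adapter starts as a no-op.
        lora_down.weight.copy_(
            torch.randn(lora_down.weight.shape, generator=generator)
            / lora_down.in_features ** 0.5
        )
        lora_up.weight.zero_()
```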

dxqb avatar Apr 20 '25 10:04 dxqb

I am tending towards closing this PR:

  • it is more complicated than expected
  • Multi-GPU https://github.com/Nerogar/OneTrainer/pull/816 no longer requires it. Since this commit https://github.com/Nerogar/OneTrainer/pull/816/commits/74633b87633cbe27bccf2b580b8126c569d6fe4e, instead of relying on deterministic initialization on every GPU, the parameters of GPU 0 are broadcast to all other GPUs so that training starts from an identical model state (see the sketch after this list)
  • broadcasting is also safer in case of (current and future) bugs in deterministic initialization
  • the benefit otherwise is quite limited
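
For reference, the broadcast approach mentioned above looks roughly like this (a sketch under the assumption that torch.distributed has already been initialized; it is not the code from the linked commit):

```python
import torch
import torch.distributed as dist

def broadcast_model_from_rank0(model: torch.nn.Module) -> None:
    # Copy rank 0's parameters and buffers to all other ranks so every
    # process starts from an identical model state.
    for tensor in list(model.parameters()) + list(model.buffers()):
        dist.broadcast(tensor.data, src=0)
```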

Unless someone is very interested in deterministic LoRA initialization for other reasons, I propose closing this PR without merging.

dxqb avatar Jul 10 '25 17:07 dxqb

I think there is still value in having this feature, but it's more of a "nice to have". Deterministic initialization could improve reproducibility of training runs, which can make testing easier.

Nerogar avatar Jul 19 '25 19:07 Nerogar