
[Feat]: Add nvidia EDM2 methods?

Open · ppbrown opened this issue 1 year ago

Describe your use-case.

The new training method(s) described in https://developer.nvidia.com/blog/rethinking-how-to-train-diffusion-models/ sound amazing: more accurate models, with less training. Any chance of this "EDM2" style showing up in OneTrainer?

What would you like to see as a solution?

EDM2 ("optimizer"? not sure where it fits)

Have you considered alternatives? List them here.

No response

ppbrown avatar Jun 27 '24 04:06 ppbrown

Sounds promising.

Key innovations include:

  • magnitude preservation principles
  • controlled learning rate decay
  • eliminating group normalization layers

But since there is also the need to modify the ADM model structure itself (remove norm layers and add 1/4 pixel norm layers), this involves considerable effort.
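As a rough illustration of two of those pieces, here is a minimal PyTorch sketch (names and constants are illustrative, not taken from OneTrainer or the NVlabs/edm2 code): a parameter-free pixel norm of the sort the paper adds after removing group normalization, and the inverse-square-root learning rate decay it describes.

```python
# Minimal sketch, assuming PyTorch; names and constants are illustrative,
# not taken from OneTrainer or the NVlabs/edm2 code.
import torch
import torch.nn as nn

class PixelNorm(nn.Module):
    """Parameter-free normalization: each spatial position's feature vector
    is rescaled to magnitude sqrt(C), i.e. roughly unit variance per channel."""
    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); normalize across channels.
        norm = x.norm(dim=1, keepdim=True) + self.eps
        return x * (x.shape[1] ** 0.5) / norm

def edm2_style_lr(step: int, lr_ref: float = 1e-2, step_ref: int = 70_000) -> float:
    """Controlled decay: constant until step_ref, then inverse-square-root."""
    return lr_ref / max(step / step_ref, 1.0) ** 0.5
```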

larsupb avatar Jul 06 '24 13:07 larsupb

> But since there is also the need to modify the ADM model structure itself (remove norm layers and add 1/4 pixel norm layers), this involves considerable effort.

Isn't this a proposal to build models differently? That's not something OT can do, and I don't think they're proposing to just cut existing norm layers out of existing models, are they?

If I'm right, this issue should be closed.

dxqb avatar Apr 06 '25 11:04 dxqb

I confess I don't understand the depths of the article. But given that the article is specifically titled "Rethinking How to TRAIN Diffusion Models", not "how to build new, better diffusion models", I would think it applies specifically to training, i.e. something OneTrainer could do (such as cutting out layers during training).

ppbrown avatar Apr 06 '25 17:04 ppbrown

They do mention post-hoc EMA (also requested in #759 ), but everything else seems to be an architectural change.
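For context, post-hoc EMA means storing periodic EMA snapshots during training and reconstructing any desired EMA profile afterwards as a weighted combination of them. A conceptual sketch of just the reconstruction step (PyTorch assumed; `reconstruct_ema` is a hypothetical helper, and the coefficient solve is left to the paper's code):

```python
# Conceptual sketch of post-hoc EMA reconstruction, assuming PyTorch.
# The coefficients would come from the least-squares fit described in the
# EDM2 paper (see NVlabs/edm2 for the actual solver); here they are inputs.
import torch

def reconstruct_ema(snapshots: list[dict], coeffs: list[float]) -> dict:
    """Combine stored EMA state_dicts into a new EMA profile."""
    assert len(snapshots) == len(coeffs)
    combined = {}
    for key in snapshots[0]:
        combined[key] = sum(c * s[key].to(torch.float32) for c, s in zip(coeffs, snapshots))
    return combined
```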

Nerogar avatar Apr 06 '25 17:04 Nerogar

To me, the core new idea of the article is:

> To eliminate this, we rescaled the weights to always remain at unit magnitude (with some subtleties—refer to the paper appendix or the code release for details).

Isn't that something that could be done in OT without changing the architecture of the model?

As larsupb says, presumably "this involves considerable effort." But if the paper is to be believed, the payoff could be significant.
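For reference, here is a minimal sketch of that forced weight normalization idea, assuming PyTorch and simplified from my reading of the paper and the NVlabs/edm2 code (so not a drop-in implementation): the stored weights are reset to unit magnitude each training step, and the weight actually used in the forward pass is normalized per output channel.

```python
# Minimal sketch of forced weight normalization, assuming PyTorch.
# Simplified: the paper's version also handles learned gains, fp32 casting,
# and other subtleties mentioned in its appendix.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_per_channel(w: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Rescale each output channel's weight slice to unit magnitude."""
    norm = w.flatten(1).norm(dim=1).clamp_min(eps)
    return w / norm.view(-1, *([1] * (w.ndim - 1)))

class NormalizedConv2d(nn.Module):
    """Conv layer whose effective weight always has unit magnitude per channel."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            with torch.no_grad():
                # Forced normalization: reset the stored parameter itself, so
                # optimizer updates cannot make its magnitude drift over time.
                self.weight.copy_(normalize_per_channel(self.weight))
        # Differentiable normalization of the weight used in the forward pass.
        w = normalize_per_channel(self.weight)
        return F.conv2d(x, w, padding=self.weight.shape[-1] // 2)
```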

ppbrown avatar Apr 06 '25 17:04 ppbrown

No. It requires modifying all the layers that change the activation norm. It's a pretty big architectural change.

Nerogar avatar Apr 06 '25 17:04 Nerogar

Adding random useful links before the bulk of my comment:

Their actual full paper is at: https://arxiv.org/pdf/2312.02696

And working code is at: https://github.com/NVlabs/edm2

That being said, I do see that the paper talks about "injecting layers" to do the normalization. Is it possible they are talking about temporarily injecting the layers for training purposes, but then removing them for delivery of the final product model?

Hmm. According to GPT o1's digestion of the paper, the answer is "no".

Bottom line: These normalization/“weight-bounding” layers are intended to be permanent modifications to the architecture that keep the model stable at all times—rather than ephemeral modules that exist solely for training and get dropped at inference.

So on the one hand, it sounds like a really useful thing. On the other hand, it probably belongs in, at best, a conversion tool under the existing "convert models tools" area.

If you wish to close this issue as "won't do", I would find that understandable. It seems like it could be nice as a placeholder for that convert tool, though.

ppbrown avatar Apr 06 '25 18:04 ppbrown

Interesting, but not for finetuning.

dxqb avatar Apr 13 '25 14:04 dxqb