generative-models Design question: Why don't you use v-prediction target?

Hi! First of all, thanks for a very good model. Stable Diffusion v2 used the v-prediction target, and it was argued to be better than the default epsilon prediction. So why do you use the epsilon target again for SDXL training?
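For readers unfamiliar with the two targets, here is a minimal sketch of how the v-prediction target relates to epsilon and the clean sample. It assumes the usual variance-preserving parameterization, x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps; the function names are made up for illustration and are not from the SDXL codebase.

```python
import math

def noisy_sample(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, eps)]

def v_target(x0, eps, alpha_bar):
    """v-prediction target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * e - s * x for x, e in zip(x0, eps)]

def recover_from_v(x_t, v, alpha_bar):
    """Invert the parameterization: both x0 and eps follow from (x_t, v)."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0 = [a * xt - s * vi for xt, vi in zip(x_t, v)]
    eps = [s * xt + a * vi for xt, vi in zip(x_t, v)]
    return x0, eps

# Toy values (hypothetical), with alpha_bar chosen so sqrt gives 0.6 and 0.8.
x0 = [0.5, -1.0]
eps = [0.3, 0.8]
x_t = noisy_sample(x0, eps, 0.36)
v = v_target(x0, eps, 0.36)
x0_rec, eps_rec = recover_from_v(x_t, v, 0.36)
```

One property worth noting: at alpha_bar = 0 (pure noise) the epsilon target is just the input itself, while the v target becomes -x0, so it still carries a learning signal there.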

Aug 07 '23 14:08 bonlime

Same question. Does anybody know the reason?

Aug 11 '23 08:08 JincanDeng

we should also ask what the results of x-prediction were (if it was tested), and why it isn't used.

Aug 13 '23 21:08 bghira

i've got a version of SDXL with v-prediction and zero-terminal SNR :-)

Sep 20 '23 06:09 bghira

@bghira interesting! Could you provide any details on how long the fine-tuning takes? A rough estimate of GPU hours plus the GPU used would be sufficient. Also, how does it compare to vanilla SDXL in your experiments?

Sep 20 '23 08:09 bonlime

on a single A100-80G it's taking an eternity. would love to have the compute that was offered by StabilityAI months ago but I've had to do it all on my own.

the contrast is much better on SDXL once you switch to v-pred / zero-terminal SNR. but coherence suffers, presumably because of my low batch size.
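For context, zero-terminal SNR means rescaling the noise schedule so the last timestep is pure noise. Below is a rough sketch of that rescaling in plain Python. It assumes the scaled-linear beta schedule commonly used for Stable Diffusion (beta_start=0.00085, beta_end=0.012, 1000 steps); those values and the function names are assumptions for illustration, not taken from this thread.

```python
import math

def scaled_linear_betas(n=1000, beta_start=0.00085, beta_end=0.012):
    # Assumed schedule: linear in sqrt(beta), then squared.
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]

def alphas_cumprod(betas):
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def rescale_zero_terminal_snr(betas):
    sqrt_ab = [math.sqrt(ab) for ab in alphas_cumprod(betas)]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value becomes 0, then scale so the first value is unchanged.
    sqrt_ab = [(v - last) * first / (first - last) for v in sqrt_ab]
    ab = [v * v for v in sqrt_ab]
    # Convert the cumulative products back into per-step alphas, then betas.
    alphas = [ab[0]] + [ab[i] / ab[i - 1] for i in range(1, len(ab))]
    return [1.0 - a for a in alphas]

betas = scaled_linear_betas()
new_betas = rescale_zero_terminal_snr(betas)
old_terminal = alphas_cumprod(betas)[-1]      # small but non-zero: some signal leaks through
new_terminal = alphas_cumprod(new_betas)[-1]  # exactly zero: last step is pure noise
```

With a zero-SNR final step the epsilon target is degenerate there (the input already is the noise), which is one reason this schedule is usually paired with v-prediction.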

currently got a test going on 8x A6000 with 4*4*8 batch size configuration, and it learns much more quickly, but at far higher cost.

currently on 16,000 steps and i expect about 50,000-60,000 will be needed to fully reproduce the results of the ByteDance paper that introduced this noise schedule, which is in line with what they report too.

we see 90 seconds per iteration. 400 GPU hours to hit 16,000 steps, or, a little over 2 weeks of constant training.
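The numbers above work out as follows; this is just a quick check of the arithmetic, with the 50,000-60,000 step projection assuming the same 90 seconds per iteration holds throughout.

```python
seconds_per_step = 90
steps_done = 16_000

hours_done = seconds_per_step * steps_done / 3600  # wall-clock hours so far
days_done = hours_done / 24

# Projection to the expected step range, assuming a constant step time.
projected_hours = [seconds_per_step * s / 3600 for s in (50_000, 60_000)]
projected_days = [h / 24 for h in projected_hours]

print(hours_done, round(days_done, 1), projected_hours, projected_days)
```

These are hours of training time; if they refer to wall-clock time on the 8-GPU machine, the total in GPU hours would be eight times higher.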

is anyone from Stability AI even paying attention to this repo anymore? @mcmonkey4eva ?

Sep 20 '23 14:09 bghira

@bghira woah, that's a lot of compute, interesting to see what would come out of it

Sep 20 '23 15:09 bonlime

here's some more cherry-picked results. it's starting to feel like the removal of the attention from the high res layers means the model can't really learn fine details. this is with a timestep training bias toward the final 20% of timesteps, too. you see the fine details end up as a grid of artifacts almost.

another thing is the splotchy contrast, presumably due to the long term use of offset noise during SDXL's initial training. that stuff is basically impossible to remove.

Sep 20 '23 15:09 bghira

"this is with a timestep training bias toward the final 20% of timesteps"

so you're only training the base model on the [0.2, 1] range of timesteps, and plan to use the vanilla refiner on top of it, right? i've also observed that by default the base model is not really good at tiny details, but it doesn't usually matter, since the refiner can improve everything

Sep 20 '23 16:09 bonlime

no, there is no v-prediction refiner. i am training on all 1000 timesteps, but with a bias toward 25% of them.
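One way such a bias could be implemented is weighted timestep sampling: every timestep stays reachable, but a chosen segment is drawn more often. This is only a sketch. The multiplier of 3, the choice of the high-noise end (largest t) as the biased segment, and the function names are all assumptions, since the thread doesn't give the exact scheme.

```python
import random

def biased_timestep_weights(num_timesteps=1000, biased_fraction=0.25, multiplier=3.0):
    # Assumption: the biased segment is the top `biased_fraction` of timesteps.
    cutoff = int(num_timesteps * (1.0 - biased_fraction))
    return [multiplier if t >= cutoff else 1.0 for t in range(num_timesteps)]

def sample_timesteps(weights, batch_size, rng):
    # Draw timesteps with probability proportional to their weight.
    return rng.choices(range(len(weights)), weights=weights, k=batch_size)

weights = biased_timestep_weights()
rng = random.Random(0)
batch = sample_timesteps(weights, 8, rng)

# Share of samples expected to land in the biased segment.
biased_share = sum(weights[750:]) / sum(weights)
```

With these hypothetical settings, half of all sampled timesteps come from the biased quarter, while the other three quarters are still trained on.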

Sep 20 '23 16:09 bghira

just an update on this, i personally went ahead and made a v-prediction model from scratch using min-snr-gamma. you can use it as ptx0/terminus-xl-gamma-v1 or a WIP checkpoint at ptx0/terminus-xl-gamma-training - this one is the latest/greatest.
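For reference, min-snr-gamma is a per-timestep loss weighting that clips the signal-to-noise ratio at gamma, so low-noise timesteps don't dominate training. A minimal sketch follows; gamma = 5 is a commonly used value, and the value used for these checkpoints isn't stated here. The function names are illustrative.

```python
def snr(alpha_bar):
    """Signal-to-noise ratio of x_t for a given cumulative alpha."""
    return alpha_bar / (1.0 - alpha_bar)

def min_snr_weight(snr_value, gamma=5.0, prediction_type="v_prediction"):
    """Loss weight under Min-SNR-gamma for the given prediction target."""
    clipped = min(snr_value, gamma)
    if prediction_type == "epsilon":
        # Undefined at SNR = 0, i.e. at a zero-terminal-SNR final timestep.
        return clipped / snr_value
    if prediction_type == "v_prediction":
        return clipped / (snr_value + 1.0)
    raise ValueError(f"unsupported prediction_type: {prediction_type}")

# Per-sample MSE losses would be multiplied by these weights before averaging.
w_eps = min_snr_weight(10.0, 5.0, "epsilon")
w_v = min_snr_weight(10.0, 5.0, "v_prediction")
```

Note that the v-prediction form stays finite at SNR = 0, whereas the epsilon form divides by zero there.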

some of the more recent observations are that v-prediction works at a much lower CFG and with many fewer steps than an epsilon XL model does. much better fine details and contrast.

no reason to make epsilon models anymore - their only benefit is that training is more stable, which is honestly not a good enough reason to use them. I trained my model on a single GPU.

Nov 07 '23 04:11 bghira

@bghira just to clarify - your experience is that it's better to train from scratch, rather than trying to fine-tune with a new prediction target?

do you think it would be possible to train a v-prediction version for Consistency Models as well (LCM)? Not by you, just theoretically do you envision any problems with that?

Nov 07 '23 07:11 bonlime

terminus-xl-gamma-v2 is released now, with a major improvement in quality.

Dec 18 '23 13:12 bghira
