Make VAE step sequential to prevent VRAM spikes, will fix #3059, #2082, #2561, #3462

Open aleksusklim opened this issue 2 years ago • 5 comments

Many people are upset when they have just enough VRAM to process a large batch of images, but CUDA usage spikes at the very end, throwing an error at 100% progress and not saving the images.

This comes from the final VAE step. The code:

samples_ddim = samples_ddim.to(devices.dtype_vae)
x_samples_ddim = decode_first_stage(p.sd_model, samples_ddim)

I believe the problem is that we ask the VAE to decode the whole batch of images from latent space at once, which requires additional VRAM, since we don't delete the latent representation from CUDA memory.
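
Rough numbers to show how this scales with batch size (back-of-the-envelope only, assuming the usual SD layout of a 4-channel latent at 1/8 resolution and a 3-channel full-resolution output; the decoder's intermediate activations grow with batch size in the same way and are what actually dominate the spike):

# Back-of-the-envelope only: tensor sizes for a 768x768 batch of 8, assuming
# the usual SD layout (4-channel fp16 latent at 1/8 resolution, 3-channel
# fp32 image at full resolution). The decoder's intermediate activations,
# which also scale linearly with batch size, add far more on top of this.
batch, h, w = 8, 768, 768

latent_bytes = batch * 4 * (h // 8) * (w // 8) * 2   # latents kept in VRAM
image_bytes = batch * 3 * h * w * 4                  # decoded image batch

print(f"latents: {latent_bytes / 2**20:.1f} MiB")    # ~0.6 MiB
print(f"decoded: {image_bytes / 2**20:.1f} MiB")     # ~54.0 MiB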

Then, the code goes:

x_samples_ddim = torch.clamp((x_samples_ddim + 1.0) / 2.0, min=0.0, max=1.0)
del samples_ddim
if shared.cmd_opts.lowvram or shared.cmd_opts.medvram:
    lowvram.send_everything_to_cpu()
devices.torch_gc()
if opts.filter_nsfw:
    import modules.safety as safety
    x_samples_ddim = modules.safety.censor_batch(x_samples_ddim)
for i, x_sample in enumerate(x_samples_ddim):
    x_sample = 255. * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
    x_sample = x_sample.astype(np.uint8)
    …

The image-space tensor is always transferred to the CPU: either at x_sample.cpu(), or inside modules.safety.censor_batch (which starts with def censor_batch(x): x_samples_ddim_numpy = x.cpu().permute(0, 2, 3, 1).numpy()).
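
A quick standalone check with a toy tensor (not webui code), showing that both routes produce the same channels-last CPU array, so building the decoded tensors on the CPU up front changes nothing downstream:

import numpy as np
import torch

# Toy stand-in for a decoded image batch: (batch, channels, height, width).
x = torch.rand(2, 3, 4, 4)

# Per-image route from the loop above: channels-last numpy array on the CPU.
per_image = np.moveaxis(x[0].cpu().numpy(), 0, 2)        # shape (4, 4, 3)

# Batch route from censor_batch(): same data, channels-last, also on the CPU.
batch_wise = x.cpu().permute(0, 2, 3, 1).numpy()[0]      # shape (4, 4, 3)

# Both routes end up with identical CPU arrays.
assert np.allclose(per_image, batch_wise)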

This means we can run the VAE over the batch sequentially, one image at a time, storing the results directly on the CPU. This avoids requesting more VRAM at the cost of a slightly longer final step.

My fix to the original code is:

x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
x_samples_ddim = torch.stack(x_samples_ddim).float()

I use tensor slices to make decode_first_stage() believe that we always have a batch of one image. Then I stack the resulting CPU tensors and cast them to float32. The subsequent code works as-is and doesn't require any other modifications.
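
For anyone unfamiliar with the slicing trick, here is a tiny standalone illustration with toy shapes (not webui code) of why samples_ddim[i:i+1] works where samples_ddim[i] would not:

import torch

# Toy latent batch: (batch, channels, height, width).
latents = torch.randn(4, 4, 96, 96)

# A slice keeps the batch dimension, so the decoder still sees a batch of one.
assert latents[2:3].shape == (1, 4, 96, 96)

# Plain indexing would drop the batch dimension.
assert latents[2].shape == (4, 96, 96)

# Taking [0] from each single-image result and stacking the CPU tensors
# rebuilds the full image-space batch, exactly as the fix above does.
decoded = [latents[i:i+1][0].cpu() for i in range(latents.size(0))]
rebuilt = torch.stack(decoded).float()
assert rebuilt.shape == (4, 4, 96, 96)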

Reasons for making this a direct change, rather than an option (either runtime or command-line):

  • The delay caused by sequential processing is already very small and barely noticeable.
  • It affects only the final stage of each batch. The more sampler steps you use, the smaller the relative share of time the VAE takes, so the overall delay impact shrinks.
  • If you use batch size = 1, nothing changes for you.
  • The delay is proportional to the batch size, which can't be larger than 8.
  • People with low-end cards often cannot use batches at all, but with this fix they might.
  • The largest theoretical delay falls on those who already have a lot of VRAM and use the largest batch sizes, but their GPUs are also fast, so the delay itself will be shorter!

With this fix, I am able to run 768x768 with batch size = 8 on RTX 3060 with 11.7/12 Gb VRAM (--xformers --no-half --no-half-vae); and 512x512 with batch size = 2 on 940MX with 3.7/4 Gb VRAM (--medvram).

aleksusklim avatar Nov 28 '22 12:11 aleksusklim

Issue linking not detected for some reason, so here is the list again:

https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3059
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2561
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2082
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3462

aleksusklim avatar Nov 28 '22 12:11 aleksusklim

This is so amazing. Cooking a batch of 20 at 768x768 at 11250MiB / 12288MiB VRAM usage. I think I could push it to 23. cmdline: --listen --xformers --no-half-vae --deepdanbooru

USBhost avatar Nov 29 '22 06:11 USBhost

This is so amazing. Cooking a batch of 20 at 768x768 at 11250MiB / 12288MiB VRAM usage

For some reason, now that I double-checked, I see 7.5/12 Gb at batch size = 8 for 768x768. Strange. Why did I clearly see 11.7 yesterday? I thought it might be related to the Tiling option (which actually lowers VRAM usage), so I toggled it on and off; I also changed the VAE in settings (used the VAE from 1.5 on the 2.0-v model). But after a clean restart, I now see 7.5 Gb, not 11.7 Gb, hm-m… Probably I had a memory leak after the CUDA errors.

Also, I didn't know it was possible to change the batch size limit. By adjusting ui-config.json? BTW, how big was your maximum usable batch without this fix?

UPD: Oh, I removed --no-half, that's why!

aleksusklim avatar Nov 29 '22 07:11 aleksusklim

BTW, how big was your maximal usable batch without this fix?

I could do 3 before without causing my system/YouTube videos to stutter; after the fix I maxed it at 8 and it's fine. 10GB 3080.

Update: All testing on 512x512, v1.5, Euler A, 16 steps. 14 is good, 15 starts touching shared memory (not much though). Testing with batch count > 1. Pushed it all the way to 32. No OOMs, just a bit more shared memory. (image below)

[image]

Update 2: I had 'show sampling steps' on and it was causing slowdowns. Without it, I can do batch size 32 (with a little bit of shared memory use), but 16 seems to be around the sweet spot for me (slightly faster).

This also allows me to increase batch size to 3 in Dreambooth! (image below)

[image]

leppie avatar Nov 29 '22 11:11 leppie

Also I didn't know that it is possible to change batch size. Adjusting ui-config.json?

Yeah, that's it. Also, idk what my max was before; at least 1/3 less. If no one tests it I'll do it when I get home.

USBhost avatar Nov 29 '22 14:11 USBhost

If no one tests it I'll do it when I get home.

I've tested on my 3060 again: SDv2.0, 768x768, with --xformers, without --no-half.

  • With --no-half-vae and without the fix, I get an error at size=6. The spike at the end is so high that VRAM partially swaps to shared memory even at size=2! At size=5 it swaps a lot and freezes heavily at the end (I have only 6 Gb of physical RAM though), noticeably wasting time.
  • Without --no-half-vae and still without the fix, I get an error at size=11. So it doubles the accepted sizes from 5 to 10. Still swapping to shared memory and lagging.
  • With the fix and with --no-half-vae too, I can generate at size=20 indeed! No swapping, no spikes. After size=21, it allocates something from shared memory for the whole inference process, making it considerably slower. An error is thrown only at size=24, right at the beginning of generation and not at the final step.

Considering the probable swaps to shared memory at the end of generation, I now see that my fix is really faster in all cases, at least for me!

aleksusklim avatar Nov 29 '22 22:11 aleksusklim

I tested this as well, and I am impressed: I went from a usual maximum of 9 or 10 images per batch to 16 with this new PR, which is the maximum on the interface. So I raised that maximum by adjusting the corresponding parameters in ui-config.json, and now I can generate 25 images per batch (at 512x512, with the 1.5 pruned model + VAE). I can't believe it. In fact there is still some VRAM left after that on my 2070 Super with 8 GB. I suppose I could do a couple more pictures per batch, but 25 gives me a nice 5x5 grid, and the next one (6x6) is clearly out of reach for me, so I'll leave it like that for now.

TLDR: Maximum images per batch went from 9 to 25 on an 8 GB Nvidia card.

AugmentedRealityCat avatar Nov 30 '22 10:11 AugmentedRealityCat

I went from a max of 3 to 15 at 512x512 on a 6GB RTX 2060. Really impressive!

BetaDoggo avatar Dec 02 '22 21:12 BetaDoggo

I just did a batch of 50 at 512 on a 2070 Super with 8 gigs, and a batch of 18 at 768 after that. You're an absolute legend, good sir, I love you!

BlastedRemnants avatar Dec 03 '22 06:12 BlastedRemnants

time to update

2blackbar avatar Dec 03 '22 06:12 2blackbar

With the fix and with --no-half-vae too, I can generate at size=20 indeed! No swapping, no spikes.

Have you tested the fix without --no-half-vae? Also, did it fix the known issue where using the VAE without --no-half-vae generates black images after certain steps on GPUs like the 3080?

ice051128 avatar Dec 03 '22 07:12 ice051128

Have you tested the fix without --no-half-vae?

I didn't, because without --no-half-vae the spike at the end of generation is lower anyway. But in my case, even with --no-half-vae, I run out of VRAM at size=21-24 not at the end, but right from the beginning of generation, so it should not make any difference. Besides, --no-half-vae is better at fighting the black images you're referring to. And it looks like this option now costs nothing at all.

aleksusklim avatar Dec 03 '22 09:12 aleksusklim

😯 Now that's a Christmas present: from batch size 1 all the way up to batch size 8 on my 2GB GPU, thank you very much 🎅

jn-jairo avatar Dec 03 '22 09:12 jn-jairo

--no-half-vae is better at fighting the black images you're referring to. And it looks like this option now costs nothing at all.

With the fix and --no-half-vae, I'm able to cook a larger batch size with my 3080 (10GB), but it seems to occasionally cause a black image after the first pass. Without the fix, it doesn't happen. Why is that?

ice051128 avatar Dec 03 '22 10:12 ice051128

@USBhost / @AugmentedRealityCat / @leppie / @BetaDoggo https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5409#issuecomment-1345371452

– Can anyone confirm the delay for batch size = 1? If so, is batch size = 2 also affected? If not, does my newly proposed guard code (applying the fix only for batches larger than one) fix the delay?

(Pinging everyone because the bug might be visible only in specific configurations.)
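
For reference, a rough sketch of the kind of guard I mean (illustrative only; the actual proposal is in the linked comment). The idea is simply to keep the original single-call decode when the batch holds one image, and use the sequential per-image decode otherwise:

# Illustrative sketch only. The real guard code is in the comment linked above.
if samples_ddim.size(0) == 1:
    # Single image: decode in one call, exactly as the original code did.
    x_samples_ddim = decode_first_stage(p.sd_model, samples_ddim.to(devices.dtype_vae))
else:
    # Larger batch: decode one image at a time, moving each result to the CPU.
    x_samples_ddim = torch.stack([
        decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu()
        for i in range(samples_ddim.size(0))
    ]).float()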

aleksusklim avatar Dec 10 '22 19:12 aleksusklim