
[Bug]: 20% reduction in performance for single image generation due to latest commit fix for VRAM spike

Open ice051128 opened this issue 2 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

Commit 67efee33a6c65e58b3f6c788993d0e68a33e4fd0 has led to roughly a 20% increase in processing time for single image generation, caused by a significant delay introduced at the final step. For a single 768x1024 generation it adds about 3s of delay: the actual processing of all steps takes only 16s, but the run ends up at 19s in reality, as shown in the console.

The commit should be reevaluated, since the performance penalty is far more significant than its author expected, while the benefit for larger batch sizes is minimal even with 10GB of VRAM at higher resolutions (e.g. 768x1024).

Hardware configuration: i9 10900K, RTX 3080 (10GB), 32GB RAM (CL18 4000MHz)
Launch arguments: --administrator --autolaunch --opt-split-attention --xformers --no-half-vae

Steps to reproduce the problem

  1. Generate the same image before and after the commit
  2. Record the time taken in each case
  3. Compare the results

What should have happened?

There should be no significant degradation in performance.

Commit where the problem happens

67efee33a6c65e58b3f6c788993d0e68a33e4fd0

What platforms do you use to access UI ?

Windows

What browsers do you use to access the UI ?

Microsoft Edge

Command Line Arguments

No response

Additional information, context and logs

No response

ice051128 · Dec 04 '22

I've also noticed a loss in performance after this commit that I didn't see in the original PR. The generation process seems to freeze for a few seconds at the end of image generation. It doesn't happen every time, but it seems to happen more often when the GPU is being used for something in the background, like playing a video on YouTube.

BetaDoggo · Dec 04 '22

Sorry, I didn't notice this issue in time (it would have been better if anybody had pinged me in https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/5165).

Still, my point is that for batch size=1 nothing should change. Nothing! If for some weird reason the new code causes an unknown delay, it can be guarded explicitly:

            actual_batch_size = samples_ddim.size(0)
            if actual_batch_size > 1:
                # decode the latents one at a time and move each result to RAM to avoid the VRAM spike
                x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(actual_batch_size)]
                x_samples_ddim = torch.stack(x_samples_ddim).float()
            else:
                # batch size 1: keep the original single-call decode path untouched
                samples_ddim = samples_ddim.to(devices.dtype_vae)
                x_samples_ddim = decode_first_stage(p.sd_model, samples_ddim)

What does everyone think? @ice051128, will you be able to try it? (I cannot test reliably myself, since I have low RAM and outdated PCIe, so my transfer speed between VRAM and RAM is comparatively low already.)

Also, it would be good to know the current behavior for batch size=2: does it struggle too or not?

aleksusklim · Dec 10 '22

What does everyone think?

Tried this one and didn't get spikes for batch size=1 or batch size=2 when generating images with resolution 768x768

            actual_batch_size = samples_ddim.size(0)
            if actual_batch_size > 1:
                x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(actual_batch_size)]
                x_samples_ddim = torch.stack(x_samples_ddim).float()
            else:
                samples_ddim = samples_ddim.to(devices.dtype_vae)
                # note the extra .cpu() here compared to the snippet above
                x_samples_ddim = decode_first_stage(p.sd_model, samples_ddim).cpu()

(screenshot attachment omitted)

fractal-fumbler · Dec 11 '22

with resolution 768x768

With resolutions higher than 768x768, lag spikes appear again for single image generation (for example, 832x832).

fractal-fumbler · Dec 11 '22

I get different but interesting results. I have a 4090 on Linux and tested the timing of 80 images at (count x size) of 80x1, 40x2, 20x4 and 10x8. Settings: xformers with no other command line options, sd_v1-4, euler_a, steps=20, cfg=7, 512x512. I compared the OLD two lines in question vs the NEW two lines (decode + stack); times are in seconds. Note the similar total times for process_images_inner(), indicating I'm not reproducing the problem, but I do see a big difference in the decode and torch_gc times, which appear to cancel each other out.

    OLD   shape   TOTAL    Sample   Decode   torch_gc
          80x1    151.22   133.82   1.25     7.59
          40x2    111.66   95.28    0.67     7.81
          20x4    88.78    71.32    0.37     8.91
          10x8    61.12    46.06    0.11     6.86

    NEW   shape   TOTAL    Sample   Decode   torch_gc
          80x1    149.38   132.09   8.19     0.57
          40x2    112.53   96.11    7.92     0.55
          20x4    86.84    71.02    7.89     0.18
          10x8    62.01    46.38    7.91     0.16

This issue, where the GC goes crazy in one case, reminds me of "garbage collector friendly programming" in Java. I'm curious whether it could be mitigated to knock another 10 seconds off the above. I really believe you should get rid of the tqdm stuff and instead print more precise timings for the major steps of the processing.

Finally, I'm running the PyTorch pre-release version 1.14.0.dev20221206+cu117, but I am NOT doing the model compile() in the above tests. PyTorch v2, aka 1.14.0dev, has that capability.
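To be concrete about the "print more precise timings" suggestion above, I mean something like this (an untested sketch; timed_stage is my own name, nothing from the webui code):

    import time
    from contextlib import contextmanager

    import torch

    @contextmanager
    def timed_stage(name):
        # Flush queued GPU work so the interval we measure belongs to this stage only.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        yield
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        print(f"{name}: {time.perf_counter() - start:.2f}s")

    # Hypothetical usage inside process_images_inner():
    # with timed_stage("decode"):
    #     x_samples_ddim = decode_first_stage(p.sd_model, samples_ddim)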

aifartist · Dec 12 '22

80x1 – was this 80 batches of size 1, or 1 single batch of size 80?

Your timings mean that the total delay is neither rising nor falling with the sequential fix, but what about VRAM? Also, there's evidence that generating at a higher resolution (than the default one) might cause larger spikes, even with the fix.

This could be related to the main data still being in VRAM when we run the VAE. Actually, I tried to move everything to RAM before the sequential loop, but when I tried it at 512x512 (or 768x768 with the 2.0 model) I didn't find it to make any difference for memory usage (and obviously it won't be faster anyway). But I didn't try it at larger resolutions!

The code was roughly this:

            # remember where the latents live, then move them to RAM to free VRAM before the VAE
            actual_device = samples_ddim.device
            samples_ddim = samples_ddim.cpu()
            #devices.torch_gc()
            # decode one latent at a time, moving each slice back to the GPU only for its decode
            x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(device=actual_device, dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
            x_samples_ddim = torch.stack(x_samples_ddim).float()

I don't think torch_gc is needed here; I just wanted to see how VRAM drops before the VAE.

aleksusklim · Dec 12 '22

If 80x1 takes 149 seconds and 10x8 takes only 62.01 seconds, and I say batch (count x size), then yes, it is a count of 80 batches of 1 image per batch. Also, A1111 has a max batch size of 8, so how can I even do a single batch of 80 images?

I will try to help out on this issue even if I duplicate effort. I want to learn this stuff and can debug anything. For instance, why does a batch size of 1 have a large difference in execution time for the two different decode calls in the old and the new versions? Today I will figure that out, but I'm just getting started for the day.

GPU memory monitoring: I am still new to this NN/AI/GPU stuff but have 40+ years as a senior programmer. What is the best Linux tool to monitor GPU usage? Particularly the high-water mark (HWM) of usage from the start of the decode call to the end. There is some verbose chart output I've called directly from Python before, but it is way too much. I just want to query used/free and perhaps the HWM as 3 ints returned from some API call or calls.
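Roughly, what I'm after would look like this (an untested sketch using torch's own counters rather than an external tool; nvidia-smi would also work for a quick look, but not for a per-call high-water mark):

    import torch

    def gpu_mem_snapshot(device=0):
        """Return (bytes held by torch, bytes free on the device, peak bytes since last reset)."""
        free_b, total_b = torch.cuda.mem_get_info(device)    # driver-level free/total
        used_by_torch = torch.cuda.memory_allocated(device)  # tensors torch currently holds
        peak = torch.cuda.max_memory_allocated(device)       # high-water mark since last reset
        return used_by_torch, free_b, peak

    torch.cuda.reset_peak_memory_stats(0)   # start a fresh high-water-mark window
    # ... run the decode call here ...
    print(gpu_mem_snapshot(0))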

aifartist · Dec 12 '22

To figure out why decoding a whole batch behaves differently from decoding one image at a time within the batch, I broke the per-image decode down into:

    x_samples_ddim = []
    for i in range(samples_ddim.size(0)):
        s1 = decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))
        s2 = s1[0]
        s3 = s2.cpu()  # This isn't done in that code before the change
        x_samples_ddim.append(s3)

and timed each of the four steps (not shown). The extra time comes from the copy to the CPU, which wasn't done in the old code. If you don't copy to the CPU, this decode part is fast, but then the later gc step is slow.

But ultimately I learned something important: timing a single torch line can be quite misleading, as most(?) operations are done asynchronously. A later operation might stall, not because it is slow, but because it has to wait for earlier steps to complete. For instance, if you copy to the CPU, then all earlier processing must be complete. See torch.cuda.synchronize().

But this means my efforts today were wasted in terms of finding any true root cause. So now that I know this, I want a real reproduction where the overall processing is really 20% slower. If I can repro that, then I can find where the extra time is REALLY coming from, now that I know how to correctly time a torch.cuda operation. Tomorrow is another day.
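For reference, the pattern I plan to use for timing a single torch.cuda operation is roughly this (my own sketch, using CUDA events so the asynchronous queue doesn't skew the numbers):

    import torch

    def time_cuda_op(fn, *args, **kwargs):
        """Time one CUDA operation, accounting for asynchronous execution."""
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()   # drain earlier queued work so it isn't billed to this op
        start.record()
        result = fn(*args, **kwargs)
        end.record()
        torch.cuda.synchronize()   # wait for the op itself to finish
        return result, start.elapsed_time(end)   # milliseconds

    # Hypothetical usage, timing just one VAE decode:
    # out, ms = time_cuda_op(decode_first_stage, p.sd_model,
    #                        samples_ddim[0:1].to(dtype=devices.dtype_vae))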

aifartist · Dec 13 '22

A1111 has a max batch size of 8 so how can I even do a single batch of 80 images?

There is ui-config.json; you can set any limit on batch sizes via "txt2img/Batch size/maximum" and "img2img/Batch size/maximum".

The extra time comes from the copy to the CPU, which wasn't done in the old code.

Wrong. It was copied to the CPU anyway, further down in the code: either directly, or in the NSFW-detection branch.

So now that I know this I want a real reproduction where the overall processing is really 20% slower.

There is no such drastic speed reduction in "normal" processing (otherwise I would have noticed it too). It might be related to some other factor, like:

  • highres. fix
  • img2img / face restoration / nsfw
  • larger resolution (more than 2x from initial)
  • GPU background load
  • RAM swapping to pagefile
  • VRAM swapping (to "shared memory")
  • Operating system (I'm on Windows 10)

aleksusklim · Dec 13 '22

Thanks for the reply

On the max batch size: I didn't realize that. For my own curiosity I'll see how many images my 4090 can do in a single batch.

On the to-CPU issue: the old two lines were replaced with two other lines, and the bug submitter claimed this was the slowdown. I was pointing out that the old way did NOT move the data to the CPU. Of course, that would eventually happen, and what's new to me is that it has to wait for any asynchronous work to complete. This complicates identifying exactly where any slowdown would be. But now that I know about it, I can deal with it.

Yes, perhaps normal processing doesn't repro this. However, once the decode work has completed, with either the old or the new way, the x_samples_ddim result should be the same. The only difference is that it was copied to the CPU earlier, which is why I said the old way (referring to the replaced lines) did NOT do that.

Having said that, maybe I should just look at which functions might take longer if done on the CPU, which my simplistic test doesn't cover: NSFW, faces, color correction, etc. I'll look tomorrow.

It is possible that no fix will be perfect, given the desire to eliminate the VRAM spike for small-VRAM setups.

Crazy idea: if we have batch size > 1, could we leave everything on the GPU until the last moment, and do the decode, NSFW check, and other post-processing one image at a time, looping back to decode the next image and repeating? The original fix just did 8 decodes for batch size=8 instead of one, with the side effect that all 8 images are now on the CPU, which can perhaps cause a slowdown. In other words, do single-image processing for EVERYTHING the post-processing does before pulling the next image off the GPU. Perhaps that isn't clear, but I'm falling asleep. :-)

Currently we break the batch into per-image pieces to decode, put it back together as a single tensor, and then keep breaking it down into single images again to do several other things. Seems convoluted. Instead, have a single for loop over the entire post-processing, from the decode through the output_images.append(image) call. Easier done and tested than described in words.
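Roughly, in the style of the snippets above, I mean something like this (untested; postprocess_one is a placeholder for whatever webui's per-image steps actually are, not a real function):

            output_images = []
            for i in range(samples_ddim.size(0)):
                # decode a single latent; only this slice needs to touch the GPU for the VAE
                sample = decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu().float()
                # placeholder: run ALL per-image post-processing here (NSFW check, face
                # restoration, color correction, tensor -> PIL conversion) before moving on
                image = postprocess_one(sample)
                output_images.append(image)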

aifartist · Dec 13 '22

The only difference is that it was copied to the CPU earlier

Personally, I don't really know whether it frees GPU memory instantly after the .cpu() call. As I understand it, tensor.cpu() does not move the tensor; it creates a brand new one. So when will the old tensor be freed, then? It will be freed when it goes out of scope for Python itself. This is why it is not possible to free a tensor inside a function, since the caller's code will hold its reference until the function returns.

But since in my other proposal above I do X = X.cpu(), the memory should be de facto freed (and simply usable for any other purpose afterwards). Also, I didn't know that .cpu() stalls the pipeline right away! Nor do I know whether one .cpu() on the whole tensor is noticeably faster than several .cpu() calls on its slices. One thing I'm sure of, though: if the tensor is already in RAM, its .cpu() will be a no-op.
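A quick standalone way to check that X = X.cpu() reasoning would be something like this (my untested sketch; it needs a CUDA GPU):

    import torch

    x = torch.randn(8, 4, 96, 96, device="cuda")   # a latent-sized batch on the GPU
    print(torch.cuda.memory_allocated())            # non-zero: the tensor lives in VRAM

    x = x.cpu()                  # new CPU tensor; the old name no longer refers to the CUDA one
    torch.cuda.synchronize()     # not strictly needed (.cpu() is synchronous by default), just being explicit
    print(torch.cuda.memory_allocated())  # drops, since the CUDA tensor has no other references
    # Note: the memory goes back to PyTorch's caching allocator, not necessarily to the OS.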

the bug submitter claimed this was the slowdown

I believe it was not a "new" slowdown; it was just a consolidation of the overall slowdowns, now in one place. Your timings confirm this, right?

Currently we break the batch into per-image pieces to decode, put it back together as a single tensor

Oh, you think that the whole following processing will not benefit from larger batches? Seems reasonable, because it doesn't use CUDA anywhere down the road. Does it? I didn't touch the following code because I just wanted my fix to be small and simple. (At first I tried to implement decode_first_stage_sequential(), but when I experimented with freeing samples_ddim, I realized that it wouldn't be possible to free it there, so I decided to change the code in place.)

Will the proposed sequential pipeline be worth the time it saves? I can see only one possible reason why it could save time: avoiding swapping a lot of RAM at the end of each batch. So if the user already has enough memory, they will not feel any benefit from sequential image saving (especially on an SSD/NVMe). But that's just my untested guess…

aleksusklim · Dec 13 '22