stable-diffusion-webui
[Bug]: M1 Mac: Performance degrades severely after 1 generation
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
Did a fresh install this morning to make sure I was running the latest and greatest. Generating a 768x768 image at 20 steps with Euler_a using the 2.0 model. Launched with the --medvram argument but got similar results without it. Batch count and batch size both at 1. The first batch generated in 1:56; the second batch, with identical settings, took 17:04. Cancelling and relaunching the Terminal command does seem to get it back to the initial performance, but only for one batch, thus necessitating quitting and relaunching every time.
Steps to reproduce the problem
- Launch web-ui on an M1 Mac.
- Load the 2.0 model and set X/Y to 768, sampler to euler_a, cfg to 7, steps to 20.
- With batch count and batch size both at 1, generate an image. Should take a reasonable amount of time.
- Generate a second image. Should take an order of magnitude more time or more.
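The steps above can also be timed outside the browser, which rules out any UI overhead. The sketch below assumes the webui was launched with the `--api` flag so the `/sdapi/v1/txt2img` endpoint is available; the prompt is just a placeholder, and the settings mirror the report (768x768, Euler a, CFG 7, 20 steps):

```python
import json
import time
import urllib.request

# Default webui address; requires launching with --api
API_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"

def build_payload(width=768, height=768, steps=20, cfg_scale=7):
    """Build a txt2img request matching the settings in this bug report."""
    return {
        "prompt": "a photo of an astronaut riding a horse",  # placeholder prompt
        "width": width,
        "height": height,
        "steps": steps,
        "cfg_scale": cfg_scale,
        "sampler_name": "Euler a",
        "batch_size": 1,
        "n_iter": 1,
    }

def timed_generation(payload):
    """POST one generation request and return the wall-clock time in seconds."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

# Usage (with the webui running):
#   p = build_payload()
#   for i in range(1, 4):
#       print(f"generation {i}: {timed_generation(p):.1f}s")
```

If the bug reproduces, the second and third timings should be roughly an order of magnitude larger than the first.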
What should have happened?
The generation time for the second batch should be more or less the same as the time for the first batch.
Commit where the problem happens
44c46f0
What platforms do you use to access UI ?
MacOS
What browsers do you use to access the UI ?
Mozilla Firefox
Command Line Arguments
--medvram
Additional information, context and logs
For the purposes of this bug report I launched with --medvram, mostly to see if it would improve performance as suggested in the Apple Installation article on the wiki. With this argument, the first generation finished in approximately half the time, but the second generation showed similar results overall. (I don't actually know how long the second generation would take without the --medvram argument; the UI was estimating 30 mins, and I cancelled it after 20.)
Attempted a third generation. It is currently reporting step 5/20 with 3:46 elapsed and 10:50 estimated, so I expect similar results. EDIT: took 13:54.
Additional information: This doesn't appear to be limited to the 2.0 model. Experimentally I tried cancelling the Terminal process, adding the 1.4 model via a symlink, and relaunching. With identical settings aside from selecting the 1.4 model, I'm getting identical performance: first batch took 2:34, second batch is currently at 5/20 after 3:45 with an ETA of 17:41.
Using this software on an ARM-based Mac is going to give you trouble, guaranteed. Not only are the newer Macs ARM-based (RISC, which is most often used in network-attached storage devices, cars and coffee makers), but these CPUs also use Apple proprietary GPUs, which aren't really supported by this software (basically you need CUDA to make it work without issues).
So get yourself an older Intel mac, put in a Geforce card, or wait until someone magically makes this work on a computer that looks best on a fancy desk. (Sorry, biased, but Apple decided to screw their fans over once more).
And yet...I didn't have this problem even with this software as recently as 11 days ago. Nor do I have this problem with literally any other implementation I've tried on this machine. So shit on Apple all you want, but I don't think this is an Apple problem, per se.
It is: trouble with the available compiled modules for Python. AI-specific modules that are made to work on nVidia hardware, which Apple no longer uses, as they were thrown in the garbage bin by nVidia (for whatever reason, but likely because Apple wanted them to make proprietary stuff for them).
One of them is torch, which is compiled specifically for CUDA, an nVidia thing, but there are a number of modules giving Mac M1/M2 users issues. Really easy out: use compatible hardware, or ask Apple to make their own (proprietary, of course) Apple Diffusion.
Seriously though. A number of Mac users have reported issues where it worked for them earlier. I have no idea why that is; it might be memory. But I imagine it has to do with the Apple GPU, which cannot handle CUDA instructions, or tries to but fails.
Did you try setting up the UI to render CPU-only to see if that works?
The only other suggestion I have (seriously) is as mentioned: you need supported hardware, and basically torch only supports nVidia. Other than that, try to do a render with the base stable diffusion repo and check if that works. If so, it is this UI that fails, but I am pretty sure they don't really aim to support the ARM-based Macs, as the architecture is completely different and many Python modules are likely not ready to support them.
then why didn't I have this issue with this software before it integrated support for SD 2.0?
why don't I have this problem with any other implementation (now up to 4 and counting) that I've tried on this machine?
Your fiery rhetoric neither fits the facts nor gets any closer to a solution to a problem that is unique to this implementation, so kindly shut up if you can't contribute something useful.
I have the same problem: the 2nd generation is 4 times slower, same settings, different batch. First generation: 20/20 [01:44<00:00, 5.21s/it]; after: 17/20 [05:46<00:58, 19.65s/it]
Working on Mac requires a lot of memory, it seems. Perhaps even more on the later release.
But torch is only built for CUDA and (it seems) AMD GPUs, so I guess you need the memory to run Python, the UI and its modules, AND then you need memory for generation.
When I run the UI on Windows with 32 GB ram, I have maybe 35% left when it has started. Loading a model, hypernet, dream etc will take more. It transfers to the GPU as needed, freeing to the ram on the PC.
Since torch is not really built for the Mac gpu, perhaps that's the issue. Slowing down due to high memory usage. I agree that supporting SD2 might have been a mistake at this point since I doubt a lot of people will use it, due to their censoring.
Did you try to check ram and cpu usage while doing first and second generation?
Open activity viewer and look at your memory pressure. I have a 32GB RAM Studio here and it's pretty peppy (for a Mac running SD) but once the memory usage gets beyond the physical RAM, it will start swapping to the relatively fast onboard SSD, but you get a serious performance hit in the meantime. I can get up to 40-some odd GB in use before that happens.
Close all other applications and let it run. You can free up GPU RAM by turning off GPU acceleration in your browser.
The unified memory is great as in it will let you continue to run SD even when running out of memory, but you will get dinged for it performance wise.
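To watch for that swap-driven slowdown from the terminal while a batch runs, a small polling script can help. This is just a sketch, not part of the webui; it assumes the third-party psutil package is installed (pip install psutil) and degrades gracefully if it isn't:

```python
import time

def snapshot():
    """Return (ram_used_fraction, swap_used_mib), or None if psutil is missing.

    psutil is a third-party package (pip install psutil); this helper is an
    illustration for this thread, not something the webui ships.
    """
    try:
        import psutil
    except ImportError:
        return None
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    return vm.percent / 100.0, sw.used / 2**20

def log_during_generation(interval_s=5.0, samples=12):
    """Run this in a second terminal while batches are generating.

    A jump in swap usage between the first and second batch would point at
    the memory growth suspected in this thread.
    """
    for _ in range(samples):
        snap = snapshot()
        if snap is None:
            print("psutil not installed; use Activity Monitor instead")
            return
        ram, swap_mib = snap
        print(f"RAM {ram:.0%}  swap {swap_mib:.0f} MiB")
        time.sleep(interval_s)
```

If RAM sits near 100% and swap climbs steadily during the second batch but not the first, that supports the memory-pressure explanation above.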
I have read that's how it is on Mac. You need a lot of RAM; probably twice as much is good. It loads everything in RAM and swaps nothing to the GPU. On Windows it uses all available memory, but when loading new stuff it uses RAM, and 32 GB of system RAM is basically a minimum in my experience if you have 12 GB of GPU RAM, or you get out-of-memory errors. It might be due to it using everything you have, and "everything" on Mac might include virtual memory, which it won't use on Windows.
I have done all this. I've also done all this while using other implementations. Consistently Auto1111 is the only implementation with this problem. I'm not disputing that the unified memory on the Mac causes issues, I'm stating that from all available information, something Auto1111 specifically is doing is causing this lack of optimization. It doesn't happen in InvokeAI. It doesn't happen in DiffusionBee. It doesn't even happen in Draw Things, which is an iPad app sort of retrofit to run on the computer. It's a problem that is unique to Auto1111.
That may be true, but the way the Mac uses resources is the same for all of them. I read somewhere that with torch you need to use CPU memory for everything, and then 32 GB is not too much, considering you likely only have 24 or so available once the app is running. You could install Linux in parallel, which could give you better memory usage, but still, the GPU isn't really supported.
I don't see this issue running any checkpoint with the default 512x512 size. I've had this run for 12+ hours generating images with no performance hits since a bit over two weeks ago, when the big M1 update in Automatic came out. Prior to that I had issues with it crapping out with a semaphore issue or some sort of fault.
Here is the log from what I have had it do this evening:
Weights loaded.
100%|███████████████████████████████████████████| 40/40 [00:38<00:00, 1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00, 1.05it/s]
[...the remaining ~69 batches are identical: 40/40 in 00:37-00:38 at 1.03-1.06 it/s...]
Here it is with 768x1152, hiresfix:
100%|███████████████████████████████████████████| 20/20 [00:24<00:00, 1.22s/it]
100%|███████████████████████████████████████████| 20/20 [04:32<00:00, 13.62s/it]
Total progress: 100%|███████████████████████████| 40/40 [05:16<00:00, 7.91s/it]
100%|███████████████████████████████████████████| 20/20 [00:24<00:00, 1.24s/it]
100%|███████████████████████████████████████████| 20/20 [04:11<00:00, 12.57s/it]
Total progress: 100%|███████████████████████████| 40/40 [04:52<00:00, 7.31s/it]
100%|███████████████████████████████████████████| 20/20 [00:23<00:00, 1.20s/it]
100%|███████████████████████████████████████████| 20/20 [04:47<00:00, 14.36s/it]
Total progress: 100%|███████████████████████████| 40/40 [05:24<00:00, 8.11s/it]
100%|███████████████████████████████████████████| 20/20 [00:24<00:00, 1.21s/it]
100%|███████████████████████████████████████████| 20/20 [04:45<00:00, 14.29s/it]
Total progress: 100%|███████████████████████████| 40/40 [05:23<00:00, 8.08s/it]
This is just big enough to get it to swap slightly, but no slowdown (the last two are slightly longer due to my opening and using Safari). If I have this run all night, it may eventually give me an issue. I'll let it go for a few hours and see what it does. I'll update if I see anything.
EDIT: Came back after a while, average time is ~9.5 sec/iteration.
I'm on M1 Pro, 16 gb ram. Here's my results running images today. Euler A, CFG 7, highres fix, 768x1024, 20 steps. When the "fast" ones in the results below finish, there's no image. The image only shows up from the super slow ones. Now I realize it takes longer to make bigger images, but this seems ridiculous. Please correct me if I'm wrong. Granted, being able to produce amazing works of art that fast is wonderful, but it does make it take a long time to cherry pick from hundreds of results.
100%|████████████████████████████████████████████████████████████████| 20/20 [01:04<00:00, 3.23s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [43:02<00:00, 129.12s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:03<00:00, 3.17s/it]
100%|█████████████████████████████████████████████████████████████| 20/20 [1:27:08<00:00, 261.45s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:01<00:00, 3.09s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:58<00:00, 119.93s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00, 3.04s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:30<00:00, 121.51s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00, 3.01s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:09<00:00, 120.49s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00, 3.05s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:35<00:00, 121.76s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:01<00:00, 3.06s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:06<00:00, 120.32s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00, 3.04s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:29<00:00, 118.47s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:02<00:00, 3.11s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:59<00:00, 119.99s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:01<00:00, 3.05s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:19<00:00, 120.97s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:02<00:00, 3.11s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:04<00:00, 120.22s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [00:59<00:00, 3.00s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:03<00:00, 120.16s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [00:59<00:00, 2.98s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:32<00:00, 118.60s/it]
Those speeds (the slow ones) correspond with the time it takes to generate on cpu only on Windows - with a CPU that is about four generations old.
Knowing it's unlikely that the Apple GPU supports torch, I guess this means that Apple users should try it on Linux, or get capable hardware.
Here's my results for the same. I have 32GB of RAM:
To create a public link, set `share=True` in `launch()`.
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:20<00:00, 1.03s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [03:57<00:00, 11.86s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [04:27<00:00, 6.70s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:22<00:00, 1.13s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [04:26<00:00, 13.33s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [04:59<00:00, 7.49s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:22<00:00, 1.15s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [04:54<00:00, 14.74s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [05:26<00:00, 8.16s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:22<00:00, 1.10s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [04:45<00:00, 14.30s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [05:23<00:00, 8.08s/it]
What I noticed on my machine is that the Python process uses slightly more than 23GB of RAM with those settings. I'd think yours with 16GB is probably swapping, which explains the crazy long 2nd-stage image processing. The first pass goes quickly because, if you have the firstpass sizes set to 0,0, it renders a 512x512 image and then, IIRC, scales it up to the desired resolution on the 2nd pass.
EDIT: I did some more experimentation. Since my machine has double the RAM of yours, I ran the same thing again with double the image sizes. Here's the results:
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:38<00:00, 1.93s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [26:31<00:00, 79.56s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [28:15<00:00, 42.38s/it]

As you can see with these settings, the performance went into the toilet because it's swapping like mad. My swapfile fluctuated between 12 and 20GB with 100% RAM utilization while rendering this image. In fact, the 768x1024 size you mentioned earlier is very near the largest size I can do on this 32GB machine without it swapping.
cooperdk: While CUDA is not supported, the MPS support in SD is definitely using the GPU; I see steady 80-90+% utilization there while the CPU bounces around from 33% to 90%.
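For anyone who wants to confirm which device backend their torch build is actually using (MPS on Apple Silicon, rather than falling back to CPU), a minimal check looks like the following. It assumes torch 1.12 or later, which is when MPS support landed, and guards against older builds:

```python
def detect_backend():
    """Report which torch device backend is available on this machine.

    A small diagnostic sketch: on Apple Silicon with a recent torch build
    this should report "mps"; a "cpu" result would explain very slow
    generation times.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # torch.backends.mps only exists on torch >= 1.12
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

# Usage:
#   print(detect_backend())
```

A "cpu" result on an M1 would match cooperdk's CPU-only timing comparison below; an "mps" result matches the GPU utilization I see.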
Thanks for running tests and sharing for comparison.
Edit: After running another batch at 640x960, instead of averaging in the 120s, I was somewhere more around the 20s. Slightly smaller, but significantly faster.
I bought my Studio a month before SD came out. If I knew then what I know now, I'd have gotten a 64GB unit. I'm not majorly upset about it, as I usually make images as experimentation, so 512x512 and 512x768 (and thereabouts) is perfectly fine for me. There's an upper-limit bug in the M1s too, as I have tried letting it make larger images, but once you get above 1024x1024, you hit another bug that crashes the program - see #5278 for that one.
SD on M1 came a long way since it was first released. I'm really hoping there's some good optimizations that can be wrung out of it over the next year.
I have the earliest model M1 (16 GB). I tried the v2.1 512 model and removed the --medvram argument, and it seems to be working 'normally' again (like before, with the 1.5 models).
My command line settings are: --listen --no-half --use-cpu interrogate --skip-torch-cuda-test
It ignores the skip CUDA test, but does the rest of them.
You mentioned earlier you don't see the interim image. If you go to settings and set the value for "Show image creation progress every N sampling steps" to 1 (most frequent) or some other number, it will show the image after each N iteration steps. There is a performance penalty for this, but I find it minimal (maybe a few sec per image). I keep mine at 1 so when I generate that lovely supermodel and it turns out to be a horrid lobstrosity, I can interrupt the image generation and fix the prompt without waiting for it to finish first. This can be a huge time saver.
Apple users should try it on Linux, or get capable hardware.
cooperdk, do you seriously have nothing better to do with your life than dunk on hardware in the specific thread created to troubleshoot for it? I'm truly sorry your soul is that empty, but please shut up if you have nothing useful to contribute.
Same issue here, several months after this thread was created. I didn't have any issues for the 14 or so days up until 2-3 days ago, when A1111 suddenly began crawling. My machine got hot for the first time, so I figured perhaps it was throttling, and I stopped generating for a while. That was 3 days ago, but ever since, I can no longer do more than 2-3 generations before everything grinds to a halt.
I've tried different browsers, no difference, same issue, fast at first and then grinds to a halt. I've tried restarting (of course), I've tried reinstalling, no change.
I don't know what it is but it is making SD unusable for me now, which is sad, because I really want to use it like I have for the past 14 days.
Here is an example of the times I get: https://i.imgur.com/34IZISm.png. The message close to the start is me activating ControlNet openpose, but I've tried without activating it, it doesn't matter. After a few generations my instance grinds to a halt and I can no longer generate properly.
I've also tried different instances and this only seems to happen on A1111. @twopiearr did you guys find a solution to this? It's been a few months now after all.
Thank you!
@ptppan see an incredibly helpful thread here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/5461#discussioncomment-5087953
and no anti-apple bigots, either!
Haha awesome, thank you! Did you find anything in particular that helped you with the issue you described in this thread? Because I am having a similar issue. Looking through that thread, I am not sure what could actually solve this issue, and I'm not seeing any comments by you there? Did you do something recommended in that thread in particular that helped you with your issue?
I am actually unsure if I actually have an issue now or if it's just the usage of LoRAs that slow things down (naturally), or if it perhaps is ControlNet that runs slow on Mac, or whatever it could be. Really confused about this issue, as it seems so vague. But love to know if you followed a particular advice in that thread? Thank you again for providing it!
Edit: Nvm, stupid me, just saw your posts in the thread. Thank you so much, I'll proceed from there.
Same issue here. Since last Sunday morning: it was able to generate a picture in 7 minutes, but it suddenly slowed down by four times. However, there is one detail I noticed. A few days ago, when everything was normal, I exited the webui with Ctrl+Z and closing the terminal closed it directly; but since this started happening, closing the terminal prompts me that Python is still running and asks if I'm sure I want to close it.

After Ctrl+Z, the Python process still occupies 30 GB of memory and has not released it.
the Python version is Python 3.10.10 (main, Feb 8 2023, 05:34:50) [Clang 14.0.0 (clang-1400.0.29.202)]
Python does not shut down just because you close the terminal window. You have to stop the process itself: Ctrl+C interrupts it, while Ctrl+Z only suspends it in the background, which would explain the memory staying allocated. You should also be able to kill it from your process list.
But the web ui was not made to run on Mac. You really should get proper gear to use it.