
[Feature Request]: Use ram as "spare gpu memory"

Open DuckersMcQuack opened this issue 2 years ago • 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What would your feature do?

So, Windows's default "shared video memory", as I learned recently, was in fact my Gen 4 NVMe drive, which is dreadfully slow compared to, say, RAM.

I want to request a feature where we can allocate X amount of RAM to serve as spare video memory.

Proposed workflow

Preallocate, say, 15-20GB for those of us with lots of RAM. Whenever video memory is used up, generation would slow down because the data is no longer in fast on-GPU memory, but it would not be as slow as offloading to NVMe. This would let us reach "near" Quadro amounts of video memory, though of course not Quadro speeds, and it would allow much larger images at the cost of generation speed because of the much longer path the data has to travel.
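
A minimal sketch of what this first method could look like at the framework level, assuming PyTorch (which the webui is built on); the tensor size and the placeholder workload are made up for illustration:

```python
import torch

# Hypothetical sketch: keep a large tensor in page-locked (pinned) system RAM
# and copy it into VRAM only while it is needed, then drop the GPU copy.
# Pinned memory lets the host->device copy run asynchronously over PCIe,
# which is the closest framework-level analogue of "RAM as spare GPU memory".

spare = torch.empty(512 * 1024**2, dtype=torch.float16, pin_memory=True)  # ~1 GB staged in RAM

def use_on_gpu(host_tensor: torch.Tensor) -> torch.Tensor:
    gpu_copy = host_tensor.to("cuda", non_blocking=True)  # RAM -> VRAM over PCIe
    result = gpu_copy.sum()        # placeholder for real work on the data
    del gpu_copy                   # release the VRAM as soon as the work is done
    torch.cuda.empty_cache()
    return result.cpu()
```

The repeated copies over PCIe are exactly where the generation-speed trade-off described above comes from.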

Or, if the first method isn't possible: instead of loading the model into the GPU, load the model into RAM and then pull whatever data it needs from the model into VRAM on demand. That would offload the GPU by 2-7GB per model.
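
A rough sketch of this second method, again assuming PyTorch; the two Linear layers below are stand-ins for real UNet/VAE/text-encoder modules. This is conceptually similar to what the webui's own --medvram/--lowvram options already do by shuffling modules between RAM and VRAM:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: keep the whole model in system RAM and move only the
# sub-module that is about to run onto the GPU, then move it back. Steady-state
# VRAM use stays small at the cost of extra transfers per step.

model = nn.Sequential(                     # stand-in for a real diffusion model
    nn.Linear(4096, 4096),
    nn.Linear(4096, 4096),
).to("cpu")                                # the full model lives in RAM

def run_with_offload(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    for layer in model:
        layer.to("cuda")                   # pull just this layer's weights into VRAM
        x = layer(x.to("cuda"))
        layer.to("cpu")                    # push the weights back out to RAM
    return x

out = run_with_offload(model, torch.randn(1, 4096))
```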

Additional information

No response

DuckersMcQuack avatar May 25 '23 16:05 DuckersMcQuack

perhaps this will answer the question: https://superuser.com/questions/1545812/can-the-gpu-use-the-main-computer-ram-as-an-extension

van1027 avatar May 25 '23 22:05 van1027

I might be completely off with this, but doesn't the GPU load the models into VRAM to create something based on your input? Putting it into RAM would mean the GPU needs to pull it out of your RAM instead of your hard drive, but the end result would be the same: you would still be limited by VRAM.

I feel the only thing this would achieve is fewer reads on your hard drive by loading up a bunch of models into RAM.

cospking avatar May 26 '23 08:05 cospking

> I might be completely off with this, but doesn't the GPU load the models into VRAM to create something based on your input? Putting it into RAM would mean the GPU needs to pull it out of your RAM instead of your hard drive, but the end result would be the same: you would still be limited by VRAM.
>
> I feel the only thing this would achieve is fewer reads on your hard drive by loading up a bunch of models into RAM.

I've messed around with it, and it does work to give you "more" VRAM, but there's still a limit to how much you can reasonably use, and it impacts performance significantly.

iDeNoh avatar May 26 '23 15:05 iDeNoh

> it impacts performance significantly

A performance hit would be preferred to the operation failing due to a lack of VRAM.

If it were only activated when the VRAM was close to running out, then you wouldn't notice the performance hit unless you crossed that threshold.

spreck avatar May 26 '23 17:05 spreck

> > it impacts performance significantly
>
> A performance hit would be preferred to the operation failing due to a lack of VRAM.
>
> If it were only activated when the VRAM was close to running out, then you wouldn't notice the performance hit unless you crossed that threshold.

Yeah, that's how it worked, but it was DirectML on Windows, so the performance hit made it about as bad as CPU, and it hit system performance way more.

iDeNoh avatar May 27 '23 04:05 iDeNoh

> perhaps this will answer the question: https://superuser.com/questions/1545812/can-the-gpu-use-the-main-computer-ram-as-an-extension

My generations that are "too large" already use the much slower NVMe as a "pagefile emergency", so I'd rather have the considerably faster RAM be the "emergency memory" instead.

DuckersMcQuack avatar Jun 01 '23 01:06 DuckersMcQuack

Offloading even a few GB to RAM is extremely slow, and it is already supported out of the box in newer NVIDIA drivers (535 and above for consumer cards, 531 and above for pro graphics). Offloading 10+ GB makes no sense; 99.9% of the generation time would be spent copying data back and forth.

LabunskyA avatar Jul 05 '23 23:07 LabunskyA

> Offloading even a few GB to RAM is extremely slow, and it is already supported out of the box in newer NVIDIA drivers (535 and above for consumer cards, 531 and above for pro graphics). Offloading 10+ GB makes no sense; 99.9% of the generation time would be spent copying data back and forth.

Nope! After I experienced this, where webUI practically dies and can take a minute per damn step, I'll take a memory limiter instead and just use --medvram and --lowvram any day, lol.
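
--medvram and --lowvram are the webui's own flags; as a crude framework-level "memory limiter", here is a minimal sketch assuming PyTorch, with a made-up 90% cap:

```python
import torch

# Hypothetical sketch of a hard "memory limiter": cap this process at ~90% of
# the card's VRAM so allocations beyond that raise an out-of-memory error
# instead of silently spilling into slow shared system memory.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.9, device=0)
```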

DuckersMcQuack avatar Jul 13 '23 14:07 DuckersMcQuack

> Nope! After I experienced this, where webUI practically dies and can take a minute per damn step, I'll take a memory limiter instead and just use --medvram and --lowvram any day, lol.

Yep, that's what I meant. You should not use shared memory for actual generation. Its main purpose is to keep image reconstruction from failing, but even that basically kills your PC for a minute or two on larger images.

LabunskyA avatar Jul 14 '23 08:07 LabunskyA

Such a feature is great when used intelligently. For NVIDIA at least, newer drivers made over-allocation in these workloads a good default, and the slowdown is not much on cards with a PCIe 4.0 x16 interface.

For example, some upscaling steps can use more VRAM for a limited amount of time, and in terms of real-world performance, a PCIe 4.0 x16 card can usually get around 25GB/s of access to system RAM. In cases where you get around 3-4GB of spillover during upscaling, that tends to add around 2 minutes of additional delay before memory usage drops back to a level the VRAM can fully handle.
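
A back-of-envelope check of those numbers (the step count and effective bandwidth below are assumptions for illustration, not measurements): a single bulk copy of the spillover over a ~25GB/s link is quick; the minutes of delay come from the spilled region being touched on many steps at a far lower effective rate than the link's peak.

```python
# Rough arithmetic with assumed numbers.
spill_gb = 3.5
link_gbps = 25.0                                 # peak host<->device bandwidth (PCIe 4.0 x16)
one_way = spill_gb / link_gbps                   # ~0.14 s for a single bulk copy
steps, effective_gbps = 40, 2.0                  # repeated, scattered accesses run far below peak
total = steps * 2 * spill_gb / effective_gbps    # read + write per step -> ~140 s
print(f"single copy: {one_way:.2f} s, repeated access: {total:.0f} s")
```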

If you look at VRAM usage during image generation, when you get a VRAM failure it is usually a select few steps causing a spike in VRAM usage before usage drops back to normal levels, so only those few steps will suffer a slowdown from shared memory.

Razor512 avatar Jun 26 '24 19:06 Razor512