[Issue]: High VRAM usage during VAE step
Issue Description
VRAM usage during the VAE step is inconsistent and will spike above 12 GB for an SDXL model. This is atypical for my usage, where an SDXL model stays at 10 GB or less during the VAE step with all of my settings applied:
- fp16 mode, VAE slicing and VAE tiling enabled, VAE upcast disabled, 1024x1024, 10 steps, DPM++ 2M, SDXL timestep presets, CFG = 3, no attention guidance, no LoRAs applied.

Disabling "use cached model config when available" removes the issue, and generation times are 8-10 seconds.
VRAM usage reported in the console does not match what task manager or the webui shows; attached is a screenshot of the VRAM usage during a run. sdnext (1).log
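For context, this is roughly what those settings map to at the diffusers level (a minimal sketch, not the SD.Next code path; the checkpoint path is a placeholder):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# placeholder checkpoint path; any SDXL single-file checkpoint works here
pipe = StableDiffusionXLPipeline.from_single_file(
    "models/Stable-diffusion/novaAnimeXL_ponyV40.safetensors",
    torch_dtype=torch.float16,        # "fp16 mode"
)
pipe.vae.enable_slicing()             # "vae slicing = true": decode batch images one at a time
pipe.vae.enable_tiling()              # "vae tiling = true": decode the latent in tiles to cap peak VRAM
pipe.vae.config.force_upcast = False  # "vae upcast = false": keep the VAE in fp16
pipe = pipe.to("cuda")
```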
Version Platform Description
```
13:30:50-670748 INFO Logger: file="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\sdnext.log" level=DEBUG size=65 mode=create
13:30:50-672246 INFO Python version=3.10.6 platform=Windows bin="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\venv\Scripts\python.exe" venv="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\venv"
13:30:50-859782 INFO Version: app=sd.next updated=2024-09-10 hash=91bdd3b3 branch=dev url=https://github.com/vladmandic/automatic.git/tree/dev ui=dev
13:30:51-186334 INFO Updating main repository
13:30:52-008006 INFO Upgraded to version: 91bdd3b3 Tue Sep 10 19:20:49 2024 +0300
13:30:52-015505 INFO Platform: arch=AMD64 cpu=AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD system=Windows release=Windows-10-10.0.22631-SP0 python=3.10.6
13:30:52-017006 DEBUG Setting environment tuning
13:30:52-018506 INFO HF cache folder: C:\Users\zaxof\.cache\huggingface\hub
13:30:52-019506 DEBUG Torch allocator: "garbage_collection_threshold:0.80,max_split_size_mb:512"
13:30:52-026016 DEBUG Torch overrides: cuda=False rocm=False ipex=False diml=False openvino=False
13:30:52-027513 DEBUG Torch allowed: cuda=True rocm=True ipex=True diml=True openvino=True
13:30:52-037517 INFO nVidia CUDA toolkit detected: nvidia-smi present
```
Extensions : Extensions all: ['a1111-sd-webui-tagcomplete', 'adetailer', 'OneButtonPrompt', 'sd-civitai-browser-plus_fix', 'sd-webui-infinite-image-browsing', 'sd-webui-inpaint-anything', 'sd-webui-prompt-all-in-one']
Windows 11, RTX 3060 12gb, 5700x3d, 64gb ddr4, dev branch SDNEXT, firefox browser on desktop, chrome on android for remote access.
Relevant log output
No response
Backend
Diffusers
UI
Standard
Branch
Dev
Model
StableDiffusion XL
Acknowledgements
- [X] I have read the above and searched for existing issues
- [X] I confirm that this is classified correctly and it's not an extension issue
i cannot reproduce. i've added some extra logging, please set env variable SD_VAE_DEBUG=true and run. note that sdnext should be restarted after changing "use cached model config".
post logs, starting with the TRACE lines, for both runs.
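For anyone following along, one way to set that variable when launching from a script (a sketch; the usual entry point and flags are assumptions, adjust to your setup):

```python
import os
import subprocess

env = os.environ.copy()
env["SD_VAE_DEBUG"] = "true"   # enables the extra TRACE output around VAE decode

# launch a fresh instance so the changed "use cached model config" setting is picked up
subprocess.run(["python", "launch.py", "--debug"], env=env, check=False)
```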
re-ran it with the env variable active.
```
13:51:08-388186 DEBUG Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
13:51:08-390185 DEBUG Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
13:51:08-392183 DEBUG Torch generator: device=cuda seeds=[1158966623]
13:51:08-393184 DEBUG Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 20, 'eta': 1.0,
'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
'height': 1024, 'parser': 'Full parser'}
Progress 1.49it/s █████████████████████████████████ 100% 10/10 00:06 00:00 Base
13:51:15-342317 DEBUG GC: utilization={'gpu': 71, 'ram': 3, 'threshold': 80} gc={'collected': 386, 'saved': 0.66}
before={'gpu': 8.55, 'ram': 2.16} after={'gpu': 7.89, 'ram': 2.16, 'retries': 0, 'oom': 0}
device=cuda fn=full_vae_decode time=0.22
13:51:15-857910 TRACE VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
('mid_block_add_attention', True), ('_use_default_values', ['latents_std',
'use_post_quant_conv', 'mid_block_add_attention', 'latents_mean', 'shift_factor',
'use_quant_conv']), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.20.0.dev0'),
('_name_or_path', '../sdxl-vae/')])
13:51:15-861410 TRACE VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 137363456, 'total':
12884377600, 'active': 2672, 'active_peak': 9631253504, 'reserved': 11605639168,
'reserved_peak': 11725176832, 'used': 12747014144})
13:51:15-863413 TRACE VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=0.741
13:51:16-051942 DEBUG Profile: VAE decode: 0.93
13:51:16-298983 DEBUG GC: utilization={'gpu': 99, 'ram': 3, 'threshold': 80} gc={'collected': 254, 'saved': 3.97}
before={'gpu': 11.87, 'ram': 2.16} after={'gpu': 7.9, 'ram': 2.16, 'retries': 0, 'oom': 0}
device=cuda fn=vae_decode time=0.25
13:51:16-343487 INFO Save: image="outputs\text\06720-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
type=JPEG width=1024 height=1024 size=133251
13:51:16-345488 INFO Processed: images=1 time=7.97 its=1.25 memory={'ram': {'used': 2.16, 'total': 63.9}, 'gpu':
{'used': 7.9, 'total': 12.0}, 'retries': 0, 'oom': 0}
13:51:22-375787 INFO Base: class=StableDiffusionXLPipeline
13:51:22-377290 DEBUG Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
13:51:22-379288 DEBUG Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
13:51:22-381287 DEBUG Torch generator: device=cuda seeds=[2858245960]
13:51:22-382287 DEBUG Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 20, 'eta': 1.0,
'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
'height': 1024, 'parser': 'Full parser'}
Progress 1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
13:51:31-083449 DEBUG GC: utilization={'gpu': 71, 'ram': 3, 'threshold': 80} gc={'collected': 385, 'saved': 0.57}
before={'gpu': 8.46, 'ram': 2.16} after={'gpu': 7.89, 'ram': 2.16, 'retries': 0, 'oom': 0}
device=cuda fn=full_vae_decode time=0.22
13:51:36-590122 TRACE VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
('mid_block_add_attention', True), ('_use_default_values', ['latents_std',
'use_post_quant_conv', 'mid_block_add_attention', 'latents_mean', 'shift_factor',
'use_quant_conv']), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.20.0.dev0'),
('_name_or_path', '../sdxl-vae/')])
13:51:36-593621 TRACE VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 3857711104, 'total':
12884377600, 'active': 2672, 'active_peak': 9631253504, 'reserved': 7885291520,
'reserved_peak': 11366563840, 'used': 9026666496})
13:51:36-595622 TRACE VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=5.727
13:51:36-606121 DEBUG Profile: VAE decode: 5.74
13:51:36-646633 INFO Save: image="outputs\text\06721-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
type=JPEG width=1024 height=1024 size=150607
13:51:36-648635 INFO Processed: images=1 time=14.29 its=0.70 memory={'ram': {'used': 2.16, 'total': 63.9}, 'gpu':
{'used': 8.41, 'total': 12.0}, 'retries': 0, 'oom': 0}
```
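As an aside, the counters in the `TRACE VAE memory` lines above map roughly onto standard torch queries; a sketch for checking them independently (SD.Next's exact bookkeeping may differ, e.g. its `active` field looks like an allocation count rather than bytes):

```python
import torch

free, total = torch.cuda.mem_get_info()            # 'free' / 'total' (bytes, per the driver)
stats = torch.cuda.memory_stats()
reserved = stats["reserved_bytes.all.current"]     # 'reserved'
reserved_peak = stats["reserved_bytes.all.peak"]   # 'reserved_peak'
active_peak = stats["active_bytes.all.peak"]       # 'active_peak'
used = total - free                                # 'used'
print(f"free={free} used={used} reserved={reserved} reserved_peak={reserved_peak} active_peak={active_peak}")
```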
It is inconsistent; sometimes the VAE is fast, other times it takes almost as long as the whole generation.
Here is after a restart with "cached config" unchecked.
```
14:01:59-877365 INFO Base: class=StableDiffusionXLPipeline
14:01:59-879361 DEBUG Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
14:01:59-880862 DEBUG Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
14:01:59-882861 DEBUG Torch generator: device=cuda seeds=[1523340005]
14:01:59-883862 DEBUG Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 10, 'eta': 1.0,
'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
'height': 1024, 'parser': 'Full parser'}
Progress ?it/s 0% 0/10 00:00 ? Base
14:02:00-413434 DEBUG Server: alive=True jobs=0 requests=352 uptime=313 memory=1.91/63.9 backend=Backend.DIFFUSERS
state=idle
Progress 1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
14:02:08-625569 DEBUG GC: utilization={'gpu': 66, 'ram': 3, 'threshold': 80} gc={'collected': 393, 'saved': 0.0}
before={'gpu': 7.9, 'ram': 1.91} after={'gpu': 7.9, 'ram': 1.91, 'retries': 0, 'oom': 0}
device=cuda fn=full_vae_decode time=0.22
14:02:12-525005 TRACE VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
('mid_block_add_attention', True), ('_use_default_values', ['use_quant_conv', 'latents_mean',
'mid_block_add_attention', 'use_post_quant_conv', 'latents_std', 'shift_factor']),
('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.21.0.dev0'), ('_name_or_path',
'/home/patrick/.cache/huggingface/hub/models--lykon-models--dreamshaper-8/snapshots/7e855e3f481
832419503d1fa18d4a4379597f04b/vae')])
14:02:12-528508 TRACE VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 4344250368, 'total':
12884377600, 'active': 2673, 'active_peak': 8134968320, 'reserved': 7390363648,
'reserved_peak': 8545894400, 'used': 8540127232})
14:02:12-530504 TRACE VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=4.116
14:02:12-538504 DEBUG Profile: VAE decode: 4.13
14:02:12-582516 INFO Save: image="outputs\text\06727-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
type=JPEG width=1024 height=1024 size=166048
14:02:12-584518 INFO Processed: images=1 time=12.72 its=0.79 memory={'ram': {'used': 1.92, 'total': 63.9}, 'gpu':
{'used': 7.95, 'total': 12.0}, 'retries': 0, 'oom': 0}
14:03:31-516232 INFO Base: class=StableDiffusionXLPipeline
14:03:31-518232 DEBUG Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
14:03:31-520232 DEBUG Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
14:03:31-522235 DEBUG Torch generator: device=cuda seeds=[434621457]
14:03:31-523232 DEBUG Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 10, 'eta': 1.0,
'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
'height': 1024, 'parser': 'Full parser'}
Progress 1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
14:03:40-225062 DEBUG GC: utilization={'gpu': 66, 'ram': 3, 'threshold': 80} gc={'collected': 399, 'saved': 0.0}
before={'gpu': 7.9, 'ram': 1.9} after={'gpu': 7.9, 'ram': 1.9, 'retries': 0, 'oom': 0}
device=cuda fn=full_vae_decode time=0.22
14:03:44-130741 TRACE VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
('mid_block_add_attention', True), ('_use_default_values', ['use_quant_conv', 'latents_mean',
'mid_block_add_attention', 'use_post_quant_conv', 'latents_std', 'shift_factor']),
('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.21.0.dev0'), ('_name_or_path',
'/home/patrick/.cache/huggingface/hub/models--lykon-models--dreamshaper-8/snapshots/7e855e3f481
832419503d1fa18d4a4379597f04b/vae')])
14:03:44-134241 TRACE VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 4344250368, 'total':
12884377600, 'active': 2673, 'active_peak': 8134968320, 'reserved': 7390363648,
'reserved_peak': 8545894400, 'used': 8540127232})
14:03:44-136242 TRACE VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=4.121
14:03:44-144242 DEBUG Profile: VAE decode: 4.13
14:03:44-186753 INFO Save: image="outputs\text\06728-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
type=JPEG width=1024 height=1024 size=139766
14:03:44-189254 INFO Processed: images=1 time=12.69 its=0.79 memory={'ram': {'used': 1.92, 'total': 63.9}, 'gpu':
{'used': 7.95, 'total': 12.0}, 'retries': 0, 'oom': 0}
```
i can see some difference with vs without config: 8.1gb vs 9.6gb, but i also see absolutely zero differences in the config itself. and there is no proof of a vram spike above 12gb as originally reported.
also, no matter what i do, i cannot reproduce this. if someone has an idea or is able to reproduce separately, i'm really curious.
I did a fresh installation and the issue persisted. I realized that the configs are cached in users/myusername/.cache/huggingface. I just deleted all of that, but are there any other shared locations where cached data might hide that could be contributing to my problem?
downloaded config is in users/myusername/.cache/huggingface
if you use the "cached config" option, it's exactly so this download is not required and it will use the config in configs/ (for sdxl, it would be configs/sdxl)
also, you say that the issue persists - but none of the logs you've uploaded with SD_VAE_DEBUG enabled show a spike above 10gb.
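Illustration of the two paths being compared (the local config filename is an assumption; the hub repo is just an example SDXL VAE):

```python
import json
import torch
from diffusers import AutoencoderKL

# 1) without "cached config": the config is resolved from the hub and cached under
#    ~/.cache/huggingface (or %USERPROFILE%\.cache\huggingface on Windows)
vae_from_hub = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)

# 2) with "cached config": instantiate from a bundled config such as configs/sdxl/...
#    (exact filename assumed here), so no download is needed; weights are loaded separately
with open("configs/sdxl/vae/config.json") as f:
    vae_from_local_config = AutoencoderKL.from_config(json.load(f))
```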
(had to delete my prior comment, formatting got jumbled)
My VRAM usage spikes above 10 GB per task manager and in the webui under the preview image (labeled "GPU active"). VRAM usage is a bit inconsistent overall; there's probably some GC tweaking that I need to do.
My hunch is that VAE tiling isn't being applied, but that's based only on the pattern I see: VRAM usage is identical with it on or off when using the cached configuration. Let me know if there's anything else I can try.
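A quick way to test that hunch, assuming a loaded diffusers pipeline `pipe` (e.g. from the sketch near the top of this issue): `enable_tiling()` / `enable_slicing()` just flip flags on the `AutoencoderKL`, so they can be inspected at runtime.

```python
# inspect the VAE state right before decode (pipe is assumed to be a loaded SDXL pipeline)
vae = pipe.vae
print(type(vae).__name__)                                  # expect AutoencoderKL
print("use_tiling:", getattr(vae, "use_tiling", None))     # set by enable_tiling()
print("use_slicing:", getattr(vae, "use_slicing", None))   # set by enable_slicing()
print("dtype:", vae.dtype, "force_upcast:", vae.config.force_upcast)
```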
Test setup (3-run averages, 1024x1024 resolution):

- SD.Next dev branch, Windows 11 Pro 23H2, Torch 2.4.1+cu124
- RTX 3060 12 GB, driver 555.99
- Ryzen 5700X3D, 64 GB DDR4 3600 MHz CL18
- RainponyXL, SDXL fp16 fixed VAE

| | Cached config off, tiling on | Cached config off, tiling off | Cached config on, tiling on | Cached config on, tiling off |
|---|---|---|---|---|
| VAE decode (s) | 3.08 | 3.67 | 3.89 | 3.91 |
| Active (MB) | 8103 | 10983 | 11001 | 10935 |
| Reserved (MB) | 7368 | 7357 | 7476 | 7410 |
| Used (MB) | 8452 | 8444 | 8560 | 8494 |
| Free (MB) | 3836 | 3844 | 3728 | 3794 |
ah, i may have found it. seems like the vae was not typecast to fp16 if a config was specified. so even if upcast is disabled, it's pointless since it's loaded as fp32.
update and try to reproduce. if the issue persists, update here and i'll reopen.
and upload the full log for both runs, with and without config.
before running the test, set env variable SD_VAE_DEBUG=true
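A sketch of the kind of fix described above (not the actual commit): when the VAE is built from a cached/local config, explicitly align its dtype with the rest of the pipeline unless upcasting was requested.

```python
import torch

def align_vae_dtype(pipe, upcast: bool = False):
    # if upcast is off, the VAE should follow the pipeline dtype (fp16 here), not fp32
    target = torch.float32 if upcast else pipe.unet.dtype
    if pipe.vae.dtype != target:
        pipe.vae.to(dtype=target)
    return pipe
```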
cached config OFF.log cached config ON.log
Issue still persists. I've attached screenshots of the webui generation info plus screenshots of task manager during each run. Cached config uses significantly more VRAM and starts using shared memory.
I used a fresh instance of sdnext dev without extensions. I ran 2 generations and attached the logs with --debug and the SD_VAE_DEBUG=true env variable.
i've reopened if someone wants to take a shot at it. i consider this very low priority since it's not reproducible AND a workaround is well known.
my vram spiked twice and crashed my system, just saying he's not the only one. when i tested the same model in invoke, the system stayed stable.
general statements without logs or any info on platform or settings are not helpful.
https://github.com/vladmandic/automatic/discussions/3471 - couldn't find any info on any of those questions, and i feel like the answers to those hold some info related to this. answer those questions in detail and we'll come back to this.
that item is not related at all.
there is an issue with how your system is handling diffusers.
maybe there is. create an issue and document it. do not post random comments on completely unrelated issues.
I have the same issue when generating images with SDXL. Could you please add an argument like --cpu-vae to make only the VAE run on the CPU, in order to prevent the disappointing OOM at the end of generation? I think this would solve the problem completely, despite the slow decode. It may be a rescue for my machine, which has only 4 GB of VRAM. 😂
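There is no --cpu-vae flag in SD.Next as far as this thread shows, but the idea can be sketched with plain diffusers: generate with output_type="latent", then decode on the CPU, trading speed for VRAM.

```python
import torch

def decode_on_cpu(pipe, latents):
    # move the VAE to CPU and decode in fp32 there; slow, but avoids the end-of-run VRAM spike
    vae = pipe.vae.to("cpu", dtype=torch.float32)
    latents = latents.to("cpu", dtype=torch.float32) / vae.config.scaling_factor
    with torch.no_grad():
        image = vae.decode(latents).sample
    return pipe.image_processor.postprocess(image, output_type="pil")
```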
this should no longer be an issue with improvements to balanced offloading.
if issue persists, please run with --monitor 3 and upload full log here and i'll reopen.
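SD.Next's balanced offload is its own mechanism, but for reference the closest stock diffusers switches look like this (a sketch, not the SD.Next implementation; the checkpoint path is a placeholder):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "models/Stable-diffusion/model.safetensors", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()         # keep sub-models on CPU, move each to GPU only while in use
# pipe.enable_sequential_cpu_offload()  # even lower VRAM, much slower; useful on very small GPUs
```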
was i wrong, btw? i was trying to help. i am a fine artist and i was starting out back then (sometimes an extra set of eyes helps). i know a lot more now, you can remove the downvote : ) ( ^_^)o自自o(^_^ )