
[Issue]: High VRAM usage during Vae step

Open zaxwashere opened this issue 1 year ago • 16 comments

Issue Description

VRAM usage during the VAE step is inconsistent and will spike to over 12 GB for an SDXL model. This is atypical for my usage, where an SDXL model stays at 10 GB or less during the VAE step with all of my settings applied:

  • fp16 mode, VAE slicing and VAE tiling enabled, VAE upcast disabled, 1024x1024, 10 steps, DPM++ 2M, SDXL timestep presets, CFG = 3, no attention guidance, no LoRAs applied.

Disabling "use cached model config when available" removes the issue, and generation speeds will be 8 -10 seconds.

VRAM usage reported in the console does not reflect the usage shown in Task Manager or in the webui; attached is a screenshot of VRAM usage during a run. sdnext (1).log
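For reference, this is roughly how those settings map onto the plain diffusers API (a sketch against a stock StableDiffusionXLPipeline, not SD.Next's own pipeline code; the model id is a placeholder):

```python
# Sketch of the memory-related VAE settings listed above, using stock diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model id
    torch_dtype=torch.float16,                   # fp16 mode
).to("cuda")

pipe.enable_vae_slicing()              # vae slicing = true
pipe.enable_vae_tiling()               # vae tiling = true
pipe.vae.config.force_upcast = False   # vae upcast = false

image = pipe(
    prompt="test prompt",
    width=1024, height=1024,
    num_inference_steps=10,
    guidance_scale=3,
).images[0]
```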


Version Platform Description

```
13:30:50-670748 INFO Logger: file="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\sdnext.log" level=DEBUG size=65 mode=create
13:30:50-672246 INFO Python version=3.10.6 platform=Windows bin="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\venv\Scripts\python.exe" venv="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\venv"
13:30:50-859782 INFO Version: app=sd.next updated=2024-09-10 hash=91bdd3b3 branch=dev url=https://github.com/vladmandic/automatic.git/tree/dev ui=dev
13:30:51-186334 INFO Updating main repository
13:30:52-008006 INFO Upgraded to version: 91bdd3b3 Tue Sep 10 19:20:49 2024 +0300
13:30:52-015505 INFO Platform: arch=AMD64 cpu=AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD system=Windows release=Windows-10-10.0.22631-SP0 python=3.10.6
13:30:52-017006 DEBUG Setting environment tuning
13:30:52-018506 INFO HF cache folder: C:\Users\zaxof.cache\huggingface\hub
13:30:52-019506 DEBUG Torch allocator: "garbage_collection_threshold:0.80,max_split_size_mb:512"
13:30:52-026016 DEBUG Torch overrides: cuda=False rocm=False ipex=False diml=False openvino=False
13:30:52-027513 DEBUG Torch allowed: cuda=True rocm=True ipex=True diml=True openvino=True
13:30:52-037517 INFO nVidia CUDA toolkit detected: nvidia-smi present
```

Extensions : Extensions all: ['a1111-sd-webui-tagcomplete', 'adetailer', 'OneButtonPrompt', 'sd-civitai-browser-plus_fix', 'sd-webui-infinite-image-browsing', 'sd-webui-inpaint-anything', 'sd-webui-prompt-all-in-one']

Windows 11, RTX 3060 12 GB, 5700X3D, 64 GB DDR4, dev branch SD.Next, Firefox browser on desktop, Chrome on Android for remote access.

Relevant log output

No response

Backend

Diffusers

UI

Standard

Branch

Dev

Model

StableDiffusion XL

Acknowledgements

  • [X] I have read the above and searched for existing issues
  • [X] I confirm that this is classified correctly and its not an extension issue

zaxwashere avatar Sep 10 '24 17:09 zaxwashere

i cannot reproduce. i've added some extra logging; please set env variable SD_VAE_DEBUG=true and run. note that sdnext should be restarted after changing "use cached model config". post logs starting with TRACE for both runs.

vladmandic avatar Sep 11 '24 12:09 vladmandic

re-ran it with the env variable active.

```
13:51:08-388186 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
13:51:08-390185 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
13:51:08-392183 DEBUG    Torch generator: device=cuda seeds=[1158966623]
13:51:08-393184 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 20, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress  1.49it/s █████████████████████████████████ 100% 10/10 00:06 00:00 Base
13:51:15-342317 DEBUG    GC: utilization={'gpu': 71, 'ram': 3, 'threshold': 80} gc={'collected': 386, 'saved': 0.66}
                         before={'gpu': 8.55, 'ram': 2.16} after={'gpu': 7.89, 'ram': 2.16, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
13:51:15-857910 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['latents_std',
                         'use_post_quant_conv', 'mid_block_add_attention', 'latents_mean', 'shift_factor',
                         'use_quant_conv']), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.20.0.dev0'),
                         ('_name_or_path', '../sdxl-vae/')])
13:51:15-861410 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 137363456, 'total':
                         12884377600, 'active': 2672, 'active_peak': 9631253504, 'reserved': 11605639168,
                         'reserved_peak': 11725176832, 'used': 12747014144})
13:51:15-863413 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=0.741
13:51:16-051942 DEBUG    Profile: VAE decode: 0.93
13:51:16-298983 DEBUG    GC: utilization={'gpu': 99, 'ram': 3, 'threshold': 80} gc={'collected': 254, 'saved': 3.97}
                         before={'gpu': 11.87, 'ram': 2.16} after={'gpu': 7.9, 'ram': 2.16, 'retries': 0, 'oom': 0}
                         device=cuda fn=vae_decode time=0.25
13:51:16-343487 INFO     Save: image="outputs\text\06720-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=133251
13:51:16-345488 INFO     Processed: images=1 time=7.97 its=1.25 memory={'ram': {'used': 2.16, 'total': 63.9}, 'gpu':
                         {'used': 7.9, 'total': 12.0}, 'retries': 0, 'oom': 0}
13:51:22-375787 INFO     Base: class=StableDiffusionXLPipeline
13:51:22-377290 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
13:51:22-379288 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
13:51:22-381287 DEBUG    Torch generator: device=cuda seeds=[2858245960]
13:51:22-382287 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 20, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress  1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
13:51:31-083449 DEBUG    GC: utilization={'gpu': 71, 'ram': 3, 'threshold': 80} gc={'collected': 385, 'saved': 0.57}
                         before={'gpu': 8.46, 'ram': 2.16} after={'gpu': 7.89, 'ram': 2.16, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
13:51:36-590122 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['latents_std',
                         'use_post_quant_conv', 'mid_block_add_attention', 'latents_mean', 'shift_factor',
                         'use_quant_conv']), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.20.0.dev0'),
                         ('_name_or_path', '../sdxl-vae/')])
13:51:36-593621 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 3857711104, 'total':
                         12884377600, 'active': 2672, 'active_peak': 9631253504, 'reserved': 7885291520,
                         'reserved_peak': 11366563840, 'used': 9026666496})
13:51:36-595622 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=5.727
13:51:36-606121 DEBUG    Profile: VAE decode: 5.74
13:51:36-646633 INFO     Save: image="outputs\text\06721-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=150607
13:51:36-648635 INFO     Processed: images=1 time=14.29 its=0.70 memory={'ram': {'used': 2.16, 'total': 63.9}, 'gpu':
                         {'used': 8.41, 'total': 12.0}, 'retries': 0, 'oom': 0}```
                         
It is inconsistent: sometimes the VAE is fast, other times it takes almost as long as the whole generation.
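For what it's worth, the free/used/peak numbers in the VAE memory trace look like torch's allocator counters; this is roughly how I cross-check them against Task Manager (a sketch assuming a CUDA build of torch, not necessarily exactly what SD.Next logs):

```python
import torch

def vram_snapshot(device: int = 0) -> dict:
    # Counters comparable to the 'VAE memory' TRACE line above (key names are assumptions).
    free, total = torch.cuda.mem_get_info(device)   # bytes free/total on the device
    stats = torch.cuda.memory_stats(device)         # CUDA caching-allocator counters
    return {
        "free": free,
        "total": total,
        "active_peak": stats.get("active_bytes.all.peak", 0),
        "reserved": stats.get("reserved_bytes.all.current", 0),
        "reserved_peak": stats.get("reserved_bytes.all.peak", 0),
        "used": total - free,
    }

if torch.cuda.is_available():
    print(vram_snapshot())
```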


Here is the result after a restart with "cached config" unchecked.


```
14:01:59-877365 INFO     Base: class=StableDiffusionXLPipeline
14:01:59-879361 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
14:01:59-880862 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
14:01:59-882861 DEBUG    Torch generator: device=cuda seeds=[1523340005]
14:01:59-883862 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 10, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress ?it/s                                              0% 0/10 00:00 ? Base
14:02:00-413434 DEBUG    Server: alive=True jobs=0 requests=352 uptime=313 memory=1.91/63.9 backend=Backend.DIFFUSERS
                         state=idle
Progress  1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
14:02:08-625569 DEBUG    GC: utilization={'gpu': 66, 'ram': 3, 'threshold': 80} gc={'collected': 393, 'saved': 0.0}
                         before={'gpu': 7.9, 'ram': 1.91} after={'gpu': 7.9, 'ram': 1.91, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
14:02:12-525005 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['use_quant_conv', 'latents_mean',
                         'mid_block_add_attention', 'use_post_quant_conv', 'latents_std', 'shift_factor']),
                         ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.21.0.dev0'), ('_name_or_path',
                         '/home/patrick/.cache/huggingface/hub/models--lykon-models--dreamshaper-8/snapshots/7e855e3f481
                         832419503d1fa18d4a4379597f04b/vae')])
14:02:12-528508 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 4344250368, 'total':
                         12884377600, 'active': 2673, 'active_peak': 8134968320, 'reserved': 7390363648,
                         'reserved_peak': 8545894400, 'used': 8540127232})
14:02:12-530504 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=4.116
14:02:12-538504 DEBUG    Profile: VAE decode: 4.13
14:02:12-582516 INFO     Save: image="outputs\text\06727-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=166048
14:02:12-584518 INFO     Processed: images=1 time=12.72 its=0.79 memory={'ram': {'used': 1.92, 'total': 63.9}, 'gpu':
                         {'used': 7.95, 'total': 12.0}, 'retries': 0, 'oom': 0}
14:03:31-516232 INFO     Base: class=StableDiffusionXLPipeline
14:03:31-518232 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
14:03:31-520232 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
14:03:31-522235 DEBUG    Torch generator: device=cuda seeds=[434621457]
14:03:31-523232 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 10, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress  1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
14:03:40-225062 DEBUG    GC: utilization={'gpu': 66, 'ram': 3, 'threshold': 80} gc={'collected': 399, 'saved': 0.0}
                         before={'gpu': 7.9, 'ram': 1.9} after={'gpu': 7.9, 'ram': 1.9, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
14:03:44-130741 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['use_quant_conv', 'latents_mean',
                         'mid_block_add_attention', 'use_post_quant_conv', 'latents_std', 'shift_factor']),
                         ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.21.0.dev0'), ('_name_or_path',
                         '/home/patrick/.cache/huggingface/hub/models--lykon-models--dreamshaper-8/snapshots/7e855e3f481
                         832419503d1fa18d4a4379597f04b/vae')])
14:03:44-134241 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 4344250368, 'total':
                         12884377600, 'active': 2673, 'active_peak': 8134968320, 'reserved': 7390363648,
                         'reserved_peak': 8545894400, 'used': 8540127232})
14:03:44-136242 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=4.121
14:03:44-144242 DEBUG    Profile: VAE decode: 4.13
14:03:44-186753 INFO     Save: image="outputs\text\06728-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=139766
14:03:44-189254 INFO     Processed: images=1 time=12.69 its=0.79 memory={'ram': {'used': 1.92, 'total': 63.9}, 'gpu':
                         {'used': 7.95, 'total': 12.0}, 'retries': 0, 'oom': 0}
```


   

zaxwashere avatar Sep 11 '24 18:09 zaxwashere

i can see some difference with vs without config: 8.1gb vs 9.6gb, but i also see absolutely zero differences in the config itself. and there is no proof of a vram spike above 12gb as originally reported.

also, no matter what i do, i cannot reproduce this. if someone has an idea or is able to reproduce separately, i'm really curious.

vladmandic avatar Sep 11 '24 22:09 vladmandic

I did a fresh installation and the issue persisted. I realized that the configs are cached in users/myusername/.cache/huggingface and deleted all of that, but are there any other shared locations where cached data might be hiding and contributing to my problem?

zaxwashere avatar Sep 12 '24 13:09 zaxwashere

downloaded config is in users/myusername/.cache/huggingface. if you use the "cached config" option, it's exactly so this download is not required and the config in configs/ is used instead (for sdxl, it would be configs/sdxl). also, you say that the issue persists - but none of the logs you've uploaded with SD_VAE_DEBUG enabled show the spike above 10gb.

vladmandic avatar Sep 12 '24 14:09 vladmandic

(had to delete my prior comment, formatting got jumbled)

My VRAM usage spikes above 10 GB according to Task Manager and the webui readout under the preview image (labeled "GPU active"). VRAM usage is a bit inconsistent overall; there is probably some GC tweaking that I need to do.

My hunch is that VAE tiling isn't being applied, but that's based only on the pattern I see. VRAM usage is identical with it on or off when using the cached configuration (see the quick check sketched after the table below). Let me know if there's anything else I can try.

  • SDNext dev branch
  • RTX 3060 12 GB (driver 555.99)
  • Windows 11 Pro 23H2
  • Torch 2.4.1+cu124
  • 64 GB DDR4 3600 MHz CL18
  • Ryzen 5700X3D
  • RainponyXL, SDXL fp16 fixed VAE
         
3-run averages, 1024x resolution:

|                   | Cached config off, tiling on | Cached config off, tiling off | Cached config on, tiling on | Cached config on, tiling off |
|-------------------|------------------------------|-------------------------------|-----------------------------|------------------------------|
| VAE decode (secs) | 3.08                         | 3.67                          | 3.89                        | 3.91                         |
| Active            | 8103                         | 10983                         | 11001                       | 10935                        |
| Reserved          | 7368                         | 7357                          | 7476                        | 7410                         |
| Used              | 8452                         | 8444                          | 8560                        | 8494                         |
| Free              | 3836                         | 3844                          | 3728                        | 3794                         |
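If it helps, this is the quick check I'd use for the tiling hunch (a sketch assuming the loaded diffusers pipeline is reachable as `pipe`; the attribute names are from diffusers' AutoencoderKL, not SD.Next internals):

```python
vae = pipe.vae  # hypothetical handle to the loaded pipeline's VAE
print("dtype:", vae.dtype)                                # expect torch.float16 in fp16 mode
print("use_slicing:", getattr(vae, "use_slicing", None))  # set by enable_slicing()
print("use_tiling:", getattr(vae, "use_tiling", None))    # set by enable_tiling()
print("force_upcast:", vae.config.force_upcast)           # should be False with upcast disabled
```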

zaxwashere avatar Sep 12 '24 19:09 zaxwashere

ah, i may have found it. seems like the vae was not typecast to fp16 if a config was specified. so even if upcast is disabled, it's pointless since it's loaded as fp32.

update and try to reproduce. if the issue persists, update here and i'll reopen, and upload the full log for both runs with and without config. before running the test, set env variable SD_VAE_DEBUG=true.
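to illustrate the failure mode (sketch only, not the actual patch; the config path below is just a placeholder):

```python
import torch
from diffusers import AutoencoderKL

# building a vae from a local config does not apply any dtype, so the weights come up as fp32
config = AutoencoderKL.load_config("configs/sdxl/vae")   # placeholder path
vae = AutoencoderKL.from_config(config)
print(vae.dtype)                   # torch.float32

# the missing step: cast to fp16 so disabling upcast actually has an effect
vae = vae.to(dtype=torch.float16)
print(vae.dtype)                   # torch.float16
```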

vladmandic avatar Sep 13 '24 13:09 vladmandic

(attached: cached config OFF.log, cached config ON.log)

The issue still persists. I've attached screenshots of the webui generation info plus screenshots of Task Manager during each run. Cached config uses significantly more VRAM and starts using shared memory.

(screenshots: cached config on, cached config on webui info)

I used a fresh instance of sdnext dev without extensions. I ran 2 generations and attached the logs, with --debug and the SD_VAE_DEBUG=true env variable set.

(screenshots: cached config OFF, cached config OFF webui info)

zaxwashere avatar Oct 01 '24 20:10 zaxwashere

i've reopened in case someone wants to take a shot at it. i consider this very low priority since it's not reproducible AND the workaround is well known.

vladmandic avatar Oct 01 '24 21:10 vladmandic

my vram spiked twice and crashed my system; just saying he's not the only one. when i tested the same model in invoke, the system stayed stable.

tampadesignr avatar Oct 04 '24 00:10 tampadesignr

> my vram spiked twice and crashed my system; just saying he's not the only one. when i tested the same model in invoke, the system stayed stable.

general statements without logs or any info on platform or settings are not helpful.

vladmandic avatar Oct 04 '24 00:10 vladmandic

https://github.com/vladmandic/automatic/discussions/3471 - couldn't find any info on any of those questions, and i feel like the answers to those hold some info related to this. answer those questions in detail and we'll come back to this.

tampadesignr avatar Oct 04 '24 01:10 tampadesignr

> #3471 - couldn't find any info on any of those questions, and i feel like the answers to those hold some info related to this. answer those questions in detail and we'll come back to this.

that item is not related at all.

vladmandic avatar Oct 04 '24 01:10 vladmandic

there is an issue with how your system is handling diffusers.

tampadesignr avatar Oct 04 '24 01:10 tampadesignr

> there is an issue with how your system is handling diffusers.

maybe there is. create an issue and document it. do not post random comments on completely unrelated issues.

vladmandic avatar Oct 04 '24 01:10 vladmandic

I have the same issue when generating images with SDXL. Could you please add an argument like --cpu-vae to make only the VAE run on the CPU, in order to prevent the disappointing OOM at the end of generation? I think this would solve the problem completely, despite the slower decode. It might be a rescue for my machine, which has only 4 GB of VRAM. 😂
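Something along these lines already works as a manual workaround with plain diffusers (just a sketch assuming access to the loaded pipeline as `pipe`; I don't know whether SD.Next exposes an equivalent option):

```python
import torch

# denoise on the GPU but return latents instead of decoded images
out = pipe(prompt="test prompt", num_inference_steps=10, output_type="latent")
latents = out.images

# move only the VAE to the CPU and decode there (fp32 on CPU is the safe choice)
pipe.vae.to("cpu", dtype=torch.float32)
with torch.no_grad():
    latents = latents.to("cpu", dtype=torch.float32) / pipe.vae.config.scaling_factor
    decoded = pipe.vae.decode(latents).sample
image = pipe.image_processor.postprocess(decoded, output_type="pil")[0]
```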

xhy2008 avatar Feb 08 '25 05:02 xhy2008

this should no longer be an issue with improvements to balanced offloading. if the issue persists, please run with --monitor 3 and upload the full log here and i'll reopen.
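for reference, the stock diffusers offload APIs look like this (a sketch only; sdnext's balanced offload is its own implementation, this just shows the generic alternatives and uses a placeholder model id):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model id
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()         # move whole sub-models to the GPU only while they run
# pipe.enable_sequential_cpu_offload()  # even lower VRAM, much slower (layer-by-layer)
image = pipe("test prompt", num_inference_steps=10).images[0]
```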

vladmandic avatar Nov 04 '25 15:11 vladmandic

> this should no longer be an issue with improvements to balanced offloading. if the issue persists, please run with --monitor 3 and upload the full log here and i'll reopen.

was i wrong, btw? i was trying to help. i am a fine artist and i was starting out back then (sometimes an extra set of eyes helps). i know a lot more now; you can remove the downvote :) ( ^​_^)o自自o(^_​^ )

tampadesignr avatar Nov 05 '25 03:11 tampadesignr