[Bug] Wan2.2 T2V run fails with Vulkan

Open wszgrcy opened this issue 1 month ago • 9 comments

Git commit

8f6c5c217b1f6f27a8aa5fb78d3390fa849fc96a version https://github.com/leejet/stable-diffusion.cpp/releases/tag/master-348-8f6c5c2

Operating System & Version

windows 10 22h2 19045.4717

GGML backends

Vulkan

Command-line arguments used

./sd.exe -M vid_gen --diffusion-model ./Wan2.2-TI2V-5B-Q8_0.gguf --vae ./wan2.2_vae.safetensors --t5xxl ./umt5-xxl-encoder-Q8_0.gguf -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --offload-to-cpu --video-frames 33 --flow-shift 3.0 -v

Steps to reproduce

  • run ./sd.exe -M vid_gen --diffusion-model ./Wan2.2-TI2V-5B-Q8_0.gguf --vae ./wan2.2_vae.safetensors --t5xxl ./umt5-xxl-encoder-Q8_0.gguf -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --offload-to-cpu --video-frames 33 --flow-shift 3.0 -v

The command fails with these errors:

[ERROR] ggml_extend.hpp:75 - ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 21773256964
[ERROR] ggml_extend.hpp:1588 - wan_vae: failed to allocate the compute buffer

What you expected to happen

The video is generated successfully.

What actually happened

The VAE compute buffer allocation fails and the process ends with a segmentation fault (see the log below).

Logs / error messages / stack trace

➜  vulkan ./sd.exe -M vid_gen --diffusion-model ./Wan2.2-TI2V-5B-Q8_0.gguf --vae ./wan2.2_vae.safetensors --t5xxl ./umt5-xxl-encoder-Q8_0.gguf -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --offload-to-cpu --video-frames 33 --flow-shift 3.0 -v
Option:
    n_threads:                         12
    mode:                              vid_gen
    model_path:
    wtype:                             unspecified
    clip_l_path:
    clip_g_path:
    clip_vision_path:
    t5xxl_path:                        ./umt5-xxl-encoder-Q8_0.gguf
    qwen2vl_path:
    qwen2vl_vision_path:
    diffusion_model_path:              ./Wan2.2-TI2V-5B-Q8_0.gguf
    high_noise_diffusion_model_path:
    vae_path:                          ./wan2.2_vae.safetensors
    taesd_path:
    esrgan_path:
    control_net_path:
    embedding_dir:
    photo_maker_path:
    pm_id_images_dir:
    pm_id_embed_path:
    pm_style_strength:                 20.00
    output_path:                       output.png
    init_image_path:
    end_image_path:
    mask_image_path:
    control_image_path:
    ref_images_paths:
    control_video_path:
    auto_resize_ref_image:             true
    increase_ref_index:                false
    offload_params_to_cpu:             true
    clip_on_cpu:                       false
    control_net_cpu:                   false
    vae_on_cpu:                        false
    diffusion flash attention:         true
    diffusion Conv2d direct:           false
    vae_conv_direct:                   false
    control_strength:                  0.90
    prompt:                            a lovely cat
    negative_prompt:                   色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走
    clip_skip:                         -1
    width:                             480
    height:                            832
    sample_params:                     (txt_cfg: 6.00, img_cfg: 6.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0)
    high_noise_sample_params:          (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: -1, eta: 0.00, shifted_timestep: 0)
    moe_boundary:                      0.875
    prediction:                        default
    flow_shift:                        3.00
    strength(img2img):                 0.75
    rng:                               cuda
    seed:                              42
    batch_count:                       1
    vae_tiling:                        false
    force_sdxl_vae_conv_scale:         false
    upscale_repeats:                   1
    chroma_use_dit_mask:               true
    chroma_use_t5_mask:                false
    chroma_t5_mask_pad:                1
    video_frames:                      33
    vace_strength:                     1.00
    fps:                               16
System Info:
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:147  - Using Vulkan backend
[DEBUG] ggml_extend.hpp:66   - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:66   - ggml_vulkan: 0 = AMD Radeon RX 7900 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] stable-diffusion.cpp:203  - loading diffusion model from './Wan2.2-TI2V-5B-Q8_0.gguf'
[INFO ] model.cpp:1001 - load ./Wan2.2-TI2V-5B-Q8_0.gguf using gguf format
[DEBUG] model.cpp:1018 - init from './Wan2.2-TI2V-5B-Q8_0.gguf'
[ERROR] ggml_extend.hpp:75   - gguf_init_from_file_impl: tensor 'patch_embedding.weight' has invalid number of dimensions: 5 > 4
[ERROR] ggml_extend.hpp:75   - gguf_init_from_file_impl: failed to read tensor info
[ERROR] model.cpp:1027 - failed to open './Wan2.2-TI2V-5B-Q8_0.gguf' with gguf_init_from_file. Try to open it with GGUFReader.
[DEBUG] gguf_reader.hpp:198  - GGUF v3, tensor_count=825, metadata_kv_count=3
[DEBUG] model.cpp:1739 - patch_embedding_channels 147456
[INFO ] stable-diffusion.cpp:243  - loading t5xxl from './umt5-xxl-encoder-Q8_0.gguf'
[INFO ] model.cpp:1001 - load ./umt5-xxl-encoder-Q8_0.gguf using gguf format
[DEBUG] model.cpp:1018 - init from './umt5-xxl-encoder-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:264  - loading vae from './wan2.2_vae.safetensors'
[INFO ] model.cpp:1004 - load ./wan2.2_vae.safetensors using safetensors format
[DEBUG] model.cpp:1109 - init from './wan2.2_vae.safetensors', prefix = 'vae.'
[DEBUG] model.cpp:1739 - patch_embedding_channels 147456
[INFO ] stable-diffusion.cpp:285  - Version: Wan 2.2 TI2V
[INFO ] stable-diffusion.cpp:312  - Weight type stat:                      f32: 74   |     f16: 720  |    q8_0: 469
[INFO ] stable-diffusion.cpp:313  - Conditioner weight type stat:          f32: 73   |    q8_0: 169
[INFO ] stable-diffusion.cpp:314  - Diffusion model weight type stat:      f32: 1    |     f16: 524  |    q8_0: 300
[INFO ] stable-diffusion.cpp:315  - VAE weight type stat:                  f16: 196
[DEBUG] stable-diffusion.cpp:317  - ggml tensor size = 400 bytes
[INFO ] wan.hpp:2123 - Wan2.2-TI2V-5B
[INFO ] stable-diffusion.cpp:451  - Using flash attention in the diffusion model
[DEBUG] ggml_extend.hpp:1783 - t5 params backend buffer size =  5757.05 MB(RAM) (242 tensors)
[DEBUG] ggml_extend.hpp:1783 - Wan2.2-TI2V-5B params backend buffer size =  5153.43 MB(RAM) (825 tensors)
[DEBUG] ggml_extend.hpp:1783 - wan_vae params backend buffer size =  1344.24 MB(RAM) (196 tensors)
[DEBUG] stable-diffusion.cpp:592  - loading weights
[DEBUG] model.cpp:1920 - using 12 threads for model loading
[DEBUG] model.cpp:1942 - loading tensors from ./Wan2.2-TI2V-5B-Q8_0.gguf
  |================================>                 | 825/1263 - 428.57it/s
[DEBUG] model.cpp:1942 - loading tensors from ./umt5-xxl-encoder-Q8_0.gguf
  |==========================================>       | 1067/1263 - 219.91it/s
[DEBUG] model.cpp:1942 - loading tensors from ./wan2.2_vae.safetensors
  |==================================================| 1263/1263 - 230.52it/s
[INFO ] model.cpp:2151 - loading tensors completed, taking 5.48s (process: 0.00s, read: 4.12s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:690  - total params memory size = 12254.72MB (VRAM 12254.72MB, RAM 0.00MB): text_encoders 5757.05MB(VRAM), diffusion_model 5153.43MB(VRAM), vae 1344.24MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:769  - running in FLOW mode
[DEBUG] stable-diffusion.cpp:801  - finished loaded file
[INFO ] stable-diffusion.cpp:2745 - generate_video 480x832x33
[INFO ] stable-diffusion.cpp:947  - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:967  - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:968  - prompt after extract and remove lora: "a lovely cat"
[DEBUG] conditioner.hpp:1415 - parse 'a lovely cat' to [['a lovely cat', 1], ]
[DEBUG] t5.hpp:402  - token length: 512
[INFO ] ggml_extend.hpp:1698 - t5 offload params (5757.05 MB, 242 tensors) to runtime backend (Vulkan0), taking 1.35s
[DEBUG] ggml_extend.hpp:1598 - t5 compute buffer size: 297.00 MB(VRAM)
[DEBUG] conditioner.hpp:1515 - computing condition graph completed, taking 1728 ms
[DEBUG] conditioner.hpp:1415 - parse '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走' to [['色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走', 1], ]
[DEBUG] t5.hpp:402  - token length: 512
[INFO ] ggml_extend.hpp:1698 - t5 offload params (5757.05 MB, 242 tensors) to runtime backend (Vulkan0), taking 1.28s
[DEBUG] ggml_extend.hpp:1598 - t5 compute buffer size: 297.00 MB(VRAM)
[DEBUG] conditioner.hpp:1515 - computing condition graph completed, taking 1667 ms
[INFO ] stable-diffusion.cpp:2999 - get_learned_condition completed, taking 3412 ms
[DEBUG] stable-diffusion.cpp:3055 - sample 30x52x9
[INFO ] ggml_extend.hpp:1698 - Wan2.2-TI2V-5B offload params (5153.43 MB, 825 tensors) to runtime backend (Vulkan0), taking 1.69s
[DEBUG] ggml_extend.hpp:1598 - Wan2.2-TI2V-5B compute buffer size: 335.35 MB(VRAM)
  |==================================================| 20/20 - 4.17s/it
[INFO ] stable-diffusion.cpp:3082 - sampling completed, taking 83.52s
[INFO ] stable-diffusion.cpp:3103 - generating latent video completed, taking 84.18s
[INFO ] ggml_extend.hpp:1698 - wan_vae offload params (1344.24 MB, 196 tensors) to runtime backend (Vulkan0), taking 0.23s
ggml_vulkan: Device memory allocation of size 2760376320 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp:75   - ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 21773256964
[ERROR] ggml_extend.hpp:1588 - wan_vae: failed to allocate the compute buffer
[1]    1506 segmentation fault  ./sd.exe -M vid_gen --diffusion-model ./Wan2.2-TI2V-5B-Q8_0.gguf --vae    -p

Additional context / environment details

CPU: 5900X; GPU: 7900 XT (20 GB VRAM, 32 GB shared memory)

wszgrcy avatar Nov 06 '25 01:11 wszgrcy

The VAE process is really that VRAM heavy. Try changing the resolution to -W 360 -H 640 and see if you get a valid result.

Check out #868 for additional context.

MrSnichovitch avatar Nov 06 '25 03:11 MrSnichovitch

Thank you for your reply. Further testing shows that -W 280 -H 280 produces output successfully. However, at that point the GPU is using only about 8 GB of video memory and video memory is not exhausted, so what is the problem?

Below is the memory usage with -W 300 -H 300: video memory does not overflow, yet the run fails immediately. [Image]

➜  vulkan ./sd.exe -M vid_gen --diffusion-model ./Wan2.2-TI2V-5B-Q8_0.gguf --vae ./wan2.2_vae.safetensors --t5xxl ./umt5-xxl-encoder-Q8_0.gguf -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 300 -H 300 --diffusion-fa --offload-to-cpu --video-frames 33 --flow-shift 3.0 -v
Option:
    n_threads:                         12
    mode:                              vid_gen
    model_path:
    wtype:                             unspecified
    clip_l_path:
    clip_g_path:
    clip_vision_path:
    t5xxl_path:                        ./umt5-xxl-encoder-Q8_0.gguf
    qwen2vl_path:
    qwen2vl_vision_path:
    diffusion_model_path:              ./Wan2.2-TI2V-5B-Q8_0.gguf
    high_noise_diffusion_model_path:
    vae_path:                          ./wan2.2_vae.safetensors
    taesd_path:
    esrgan_path:
    control_net_path:
    embedding_dir:
    photo_maker_path:
    pm_id_images_dir:
    pm_id_embed_path:
    pm_style_strength:                 20.00
    output_path:                       output.png
    init_image_path:
    end_image_path:
    mask_image_path:
    control_image_path:
    ref_images_paths:
    control_video_path:
    auto_resize_ref_image:             true
    increase_ref_index:                false
    offload_params_to_cpu:             true
    clip_on_cpu:                       false
    control_net_cpu:                   false
    vae_on_cpu:                        false
    diffusion flash attention:         true
    diffusion Conv2d direct:           false
    vae_conv_direct:                   false
    control_strength:                  0.90
    prompt:                            a lovely cat
    negative_prompt:                   色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走
    clip_skip:                         -1
    width:                             300
    height:                            300
    sample_params:                     (txt_cfg: 6.00, img_cfg: 6.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0)
    high_noise_sample_params:          (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: -1, eta: 0.00, shifted_timestep: 0)
    moe_boundary:                      0.875
    prediction:                        default
    flow_shift:                        3.00
    strength(img2img):                 0.75
    rng:                               cuda
    seed:                              42
    batch_count:                       1
    vae_tiling:                        false
    force_sdxl_vae_conv_scale:         false
    upscale_repeats:                   1
    chroma_use_dit_mask:               true
    chroma_use_t5_mask:                false
    chroma_t5_mask_pad:                1
    video_frames:                      33
    vace_strength:                     1.00
    fps:                               16
System Info:
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:147  - Using Vulkan backend
[DEBUG] ggml_extend.hpp:66   - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:66   - ggml_vulkan: 0 = AMD Radeon RX 7900 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] stable-diffusion.cpp:203  - loading diffusion model from './Wan2.2-TI2V-5B-Q8_0.gguf'
[INFO ] model.cpp:1001 - load ./Wan2.2-TI2V-5B-Q8_0.gguf using gguf format
[DEBUG] model.cpp:1018 - init from './Wan2.2-TI2V-5B-Q8_0.gguf'
[ERROR] ggml_extend.hpp:75   - gguf_init_from_file_impl: tensor 'patch_embedding.weight' has invalid number of dimensions: 5 > 4
[ERROR] ggml_extend.hpp:75   - gguf_init_from_file_impl: failed to read tensor info
[ERROR] model.cpp:1027 - failed to open './Wan2.2-TI2V-5B-Q8_0.gguf' with gguf_init_from_file. Try to open it with GGUFReader.
[DEBUG] gguf_reader.hpp:198  - GGUF v3, tensor_count=825, metadata_kv_count=3
[DEBUG] model.cpp:1739 - patch_embedding_channels 147456
[INFO ] stable-diffusion.cpp:243  - loading t5xxl from './umt5-xxl-encoder-Q8_0.gguf'
[INFO ] model.cpp:1001 - load ./umt5-xxl-encoder-Q8_0.gguf using gguf format
[DEBUG] model.cpp:1018 - init from './umt5-xxl-encoder-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:264  - loading vae from './wan2.2_vae.safetensors'
[INFO ] model.cpp:1004 - load ./wan2.2_vae.safetensors using safetensors format
[DEBUG] model.cpp:1109 - init from './wan2.2_vae.safetensors', prefix = 'vae.'
[DEBUG] model.cpp:1739 - patch_embedding_channels 147456
[INFO ] stable-diffusion.cpp:285  - Version: Wan 2.2 TI2V
[INFO ] stable-diffusion.cpp:312  - Weight type stat:                      f32: 74   |     f16: 720  |    q8_0: 469
[INFO ] stable-diffusion.cpp:313  - Conditioner weight type stat:          f32: 73   |    q8_0: 169
[INFO ] stable-diffusion.cpp:314  - Diffusion model weight type stat:      f32: 1    |     f16: 524  |    q8_0: 300
[INFO ] stable-diffusion.cpp:315  - VAE weight type stat:                  f16: 196
[DEBUG] stable-diffusion.cpp:317  - ggml tensor size = 400 bytes
[INFO ] wan.hpp:2123 - Wan2.2-TI2V-5B
[INFO ] stable-diffusion.cpp:451  - Using flash attention in the diffusion model
[DEBUG] ggml_extend.hpp:1783 - t5 params backend buffer size =  5757.05 MB(RAM) (242 tensors)
[DEBUG] ggml_extend.hpp:1783 - Wan2.2-TI2V-5B params backend buffer size =  5153.43 MB(RAM) (825 tensors)
[DEBUG] ggml_extend.hpp:1783 - wan_vae params backend buffer size =  1344.24 MB(RAM) (196 tensors)
[DEBUG] stable-diffusion.cpp:592  - loading weights
[DEBUG] model.cpp:1920 - using 12 threads for model loading
[DEBUG] model.cpp:1942 - loading tensors from ./Wan2.2-TI2V-5B-Q8_0.gguf
  |================================>                 | 825/1263 - 423.29it/s
[DEBUG] model.cpp:1942 - loading tensors from ./umt5-xxl-encoder-Q8_0.gguf
  |==========================================>       | 1067/1263 - 219.05it/s
[DEBUG] model.cpp:1942 - loading tensors from ./wan2.2_vae.safetensors
  |==================================================| 1263/1263 - 229.59it/s
[INFO ] model.cpp:2151 - loading tensors completed, taking 5.50s (process: 0.00s, read: 4.15s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:690  - total params memory size = 12254.72MB (VRAM 12254.72MB, RAM 0.00MB): text_encoders 5757.05MB(VRAM), diffusion_model 5153.43MB(VRAM), vae 1344.24MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:769  - running in FLOW mode
[DEBUG] stable-diffusion.cpp:801  - finished loaded file
[INFO ] stable-diffusion.cpp:2745 - generate_video 300x300x33
[INFO ] stable-diffusion.cpp:947  - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:967  - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:968  - prompt after extract and remove lora: "a lovely cat"
[DEBUG] conditioner.hpp:1415 - parse 'a lovely cat' to [['a lovely cat', 1], ]
[DEBUG] t5.hpp:402  - token length: 512
[INFO ] ggml_extend.hpp:1698 - t5 offload params (5757.05 MB, 242 tensors) to runtime backend (Vulkan0), taking 1.10s
[DEBUG] ggml_extend.hpp:1598 - t5 compute buffer size: 297.00 MB(VRAM)
[DEBUG] conditioner.hpp:1515 - computing condition graph completed, taking 1488 ms
[DEBUG] conditioner.hpp:1415 - parse '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走' to [['色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作 品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG 压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走', 1], ]
[DEBUG] t5.hpp:402  - token length: 512
[INFO ] ggml_extend.hpp:1698 - t5 offload params (5757.05 MB, 242 tensors) to runtime backend (Vulkan0), taking 1.11s
[DEBUG] ggml_extend.hpp:1598 - t5 compute buffer size: 297.00 MB(VRAM)
[DEBUG] conditioner.hpp:1515 - computing condition graph completed, taking 1494 ms
[INFO ] stable-diffusion.cpp:2999 - get_learned_condition completed, taking 2999 ms
[DEBUG] stable-diffusion.cpp:3055 - sample 18x18x9
[INFO ] ggml_extend.hpp:1698 - Wan2.2-TI2V-5B offload params (5153.43 MB, 825 tensors) to runtime backend (Vulkan0), taking 1.44s
[DEBUG] ggml_extend.hpp:1598 - Wan2.2-TI2V-5B compute buffer size: 85.18 MB(VRAM)
  |==================================================| 20/20 - 1.47it/s
[INFO ] stable-diffusion.cpp:3082 - sampling completed, taking 13.71s
[INFO ] stable-diffusion.cpp:3103 - generating latent video completed, taking 14.13s
[INFO ] ggml_extend.hpp:1698 - wan_vae offload params (1344.24 MB, 196 tensors) to runtime backend (Vulkan0), taking 0.23s
ggml_vulkan: Device memory allocation of size 2293235712 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp:75   - ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 6806706052
[ERROR] ggml_extend.hpp:1588 - wan_vae: failed to allocate the compute buffer
[1]    1996 segmentation fault  ./sd.exe -M vid_gen --diffusion-model ./Wan2.2-TI2V-5B-Q8_0.gguf --vae    -p
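
A quick arithmetic check on the numbers in the two logs in this thread may help frame the question. The sketch below only converts the byte counts printed by ggml_vulkan and ggml_gallocr_reserve_n into GiB and compares the failed single allocation with a 2 GiB threshold. That threshold is an assumption chosen for illustration and is not reported anywhere in these logs; the real per-allocation limit of the driver would have to be read from the device properties (for example `maxMemoryAllocationSize` in the full `vulkaninfo` output).

```python
GIB = 1024 ** 3

# Byte counts copied from the two failing runs logged in this thread.
runs = {
    "480x832": {"failed_alloc": 2760376320, "compute_buffer": 21773256964},
    "300x300": {"failed_alloc": 2293235712, "compute_buffer": 6806706052},
}

# Hypothetical per-allocation limit, used only for comparison (an assumption,
# not a value taken from the logs or from the driver).
ASSUMED_LIMIT = 2 * GIB


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def summarize(runs, limit):
    """Return one row per run with sizes in GiB and a limit comparison."""
    rows = []
    for name, sizes in runs.items():
        rows.append({
            "run": name,
            "failed_alloc_gib": round(to_gib(sizes["failed_alloc"]), 2),
            "compute_buffer_gib": round(to_gib(sizes["compute_buffer"]), 2),
            "alloc_over_limit": sizes["failed_alloc"] > limit,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(runs, ASSUMED_LIMIT):
        print(row)
```

In both runs the single allocation that failed is larger than 2 GiB, and the total compute buffer requested for 300x300 (about 6.3 GiB) is well below the 20 GB of VRAM on the card. A per-allocation limit rather than total VRAM would be one possible reading of the "exceeds device buffer size limit" message, but that remains unverified here.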

wszgrcy avatar Nov 06 '25 03:11 wszgrcy

The latest test shows that a size of 287 works, while 288 fails. At 287, peak video memory usage is only about 8 GB (out of 20 GB total), yet at 288 the run fails immediately.
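
The 287/288 boundary lines up with the latent sizes printed in the logs. The debug lines show `sample 30x52x9` for 480x832x33 and `sample 18x18x9` for 300x300x33, which is consistent with dividing width and height by 16 (rounding down) and mapping the frame count to (frames - 1) / 4 + 1. That mapping is inferred from these two log lines only, not taken from the stable-diffusion.cpp source, so the helper below (`latent_shape`, a hypothetical name) is a sketch under that assumption.

```python
def latent_shape(width, height, frames):
    """Latent grid as inferred from the logs (an assumption):
    spatial dims are floor-divided by 16, frames map to (frames - 1) // 4 + 1."""
    return (width // 16, height // 16, (frames - 1) // 4 + 1)


if __name__ == "__main__":
    # Sizes mentioned in this thread, all with --video-frames 33.
    for w, h in [(480, 832), (300, 300), (287, 287), (288, 288), (320, 320)]:
        print((w, h), latent_shape(w, h, 33))
```

Under this assumption, 280 and 287 both map to a 17x17x9 latent, while 288 maps to 18x18x9, the same grid as the failing 300x300 run. The failure threshold would then coincide with the latent grid growing by one cell in each spatial dimension, which could explain why there is a sharp cutoff instead of a gradual rise in VRAM usage.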

wszgrcy avatar Nov 06 '25 03:11 wszgrcy

I'm afraid I misread parts of your first post and didn't realize you were using the Vulkan backend on Windows. Definitely a mistake on my part; my apologies!

I'm currently running an older release version on Linux -- master-330-db6f479 -- and when testing with the Vulkan backend, I get the same type of failure at -W 360 -H 640. However, when running at -W 288 -H 512, I get viable output. It doesn't look great (at least without a LoRA added), but generation completes successfully. That is one reason why I switched WAN runs to the ROCm backend on my RX 7600 XT. At the very least, I can state that WAN2.2 TI2V 5B has always been weak with Vulkan. From what I understand, it has to do with Vulkan's im2col_3d functionality lagging behind CUDA.

I don't have a Windows machine to test with, but I will pull down the latest release and check whether this is related to a Linux vs. Windows Vulkan driver difference (Linux drivers can be much more current) or to a regression in the sd code.

MrSnichovitch avatar Nov 06 '25 06:11 MrSnichovitch

Okay... So I've pulled down master-348-8f6c5c2 and compiled the Vulkan version. Testing results on my end are the same as they were with master-330: OOM failure at the VAE stage with -W 360 -H 640, but successful generation at -W 288 -H 512. The VAE stage is peaking at 13.8 GiB of VRAM usage during processing.

I'm currently running Vulkan Instance Version: 1.4.321 on this system (Manjaro Linux), which is one release behind the latest. You may want to check and see if your Vulkan SDK and/or Vulkan Runtime need to be updated, as this could be a Windows driver or library problem. Check for AMD GPU driver updates for the Runtime first before downloading the package from LunarG. Other than that, you may want to consider compiling a ROCm version of sd for use with WAN... It really does seem to run better than Vulkan.

For reference, here's a summary run of vulkaninfo on my system:

vulkaninfo --summary 

==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.321


Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_headless_surface                : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 5
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 2
--------------------------
VK_LAYER_KHRONOS_validation Khronos Validation Layer     1.4.321  version 1
VK_LAYER_MESA_device_select Linux device selection layer 1.4.303  version 1

Devices:
========
GPU0:
	apiVersion         = 1.4.318
	driverVersion      = 25.2.3
	vendorID           = 0x1002
	deviceID           = 0x7480
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = AMD Radeon RX 7600 XT (RADV NAVI33)
	driverID           = DRIVER_ID_MESA_RADV
	driverName         = radv
	driverInfo         = Mesa 25.2.3-arch1.2
	conformanceVersion = 1.4.0.0
	deviceUUID         = 00000000-2d00-0000-0000-000000000000
	driverUUID         = 414d442d-4d45-5341-2d44-525600000000
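
To compare driver setups between machines, the device block of a `vulkaninfo --summary` dump can be reduced to a few fields. The sketch below parses `key = value` lines like the ones in the summary above; the function name `parse_device_fields` is a hypothetical helper, and the sample text is a subset of that summary.

```python
import re

# Subset of the device block from the vulkaninfo summary in this thread.
SAMPLE = """\
GPU0:
    apiVersion         = 1.4.318
    driverVersion      = 25.2.3
    deviceName         = AMD Radeon RX 7600 XT (RADV NAVI33)
    driverName         = radv
    driverInfo         = Mesa 25.2.3-arch1.2
"""

# Matches lines of the form "<name> = <value>", ignoring surrounding whitespace.
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_device_fields(text):
    """Collect 'key = value' pairs from a vulkaninfo device block into a dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    info = parse_device_fields(SAMPLE)
    for key in ("deviceName", "driverName", "driverInfo", "apiVersion"):
        print(key, "=", info.get(key))
```

Running the same helper on the output from a second machine gives two dictionaries that can be compared field by field, for example to spot a different `driverName` or `apiVersion`.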

MrSnichovitch avatar Nov 06 '25 08:11 MrSnichovitch

Thank you for your reply. I am using the prebuilt release from the repository and have not compiled it myself.

vulkaninfo

WARNING: [Loader Message] Code 0 : windows_read_data_files_in_registry: Registry lookup failed to get layer manifest files.
WARNING: [Loader Message] Code 0 : Layer VK_LAYER_OBS_HOOK uses API version 1.3 which is older than the application specified API version of 1.4. May cause issues.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.313


Instance Extensions: count = 13
-------------------------------
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_swapchain_colorspace            : extension revision 5
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_win32_surface                   : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 5
--------------------------
VK_LAYER_AMD_switchable_graphics AMD switchable graphics layer                 1.4.325  version 1
VK_LAYER_EOS_Overlay             Vulkan overlay layer for Epic Online Services 1.2.136  version 1
VK_LAYER_OBS_HOOK                Open Broadcaster Software hook                1.3.216  version 1
VK_LAYER_VALVE_steam_fossilize   Steam Pipeline Caching Layer                  1.4.303  version 1
VK_LAYER_VALVE_steam_overlay     Steam Overlay Layer                           1.3.207  version 1

Devices:
========
GPU0:
        apiVersion         = 1.4.325
        driverVersion      = 2.0.364
        vendorID           = 0x1002
        deviceID           = 0x744c
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = AMD Radeon RX 7900 XT
        driverID           = DRIVER_ID_AMD_PROPRIETARY
        driverName         = AMD proprietary driver
        driverInfo         = 25.10.2 (LLPC)
        conformanceVersion = 1.4.0.0
        deviceUUID         = 00000000-2d00-0000-0000-000000000000
        driverUUID         = 414d442d-5749-4e2d-4452-560000000000

AMD driver version: 25.10.2

wszgrcy avatar Nov 06 '25 08:11 wszgrcy

I recently saw issues with CUDA when the dims are not (floored to) multiples of 32.
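
As a quick way to try this, a requested size can be floored to a multiple of 32 before it is passed to -W and -H. The helper below (`floor_to_multiple`, a hypothetical name) is a small convenience sketch and is not part of sd.exe.

```python
def floor_to_multiple(value, multiple=32):
    """Round value down to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return (value // multiple) * multiple


if __name__ == "__main__":
    # Sizes that appear in this thread.
    for size in (300, 287, 480, 832):
        print(size, "->", floor_to_multiple(size))
```

The 480 and 832 values from the original command are already multiples of 32, so flooring leaves them unchanged; 300 would become 288.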

Green-Sky avatar Nov 06 '25 09:11 Green-Sky

Thank you for your reply. I just tested 320 * 320 and it still fails with the same error.

wszgrcy avatar Nov 06 '25 10:11 wszgrcy

I'm certain this is a bug in the Windows Vulkan version. I'm using the latest ROCm build, from commit 59ebdf0bb5b3a6c83d92ca90fd820707fb154e9d, which can output a 640*480 video normally with about 20 GB of VRAM usage. With the Vulkan version, however, VRAM usage only reaches around 8 GB before the run fails.

wszgrcy avatar Nov 12 '25 01:11 wszgrcy