stable-diffusion.cpp

[Bug] regression with sd1.5 model & specific LoRAs

Open · rene-descartes2021 opened this issue 2 weeks ago • 1 comment

Git commit

$ git rev-parse HEAD
bfbb9297900a0ba34d651337455baf6553c20d4d

I verified that this specific commit introduces the crash/abort; the commit before it does not.
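
For reference, roughly how I checked that (a sketch; the rebuild uses the same script shown under "Steps to reproduce", and the run command is the one listed below):

# At the parent of the suspect commit the run completes; at the commit itself it aborts.
git checkout bfbb9297900a0ba34d651337455baf6553c20d4d^ && git submodule update
cmake --build build --config Release
sd -W 512 -H 512 -p "<lora:SDXL:0.6> a pony" -m realcartoonRealistic_v17.safetensors --lora-model-dir ~/x/LoRAs/sd1.5/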

Operating System & Version

Android/Termux kernel 4.14.276-g6ef255005cea-ab9062920

GGML backends

CPU

Command-line arguments used

sd -W 512 -H 512 -p "<lora:SDXL:0.6> a pony" -m realcartoonRealistic_v17.safetensors --lora-model-dir ~/x/LoRAs/sd1.5/

Steps to reproduce

It seems to depend on the LoRA: the SDXL LoRA crashes, while the PCM and TCD speedup LoRAs don't. Other LoRAs I tried are hit or miss.

Here is a link to the SDXL LoRA. Here is a link to the model, an fp16 pruned sd1.5 checkpoint (version V17 from that page, the 1.99 GB one).

Here is my compile script. I used OpenBLAS with GGML; altogether, my adjustments seem about 5% quicker in my case.

~~Also, for some reason -mcpu=native plus its add-ons doesn't compile with the f16 extension (as seen with clang -mcpu=native+blah+no-blah --print-enabled-extensions), so I explicitly specify cortex-a75 via flags and patched their logic a bit; I was going to post a bug and patch to GGML eventually.~~ EDIT: I couldn't reproduce this again and am not sure how or why I first observed it. It looks OK now: diff <(clang -mcpu=native+dotprod+noi8mm+nosve+nosme --print-enabled-extensions /dev/null) <(clang -mcpu=cortex-a75 --print-enabled-extensions /dev/null). No idea what I did wrong originally.

So it might be one of my adjustments; I'll try a stock build without OpenBLAS or the other tweaks in a bit (a sketch of that follows the script below).

# Explicit -mcpu (see note above); fast-math, but keep Inf/NaN handling via -fno-finite-math-only.
myflags="-mcpu=cortex-a75 -ffast-math -fno-finite-math-only"
git pull --ff-only #|| exit
git submodule update #|| exit
mkdir -p build || exit
cd build || exit
# Configure with OpenBLAS through GGML's BLAS backend, passing the flags above.
cmake .. -DCMAKE_BUILD_TYPE=Release -DGGML_OPENBLAS=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_C_FLAGS="$myflags" -DCMAKE_CXX_FLAGS="$myflags" || exit
cmake --build . --config Release
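
For comparison, a stock rebuild without OpenBLAS and without my extra flags would look roughly like this (a sketch using only default CMake options, in a separate build-stock directory):

mkdir -p build-stock || exit
cd build-stock || exit
cmake .. -DCMAKE_BUILD_TYPE=Release || exit
cmake --build . --config Release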

What you expected to happen

Generation works as it did before.

What actually happened

Crash on a failed GGML_ASSERT after loading the LoRA:

[DEBUG] lora.hpp:93   - finished loaded lora
/data/data/com.termux/files/home/dev/llm/sd/stable-diffusion.cpp/ggml_extend.hpp:1389: GGML_ASSERT(tensor->type == GGML_TYPE_F32 || tensor->type == GGML_TYPE_F16 || tensor->type == GGML_TYPE_I32) failed
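
Since the assert is about the tensor type, here is a quick way to list which dtypes the LoRA's safetensors header declares (a sketch; it assumes jq is installed and a little-endian host):

f=~/x/LoRAs/sd1.5/SDXL.safetensors
# A safetensors file starts with an 8-byte little-endian header length,
# followed by a JSON header that records each tensor's dtype.
len=$(od -An -tu8 -N8 "$f" | tr -d ' ')
tail -c +9 "$f" | head -c "$len" | jq -r '.[] | .dtype? // empty' | sort | uniq -c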

Logs / error messages / stack trace

Here is the full stdout and stderr of the above command with the '-v' parameter:

System Info: 
    SSE3 = 0 |     AVX = 0 |     AVX2 = 0 |     AVX512 = 0 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 0 |     NEON = 1 |     ARM_FMA = 1 |     F16C = 0 |     FP16_VA = 1 |     WASM_SIMD = 0 |     VSX = 0 |
SDCliParams {
  mode: img_gen,
  output_path: "output.png",
  verbose: true,
  color: false,
  canny_preprocess: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false
}
SDContextParams {
  n_threads: 4,
  model_path: "realcartoonRealistic_v17.safetensors",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: "",
  llm_vision_path: "",
  diffusion_model_path: "",
  high_noise_diffusion_model_path: "",
  vae_path: "",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: "/data/data/com.termux/files/home/x/LoRAs/sd1.5/",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  flow_shift: INF
  offload_params_to_cpu: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  diffusion_flash_attn: false,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  chroma_use_dit_mask: true,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
  force_sdxl_vae_conv_scale: false
}
SDGenerationParams {
  prompt: "<lora:SDXL:0.6> a pony",
  negative_prompt: "",
  clip_skip: -1,
  width: 512,
  height: 512,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  easycache_option: "",
  easycache: disabled (threshold=1.75162e-43, start=0, end=0),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
}
[DEBUG] stable-diffusion.cpp:189  - Using CPU backend
[INFO ] stable-diffusion.cpp:227  - loading model from 'realcartoonRealistic_v17.safetensors'
[INFO ] model.cpp:373  - load realcartoonRealistic_v17.safetensors using safetensors format
[DEBUG] model.cpp:503  - init from 'realcartoonRealistic_v17.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:311  - Version: SD 1.x 
[INFO ] stable-diffusion.cpp:339  - Weight type stat:                      f16: 1130 
[INFO ] stable-diffusion.cpp:340  - Conditioner weight type stat:          f16: 196  
[INFO ] stable-diffusion.cpp:341  - Diffusion model weight type stat:      f16: 686  
[INFO ] stable-diffusion.cpp:342  - VAE weight type stat:                  f16: 248  
[DEBUG] stable-diffusion.cpp:344  - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1877 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1877 - unet params backend buffer size =  1640.25 MB(RAM) (686 tensors)
[DEBUG] ggml_extend.hpp:1877 - vae params backend buffer size =  94.47 MB(RAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:676  - loading weights
[DEBUG] model.cpp:1348 - using 4 threads for model loading
[DEBUG] model.cpp:1370 - loading tensors from realcartoonRealistic_v17.safetensors

  |>                                                 | 2/1130 - 2000.00it/s
  |=====>                                            | 124/1130 - 616.92it/s
  |============>                                     | 287/1130 - 715.71it/s
  |======================>                           | 499/1130 - 830.28it/s
  |======================>                           | 502/1130 - 625.94it/s
  |==========================>                       | 595/1130 - 593.81it/s
  |=============================>                    | 660/1130 - 549.08it/s
  |==============================>                   | 692/1130 - 493.58it/s
  |===============================>                  | 709/1130 - 442.57it/s
  |=================================>                | 748/1130 - 415.09it/s
  |=================================>                | 764/1130 - 381.62it/s
  |=====================================>            | 854/1130 - 387.83it/s
  |======================================>           | 866/1130 - 360.53it/s
  |=======================================>          | 884/1130 - 339.61it/s
  |========================================>         | 916/1130 - 326.79it/s
  |=========================================>        | 942/1130 - 313.69it/s
  |===========================================>      | 974/1130 - 304.09it/s
  |=============================================>    | 1028/1130 - 302.09it/s
  |==================================================| 1130/1130 - 313.54it/s
[INFO ] model.cpp:1577 - loading tensors completed, taking 3.61s (process: 0.00s, read: 3.57s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:703  - finished loaded file
[INFO ] stable-diffusion.cpp:775  - total params memory size = 1969.78MB (VRAM 0.00MB, RAM 1969.78MB): text_encoders 235.06MB(RAM), diffusion_model 1640.25MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:832  - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3139 - generate_image 512x512
[INFO ] stable-diffusion.cpp:3170 - sampling using Euler A method
[INFO ] denoiser.hpp:364  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3283 - TXT2IMG
[DEBUG] stable-diffusion.cpp:1134 - lora SDXL:0.60
[INFO ] stable-diffusion.cpp:969  - apply lora immediately
[INFO ] stable-diffusion.cpp:975  - attempting to apply 1 LoRAs
[INFO ] model.cpp:373  - load /data/data/com.termux/files/home/x/LoRAs/sd1.5/SDXL.safetensors using safetensors format
[DEBUG] model.cpp:503  - init from '/data/data/com.termux/files/home/x/LoRAs/sd1.5/SDXL.safetensors', prefix = 'lora.'
[INFO ] lora.hpp:40   - loading LoRA from '/data/data/com.termux/files/home/x/LoRAs/sd1.5/SDXL.safetensors'
[DEBUG] model.cpp:1348 - using 4 threads for model loading
[DEBUG] model.cpp:1370 - loading tensors from /data/data/com.termux/files/home/x/LoRAs/sd1.5/SDXL.safetensors

  |=======>                                          | 153/1050 - 38250.00it/s
  |==================================================| 1050/1050 - 5097.09it/s
[INFO ] model.cpp:1577 - loading tensors completed, taking 0.21s (process: 0.01s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1877 - lora params backend buffer size =  172.55 MB(RAM) (1050 tensors)
[DEBUG] model.cpp:1348 - using 4 threads for model loading
[DEBUG] model.cpp:1370 - loading tensors from /data/data/com.termux/files/home/x/LoRAs/sd1.5/SDXL.safetensors

  |=======================>                          | 490/1050 - 2437.81it/s
  |=============================================>    | 965/1050 - 2406.48it/s
  |==================================================| 1050/1050 - 1732.67it/s
[INFO ] model.cpp:1577 - loading tensors completed, taking 0.61s (process: 0.01s, read: 0.42s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] lora.hpp:93   - finished loaded lora
/data/data/com.termux/files/home/dev/llm/sd/stable-diffusion.cpp/ggml_extend.hpp:1389: GGML_ASSERT(tensor->type == GGML_TYPE_F32 || tensor->type == GGML_TYPE_F16 || tensor->type == GGML_TYPE_I32) failed
0: 0x55f10b7048 
1: 0x55f10b7004 
2: 0x55f10ca404 
3: 0x55f0f889bc 
4: 0x55f0f88138 
5: 0x55f0f8b6f4 
6: 0x55f0eefecc 
7: 0x55f0eefc88 
8: 0x55f0eef40c 
9: 0x55f0ebf130 
10: 0x55f0f83228 
11: 0x55f0ebedb8 
12: 0x55f0eaa444 
13: 0x55f0eaf158 
14: 0x55f0e23134 
15: 0x7d34cb00f8 __libc_init

Additional context / environment details

Everything is written above. Let me know if more details are needed.

rene-descartes2021 · Dec 11 '25 20:12

Reproduced on ROCm. I'll prepare a patch.

wbruna · Dec 12 '25 00:12