candle
Quantized Flux not working
Hi, I'm getting an error on my RTX 4000 Ada machine, which supports BF16 and runs the Stable Diffusion example just fine, but the quantized FLUX update fails. It happens with no model specified, and likewise with dev or schnell:
cargo run --features cuda,cudnn --example flux -r -- --height 1024 --width 1024 --prompt "a rusty robot walking on a beach holding a small torch, the robot has the word \"rust\" written on it, high quality, 4k" --model dev
error
Tensor[[1, 256], u32, cuda:0]
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading is_u32_bf16
any ideas?
This error is most likely not due to the model itself but rather to the CUDA setup. The bf16 kernels are guarded by the following predicate:
#if __CUDA_ARCH__ >= 800
...
#endif
This makes the kernels available only when the CUDA arch targeted by the nvcc compiler is at least 8.0 (i.e. __CUDA_ARCH__ >= 800), which is likely not the case in your setup. It would be interesting to see which value __CUDA_ARCH__ has in your case, as well as the output of the nvidia-smi --query-gpu=compute_cap --format=csv command.
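To relate nvidia-smi's compute_cap values to that guard, here is a minimal shell sketch. The cap-to-arch conversion is illustrative only (it assumes a single-digit minor version, which holds for current GPUs) and is not candle's actual build logic: 6.1 maps to 610, 8.9 to 890, and the same >= 800 predicate decides whether the bf16 kernels exist.

```shell
# Illustrative only: map nvidia-smi's compute_cap string (e.g. "8.9") to the
# integer form __CUDA_ARCH__ uses (890), then apply the same >= 800 predicate
# that gates the bf16 kernels.
cap_to_arch() { echo "$(echo "$1" | tr -d '.')0"; }
has_bf16() { [ "$(cap_to_arch "$1")" -ge 800 ]; }

for cap in 6.1 8.9; do
  if has_bf16 "$cap"; then
    echo "compute cap $cap: bf16 kernels compiled in"
  else
    echo "compute cap $cap: bf16 kernels compiled out"
  fi
done
```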
This machine has 2 GPUs. When I run the Stable Diffusion example it uses the RTX 4000 Ada with its 8.9 compute cap.
$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
6.1
8.9
How do I see the value of __CUDA_ARCH__?
That first GPU (compute cap 6.1) is most likely causing the issue. Did you try using CUDA_VISIBLE_DEVICES so that candle can only see the second GPU? (If you're not familiar with it, it's not a candle-specific thing, so a quick search will show how to use it.)
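For reference, a minimal sketch of how that would look here, assuming the 8.9 card is enumerated as index 1 (CUDA's enumeration order can differ from nvidia-smi's, so this index is an assumption to verify):

```shell
# CUDA_VISIBLE_DEVICES filters which GPUs the CUDA runtime exposes to the
# process; the visible card is then renumbered as cuda:0 inside candle.
export CUDA_VISIBLE_DEVICES=1   # assumed index of the compute-cap 8.9 card
sh -c 'echo "child process sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# Then rerun the example, e.g.:
#   CUDA_VISIBLE_DEVICES=1 cargo run --features cuda,cudnn --example flux -r -- \
#     --quantized --height 1024 --width 1024 --prompt "a rusty robot on a beach"
```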
When CUDA_VISIBLE_DEVICES is set to the correct device, and nvidia-smi -l 1 realtime monitoring confirms the correct GPU is being used, memory usage climbs to about 9GB and then the same error happens in the middle of the image generation:
Running `target/release/examples/flux --height 1024 --width 1024 --prompt 'a rusty robot walking on a beach holding a small torch, the robot has the word rust written on it, high quality, 4k' --quantized`
[[ 3, 9, 3, 9277, 63, 7567, 3214, 30, 3, 9, 2608,
3609, 3, 9, 422, 26037, 6, 8, 7567, 65, 8, 1448,
3, 9277, 1545, 30, 34, 6, 306, 463, 6, 314, 157,
    1,    0,    0,  ... (remaining entries are all 0 padding) ...,
    0,    0,    0]]
Tensor[[1, 256], u32, cuda:0]
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading is_u32_bf16
It's probably good to clean your target directory, in case some cached PTX files didn't get rebuilt after setting CUDA_VISIBLE_DEVICES.
I did a clean and it's the same error. I can monitor in nvidia-smi which card is being used; the older card simply runs out of memory right away.
Not sure where to look next.
Hmm, it seems weird that candle could use the older card if CUDA_VISIBLE_DEVICES points only at the new one; that's supposed to be handled by the CUDA framework itself, so it's not something candle could bypass. Maybe you're pointing at the wrong device somehow?
Another option would be to point at CUDA device 1 rather than CUDA device 0 in the code.
Actually it is pointing at the right card; CUDA_VISIBLE_DEVICES works as it should, and as I said above I can confirm in nvidia-smi that the correct GPU is in use. On the correct card it still crashes with the error:
Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading is_u32_bf16
Just checking back in; I have no idea how to troubleshoot this. It works on an A100, but I still get the same error on my RTX 4000 Ada with 20GB and compute cap 8.9:
DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading is_u32_bf16
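One thing that may be worth trying on a mixed-GPU box (hedged: whether candle's kernel build honours a CUDA_COMPUTE_CAP environment variable depends on your candle version, so check its docs first): pin the kernel build to the 8.9 arch so the 6.1 card can't drag the compiled arch below the bf16 threshold, then do a clean rebuild.

```shell
# Sketch under the assumption that candle's CUDA kernel build reads the
# CUDA_COMPUTE_CAP environment variable (verify against your candle version).
export CUDA_COMPUTE_CAP=89      # compile kernels for the 8.9 Ada card
export CUDA_VISIBLE_DEVICES=1   # assumed index of that card
# Clean rebuild so no PTX compiled for the 6.1 arch is reused:
#   cargo clean
#   cargo run --features cuda,cudnn --example flux -r -- --quantized \
#     --height 1024 --width 1024 --prompt "a rusty robot on a beach"
```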
And on a Mac M1 we get the error:
Error while loading function: "Function 'cast_f32_bf16' does not exist"))