candle
Metal bug: moving the image from GPU to CPU hangs the whole system
I am using the stable diffusion example to run SDXL on CUDA vs. Metal. Creating the image on an RTX 4000 Ada using CUDA takes about 1 second per step. On an M1 with 16GB of shared memory it is roughly 15x slower, at about 16 seconds per step. Since the Metal backend does not fully max out the GPU yet, this makes sense. However, there seems to be a bug when transferring the image from the GPU back to the CPU.
On the CUDA machine the transfer takes 0.149 seconds. On the M1 it takes anywhere from 36 to 400 seconds, and it completely freezes the host OS.
I made a branch with a timer in place at https://github.com/AIFX-Art/candle/tree/gpu-timing
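For readers who do not want to check out the branch, the kind of timer involved can be sketched as follows. This is a minimal illustration using only the standard library, not the code from the branch: the helper name `timed` is hypothetical, and the closure is a stand-in for the real device-to-host copy (in the candle example that would presumably be a call such as `image.to_device(&Device::Cpu)`). The output format is assumed from the `Image to CPU ...s` lines in the logs.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
/// `timed` is a hypothetical helper name used only for this sketch.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    // Stand-in workload: allocate a buffer the size of a 1024x1024 RGB image.
    // In the real example this closure would wrap the GPU-to-CPU transfer
    // (assumed call site, not copied from the branch).
    let (pixels, dt) = timed(|| vec![0u8; 3 * 1024 * 1024]);

    // Print the duration as fractional seconds, similar to the log lines in this issue.
    println!("Image to CPU {}s", dt.as_secs_f32());
    println!("copied {} bytes", pixels.len());
}
```

Note that GPU backends generally queue work asynchronously, so a timer placed around the copy can also include time spent waiting for earlier queued kernels to finish.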
CUDA:
cargo run --release --features cuda,cudnn --example stable-diffusion -- --sd-version xl --n-steps 8 --width 1024 --height 1024 --use-f16
runs and outputs
Tensor[dims 2, 77, 2048; f16, cuda:0]
Building the autoencoder.
Building the unet.
starting sampling
step 1/8 done, 1.24s
step 2/8 done, 0.68s
step 3/8 done, 1.01s
step 4/8 done, 1.01s
step 5/8 done, 1.01s
step 6/8 done, 1.01s
step 7/8 done, 1.01s
step 8/8 done, 1.01s
Generating the final image for sample 1/1.
Image to CPU 0.14912221s
Metal:
cargo run --release --features metal --example stable-diffusion -- --sd-version xl --n-steps 8 --width 1024 --height 1024 --use-f16
runs and outputs
Tensor[dims 2, 77, 2048; f16, metal:4294969334]
Building the autoencoder.
Building the unet.
starting sampling
step 1/8 done, 3.68s
step 2/8 done, 14.28s
step 3/8 done, 16.89s
step 4/8 done, 16.22s
step 5/8 done, 16.73s
step 6/8 done, 17.44s
step 7/8 done, 16.68s
step 8/8 done, 16.81s
Generating the final image for sample 1/1.
Image to CPU 46.37571s
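To put a number on the per-step gap, the two logs above can be averaged with a few lines of Rust. This is only a sketch for comparing the logs: the function names `step_seconds` and `mean_step_seconds` are hypothetical, and the parser assumes the exact `step N/M done, X.XXs` line format shown in this issue.

```rust
/// Extracts the seconds value from a line such as "step 3/8 done, 1.01s".
/// Returns None for any line that does not match that shape.
fn step_seconds(line: &str) -> Option<f64> {
    let rest = line.trim().strip_prefix("step ")?;
    let (_, secs) = rest.split_once("done, ")?;
    secs.strip_suffix('s')?.parse::<f64>().ok()
}

/// Mean per-step time over all step lines in a log, or None if there are none.
fn mean_step_seconds(log: &str) -> Option<f64> {
    let times: Vec<f64> = log.lines().filter_map(step_seconds).collect();
    if times.is_empty() {
        return None;
    }
    Some(times.iter().sum::<f64>() / times.len() as f64)
}

// Step lines copied from the CUDA and Metal runs in this issue.
const CUDA_LOG: &str = "\
step 1/8 done, 1.24s
step 2/8 done, 0.68s
step 3/8 done, 1.01s
step 4/8 done, 1.01s
step 5/8 done, 1.01s
step 6/8 done, 1.01s
step 7/8 done, 1.01s
step 8/8 done, 1.01s";

const METAL_LOG: &str = "\
step 1/8 done, 3.68s
step 2/8 done, 14.28s
step 3/8 done, 16.89s
step 4/8 done, 16.22s
step 5/8 done, 16.73s
step 6/8 done, 17.44s
step 7/8 done, 16.68s
step 8/8 done, 16.81s";

fn main() {
    let cuda = mean_step_seconds(CUDA_LOG).expect("no step lines in CUDA log");
    let metal = mean_step_seconds(METAL_LOG).expect("no step lines in Metal log");
    println!("CUDA mean step:  {:.2}s", cuda);
    println!("Metal mean step: {:.2}s", metal);
    println!("Metal / CUDA:    {:.1}x", metal / cuda);
}
```

The mean includes the first Metal step, which is much faster than the rest, so the steady-state ratio is somewhat higher than the overall average.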