Metal bug: moving the image from GPU to CPU hangs the whole system

Open super-fun-surf opened this issue 6 months ago • 3 comments

Using the stable diffusion example to run SDXL on CUDA vs. Metal: creating the image on an RTX 4000 Ada using CUDA takes about 1 second per step. Creating the image on an M1 with 16 GB of shared memory is roughly 16x slower, at about 16 seconds per step. Since the GPU is not fully maxed out on Metal yet, this makes sense; however, there seems to be a bug when transferring the image from the GPU back to the CPU.

On the CUDA machine the transfer takes 0.149 seconds; on the M1 it takes anywhere from 36 to 400 seconds and completely freezes the host OS.

I made a branch with a timer in place at https://github.com/AIFX-Art/candle/tree/gpu-timing
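For readers who do not want to check out the branch, the timing idea can be sketched roughly as follows. This is a minimal sketch using only the Rust standard library, not the code from the branch: `timed` and `copy_image_to_cpu` are hypothetical names, and the copy is a plain in-memory stand-in for the real device-to-CPU transfer.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Hypothetical stand-in for the device-to-CPU copy. In the real example this
/// would be the call that moves the decoded image tensor to the CPU.
fn copy_image_to_cpu(pixels: &[u8]) -> Vec<u8> {
    pixels.to_vec()
}

fn main() {
    // Placeholder buffer sized like a 3 x 1024 x 1024 8-bit image.
    let gpu_image = vec![0u8; 3 * 1024 * 1024];
    let (cpu_image, elapsed) = timed(|| copy_image_to_cpu(&gpu_image));
    println!("Image to CPU {}s", elapsed.as_secs_f32());
    assert_eq!(cpu_image.len(), gpu_image.len());
}
```

A timer like this measures the wall-clock time of the whole call, including any waiting the call does internally.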

CUDA:

cargo run --release --features cuda,cudnn --example stable-diffusion -- --sd-version xl --n-steps 8 --width 1024 --height 1024 --use-f16

runs and outputs

Tensor[dims 2, 77, 2048; f16, cuda:0]
Building the autoencoder.
Building the unet.
starting sampling
step 1/8 done, 1.24s
step 2/8 done, 0.68s
step 3/8 done, 1.01s
step 4/8 done, 1.01s
step 5/8 done, 1.01s
step 6/8 done, 1.01s
step 7/8 done, 1.01s
step 8/8 done, 1.01s
Generating the final image for sample 1/1.
Image to CPU 0.14912221s

Metal:

cargo run --release --features metal  --example stable-diffusion -- --sd-version xl --n-steps 8 --width 1024 --height 1024 --use-f16

runs and outputs

Tensor[dims 2, 77, 2048; f16, metal:4294969334]
Building the autoencoder.
Building the unet.
starting sampling
step 1/8 done, 3.68s
step 2/8 done, 14.28s
step 3/8 done, 16.89s
step 4/8 done, 16.22s
step 5/8 done, 16.73s
step 6/8 done, 17.44s
step 7/8 done, 16.68s
step 8/8 done, 16.81s
Generating the final image for sample 1/1.
Image to CPU 46.37571s
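
To put the two logs side by side, the slowdown factors can be computed directly from the printed timings. This is a small illustrative calculation; `slowdown` is a hypothetical helper, and the inputs are the final step time and the "Image to CPU" time copied from each log.

```rust
/// Ratio of two timings, slow over fast.
fn slowdown(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}

fn main() {
    // "Image to CPU" timings: Metal vs. CUDA.
    let transfer = slowdown(46.37571, 0.14912221);
    // Step 8/8 timings: Metal vs. CUDA.
    let step = slowdown(16.81, 1.01);
    println!("transfer: {:.0}x slower, step: {:.1}x slower", transfer, step);
}
```

The transfer comes out about 311x slower on Metal, while a sampling step is only about 16.6x slower, so the transfer is out of proportion with the general speed difference between the two machines.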

super-fun-surf, Jul 31 '24 00:07