
Kandinsky5TimeEmbeddings hardcodes 'cuda' in @torch.autocast decorator, causing warning on non-CUDA systems

Open knd0331 opened this issue 3 weeks ago • 6 comments

Body:

Describe the bug

When importing diffusers on a non-CUDA system (e.g., Apple Silicon Mac with MPS), a warning is emitted:

/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(

This occurs because the Kandinsky5TimeEmbeddings class hardcodes device_type="cuda" in its @torch.autocast decorator.

Location

File: diffusers/models/transformers/transformer_kandinsky.py Line: 168

@torch.autocast(device_type="cuda", dtype=torch.float32)
def forward(self, timestep):

Root Cause

The decorator is evaluated at import time, not at call time: torch.autocast(...) is instantiated when the class body executes, and its constructor emits the warning. On systems without CUDA (such as Apple Silicon Macs using MPS), simply importing diffusers triggers the warning, even though the Kandinsky model may never be used.
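A minimal sketch (not the diffusers code) showing why the warning fires at import rather than at call time:

import torch

class TimeEmbedDemo:
    # torch.autocast(...) is constructed right here, while the class body
    # runs at import time; on a CUDA-less machine its __init__ emits
    # "User provided device_type of 'cuda', but CUDA is not available."
    @torch.autocast(device_type="cuda", dtype=torch.float32)
    def forward(self, timestep):
        return timestep  # never needs to be called for the warning to appear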

Reproduction

# On a Mac with Apple Silicon (no CUDA)
from diffusers import ZImagePipeline  # or any pipeline
# Warning appears immediately on import

Expected behavior

No warning should appear when importing diffusers on non-CUDA systems.

Environment

- OS: macOS (Apple Silicon M-series)
- Python: 3.13
- PyTorch: 2.x (MPS backend)
- Diffusers: latest

knd0331 · Dec 09 '25 04:12

I was able to reproduce the above UserWarning on a non-CUDA setup.

In addition to Kandinsky5TimeEmbeddings, I noticed the Kandinsky5Modulation class uses the same decorator with a hardcoded device_type="cuda", so it contributes to the warning as well.

Looking at the comment here by @leffff, the intent was to force these operations to run in float32 to prevent precision loss and NaNs during mixed-precision training.

We can remove the @torch.autocast decorator from both classes and explicitly cast the inputs to float32 inside their forward methods (e.g., time = time.to(dtype=torch.float32)). I verified locally that this removes the warning.
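For illustration, a minimal sketch of that shape of change, with a simplified forward (the real classes do more work around the cast):

import torch

class TimeEmbeddingsNoAutocast(torch.nn.Module):
    # No @torch.autocast decorator, so nothing device-specific runs at import.
    def forward(self, timestep: torch.Tensor) -> torch.Tensor:
        # Explicit upcast replaces the CUDA-only fp32 autocast region.
        return timestep.to(dtype=torch.float32)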

Does this sound good? @leffff @yiyixuxu

adi776borate · Dec 09 '25 06:12

btw, this is not just a warning; it causes an actual failure down the road on torch-xpu:

D:\sdnext\venv\Lib\site-packages\diffusers\models\transformers\transformer_kandinsky.py:172 in forward

  171   time_embed = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
❱ 172   time_embed = self.out_layer(self.activation(self.in_layer(time_embed)))
  173   return time_embed

RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
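In isolation, the same dtype clash looks like this (a minimal sketch, not diffusers code): with no autocast region active, an fp32 activation reaches bf16 Linear weights and the matmul refuses the mixed dtypes.

import torch

layer = torch.nn.Linear(4, 4).to(torch.bfloat16)  # bf16 weights, as under mixed-precision loading
x = torch.randn(2, 4, dtype=torch.float32)        # fp32 input, nothing recasts it
layer(x)  # RuntimeError: mat1 and mat2 must have the same dtype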

vladmandic · Dec 09 '25 09:12

Hi! I see the problem. If you have fixes you want to propose, please create a pull request. However, does Flex Attention work fine on non-CUDA systems?

leffff · Dec 09 '25 10:12

Hi! I see the problem. If you have fixes you want to propose, please create a pull request. However, does Flex Attention work fine on non-CUDA systems?

I have gone through the PyTorch source. According to it, Flex Attention on CPU is supported only with the AVX2 instruction set, and not on macOS.

adi776borate · Dec 09 '25 13:12

Thanks for pointing that out @vladmandic. To fix this, we should upcast the input to fp32, compute the embeddings, and downcast the result back to the weights' dtype before passing it to the Linear layer. I'll open a PR for this; a sketch of the idea is below.
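A hedged sketch of that fix. The layer names (in_layer, activation, out_layer) come from the traceback above; the constructor and the exact sinusoidal math are assumptions, not the actual diffusers implementation:

import math
import torch
from torch import nn

class Kandinsky5TimeEmbeddingsSketch(nn.Module):
    # Hypothetical stand-in; layer names taken from the traceback above.
    def __init__(self, dim: int, max_period: float = 10000.0):
        super().__init__()
        self.in_layer = nn.Linear(dim, dim)
        self.activation = nn.SiLU()
        self.out_layer = nn.Linear(dim, dim)
        self.max_period = max_period

    # No @torch.autocast decorator: nothing CUDA-specific at import time.
    def forward(self, timestep: torch.Tensor) -> torch.Tensor:
        # 1) Upcast so the sinusoidal math runs in fp32 on every backend.
        time = timestep.to(dtype=torch.float32)
        half = self.in_layer.in_features // 2
        freqs = torch.exp(
            -math.log(self.max_period)
            * torch.arange(half, dtype=torch.float32, device=time.device) / half
        )
        args = time[:, None] * freqs[None, :]
        time_embed = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        # 2) Downcast to the weights' dtype so mat1/mat2 match under bf16
        #    (avoids the "Float and BFloat16" RuntimeError above).
        time_embed = time_embed.to(dtype=self.in_layer.weight.dtype)
        return self.out_layer(self.activation(self.in_layer(time_embed)))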

adi776borate · Dec 09 '25 13:12

When you open the PR, please tag me and provide examples of before/after generations from the same noise, so we can make sure the results are stable.

leffff · Dec 09 '25 14:12