[testing] tiny-stable-diffusion-pipe throws an error when used with the safety checker
Describe the bug
The tiny stable diffusion pipeline we use for some tests throws an error when run with its safety checker enabled. The error never surfaces in the test suite because we always use the pipeline with the safety checker disabled (safety_checker=None).
https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-pipe
Reproduction
In [1]: from diffusers import DiffusionPipeline
In [2]: pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe')
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 11913.38it/s]
The config attributes {'dropout': 0.0} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
In [3]: pipe('foo')
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:11<00:00, 4.23it/s]
[W NNPACK.cpp:53] Could not initialize NNPACK! Reason: Unsupported hardware.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <ipython-input-3-5d36afd6280f>:1 in <module> │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in │
│ decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /Users/will/git/diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py: │
│ 639 in __call__ │
│ │
│ 636 │ │ │ image = self.decode_latents(latents) │
│ 637 │ │ │ │
│ 638 │ │ │ # 9. Run safety checker │
│ ❱ 639 │ │ │ image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embe │
│ 640 │ │ │ │
│ 641 │ │ │ # 10. Convert to PIL │
│ 642 │ │ │ image = self.numpy_to_pil(image) │
│ │
│ /Users/will/git/diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py: │
│ 360 in run_safety_checker │
│ │
│ 357 │ def run_safety_checker(self, image, device, dtype): │
│ 358 │ │ if self.safety_checker is not None: │
│ 359 │ │ │ safety_checker_input = self.feature_extractor(self.numpy_to_pil(image), retu │
│ ❱ 360 │ │ │ image, has_nsfw_concept = self.safety_checker( │
│ 361 │ │ │ │ images=image, clip_input=safety_checker_input.pixel_values.to(dtype) │
│ 362 │ │ │ ) │
│ 363 │ │ else: │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/torch/nn/modules/module.py:1190 in │
│ _call_impl │
│ │
│ 1187 │ │ # this function, and just call forward. │
│ 1188 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1189 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1190 │ │ │ return forward_call(*input, **kwargs) │
│ 1191 │ │ # Do not call functions when jit is used │
│ 1192 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1193 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in │
│ decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /Users/will/git/diffusers/src/diffusers/pipelines/stable_diffusion/safety_checker.py:51 in │
│ forward │
│ │
│ 48 │ │
│ 49 │ @torch.no_grad() │
│ 50 │ def forward(self, clip_input, images): │
│ ❱ 51 │ │ pooled_output = self.vision_model(clip_input)[1] # pooled_output │
│ 52 │ │ image_embeds = self.visual_projection(pooled_output) │
│ 53 │ │ │
│ 54 │ │ # we always cast to float32 as this does not cause significant overhead and is c │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/torch/nn/modules/module.py:1190 in │
│ _call_impl │
│ │
│ 1187 │ │ # this function, and just call forward. │
│ 1188 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1189 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1190 │ │ │ return forward_call(*input, **kwargs) │
│ 1191 │ │ # Do not call functions when jit is used │
│ 1192 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1193 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/transformers/models/clip/modeling_ │
│ clip.py:929 in forward │
│ │
│ 926 │ │ ```""" │
│ 927 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 928 │ │ │
│ ❱ 929 │ │ return self.vision_model( │
│ 930 │ │ │ pixel_values=pixel_values, │
│ 931 │ │ │ output_attentions=output_attentions, │
│ 932 │ │ │ output_hidden_states=output_hidden_states, │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/torch/nn/modules/module.py:1190 in │
│ _call_impl │
│ │
│ 1187 │ │ # this function, and just call forward. │
│ 1188 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1189 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1190 │ │ │ return forward_call(*input, **kwargs) │
│ 1191 │ │ # Do not call functions when jit is used │
│ 1192 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1193 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/transformers/models/clip/modeling_ │
│ clip.py:854 in forward │
│ │
│ 851 │ │ if pixel_values is None: │
│ 852 │ │ │ raise ValueError("You have to specify pixel_values") │
│ 853 │ │ │
│ ❱ 854 │ │ hidden_states = self.embeddings(pixel_values) │
│ 855 │ │ hidden_states = self.pre_layrnorm(hidden_states) │
│ 856 │ │ │
│ 857 │ │ encoder_outputs = self.encoder( │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/torch/nn/modules/module.py:1190 in │
│ _call_impl │
│ │
│ 1187 │ │ # this function, and just call forward. │
│ 1188 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1189 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1190 │ │ │ return forward_call(*input, **kwargs) │
│ 1191 │ │ # Do not call functions when jit is used │
│ 1192 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1193 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /Users/will/opt/anaconda3/envs/hf/lib/python3.9/site-packages/transformers/models/clip/modeling_ │
│ clip.py:196 in forward │
│ │
│ 193 │ │ │
│ 194 │ │ class_embeds = self.class_embedding.expand(batch_size, 1, -1) │
│ 195 │ │ embeddings = torch.cat([class_embeds, patch_embeds], dim=1) │
│ ❱ 196 │ │ embeddings = embeddings + self.position_embedding(self.position_ids) │
│ 197 │ │ return embeddings │
│ 198 │
│ 199 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (12545) must match the size of tensor b (226) at non-singleton dimension 1
In [4]: pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None)
Fetching 15 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 6754.11it/s]
The config attributes {'dropout': 0.0} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
In [5]: pipe('foo')
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:09<00:00, 5.03it/s]
Out[5]: StableDiffusionPipelineOutput(images=[<PIL.Image.Image image mode=RGB size=64x64 at 0x7FDB359E8310>], nsfw_content_detected=None)
In [6]:
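The error itself suggests a shape mismatch inside the safety checker's tiny CLIP vision model: the resized image yields 12545 embeddings (12544 patch embeddings plus the class token), while the checker's position-embedding table only has 226 positions (225 plus the class token). In other words, the vision config the safety checker was built with does not appear to match the image size the pipeline's feature extractor produces. A minimal diagnostic sketch for comparing the two (the actual config values of the tiny repo are not confirmed here):

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-stable-diffusion-pipe")

# Size the feature extractor resizes generated images to before running the checker
print(pipe.feature_extractor.size)

# Image size and patch size the safety checker's CLIP vision model was configured for
vision_config = pipe.safety_checker.config.vision_config
print(vision_config.image_size, vision_config.patch_size)

# If these disagree, the number of patch embeddings produced from the resized image
# no longer matches the checker's position-embedding table, which is exactly the
# size-mismatch RuntimeError shown above.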
Logs
No response
System Info
n/a
Yeah that's fine :-) I think I just didn't take the time to make a tiny safety checker. The model has random weights and is only used for testing, so I think it's not a big deal.
Yeah, not a big deal. I opened the issue as a reminder to look into it if I get time.
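If anyone picks this up, one possible fix would be to add a tiny safety checker (plus a matching feature extractor) to the testing repo, built from a tiny CLIP config whose vision settings agree with the size the feature extractor resizes to. A rough sketch, with made-up sizes rather than the actual tiny-stable-diffusion-pipe values:

from transformers import CLIPConfig, CLIPTextConfig, CLIPVisionConfig, CLIPFeatureExtractor
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker

# All sizes below are illustrative assumptions for a tiny, random-weight checker.
vision_config = CLIPVisionConfig(
    hidden_size=32,
    intermediate_size=37,
    num_hidden_layers=5,
    num_attention_heads=4,
    image_size=64,
    patch_size=4,
)
text_config = CLIPTextConfig(
    hidden_size=32,
    intermediate_size=37,
    num_hidden_layers=5,
    num_attention_heads=4,
)
config = CLIPConfig.from_text_vision_configs(text_config, vision_config, projection_dim=32)
safety_checker = StableDiffusionSafetyChecker(config)

# The feature extractor saved alongside it should resize to the same image_size,
# otherwise we get exactly the position-embedding mismatch above.
feature_extractor = CLIPFeatureExtractor(size=64, crop_size=64)

The resulting safety_checker and feature_extractor could then be saved into the testing repo next to the existing components.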