diffusers
Stable Diffusion: providing a list of positive prompts and a list of negative prompts does not work as expected
Describe the bug
See this forum post: https://discuss.huggingface.co/t/stable-diffusion-bs-1-uses-negative-as-prompt/24130.
In short:
_ = pipe(["frog"]*2, negative_prompt=["bird"]*2)
Reaches this condition: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L247. As expected, negative_prompt is a list with the same cardinality as prompt (otherwise an exception would have been raised), so uncond_input.input_ids has shape (2, 77) and the resulting uncond_embeddings have shape (2, 77, 768). But then the repeat of the unconditional embeddings (https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L247) turns them into (4, 77, 768), because batch_size is 2. The concatenation on the next line therefore creates text_embeddings with shape (6, 77, 768). The first 4 entries correspond to the negative prompt, and those are what get passed to the model, because the latents are correctly computed with shape (2, 4, 64, 64) (later expanded to batch size 4 during classifier-free guidance).
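To make the shape arithmetic concrete, here is a toy sketch with dummy tensors (plain PyTorch, not the actual pipeline code; the random embeddings stand in for the CLIP text encoder output):

import torch

batch_size, seq_len, hidden = 2, 77, 768

# stand-ins for the encoder outputs of ["frog", "frog"] and ["bird", "bird"]
text_embeddings = torch.randn(batch_size, seq_len, hidden)    # (2, 77, 768)
uncond_embeddings = torch.randn(batch_size, seq_len, hidden)  # (2, 77, 768), already batched

# the pipeline repeats the unconditional embeddings by batch_size even though
# they already contain one row per prompt, so the batch dimension doubles
uncond_embeddings = uncond_embeddings.repeat(batch_size, 1, 1)
print(uncond_embeddings.shape)  # torch.Size([4, 77, 768])

embeddings = torch.cat([uncond_embeddings, text_embeddings])
print(embeddings.shape)         # torch.Size([6, 77, 768]), expected (4, 77, 768)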
Reproduction
I could reproduce as explained above.
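A self-contained repro would look roughly like this (the checkpoint and dtype are illustrative choices, not necessarily what the forum poster used):

import torch
from diffusers import StableDiffusionPipeline

# checkpoint chosen only for illustration; any Stable Diffusion checkpoint should hit the same code path
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# a batch of positive prompts with an equally sized list of negative prompts
images = pipe(["frog"] * 2, negative_prompt=["bird"] * 2).images
# expected: two frog images guided away from "bird"; observed: the negative prompt
# dominates (or a shape error, depending on the diffusers revision)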
Logs
No response
System Info
diffusers @ main (f3983d16eed57e46742d217363d8913bef7f748d)
Any ideas on how we can solve it?
Running into the same issue. It seems the batch dimension is expanded in two places, which causes the problem.
The code expects the negative prompt list to be the same length as the positive prompt list, i.e.
_ = pipe([positive_prompt]*bs, negative_prompt=[negative_prompt])
throws an error
`negative_prompt`: [negative_prompt] has batch size 1, but `prompt`: [positive_prompt, positive_prompt, ...] has batch size bs
but if you instead run
_ = pipe([positive_prompt]*bs, negative_prompt=[negative_prompt]*bs)
then the following section repeats along the batch dimension unnecessarily, which breaks the shapes during the subsequent view:
uncond_input = self.tokenizer(
    uncond_tokens,
    padding="max_length",
    max_length=max_length,
    truncation=True,
    return_tensors="pt",
)
uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0]

seq_len = uncond_embeddings.shape[1]
uncond_embeddings = uncond_embeddings.repeat(batch_size, num_images_per_prompt, 1)
uncond_embeddings = uncond_embeddings.view(batch_size * num_images_per_prompt, seq_len, -1)
When I run this, my uncond_embeddings end up with a size of [bs, sl, 768*bs], which causes an error when concatenating with the positive embeddings.
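For what it is worth, a minimal sketch of one possible fix (reusing the names from the snippet above, and assuming uncond_embeddings already holds one row per prompt, so only num_images_per_prompt needs to be repeated) would be:

# sketch only, not the actual patch that landed
seq_len = uncond_embeddings.shape[1]
uncond_embeddings = uncond_embeddings.repeat(1, num_images_per_prompt, 1)
uncond_embeddings = uncond_embeddings.view(batch_size * num_images_per_prompt, seq_len, -1)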
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think this was resolved in #1120.