
Provided pooled_prompt_embeds is overwritten via prompt_embeds[0]

Open cloneofsimo opened this issue 1 year ago • 10 comments

https://github.com/huggingface/diffusers/blob/25caf24ef90fc44074f4fd3712f6ed5a1db4a5c3/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L386

Simple fix:

pooled_prompt_embeds = prompt_embeds[0] if pooled_prompt_embeds is None else pooled_prompt_embeds
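
For context, a minimal sketch of the guard in place, assuming it stays inside the block where prompt_embeds has just been recomputed; the helper below is hypothetical (text_encoder, text_input_ids, and device stand in for the per-encoder loop variables in the linked file), not a drop-in patch:

    def encode_one_encoder(text_encoder, text_input_ids, device, pooled_prompt_embeds=None):
        # Hypothetical per-encoder helper. Today the pipeline effectively runs
        # `pooled_prompt_embeds = prompt_embeds[0]` unconditionally, which clobbers a
        # caller-supplied value; the guard below only derives it when none was provided.
        out = text_encoder(text_input_ids.to(device), output_hidden_states=True)
        if pooled_prompt_embeds is None:
            pooled_prompt_embeds = out[0]
        prompt_embeds = out.hidden_states[-2]  # penultimate-layer hidden states, as in the pipeline
        return prompt_embeds, pooled_prompt_embeds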

Sorry this isn't a pr :P

cloneofsimo avatar Mar 18 '24 03:03 cloneofsimo

Golden catch. Please PR it.

sayakpaul avatar Mar 18 '24 09:03 sayakpaul

The SDXL training scripts are broken in this same way as well, @patrickvonplaten @sayakpaul, which means the models trained with diffusers are likely all broken.

bghira avatar Mar 21 '24 17:03 bghira

The prompt embed code also pools incorrectly. As tested by @AmericanPresidentJimmyCarter, it looks like the correct method would be to pool the final layer while sampling hidden states from the penultimate layer.

Currently, it just retrieves the first token and treats it as if it were the pooled embed. The same thing happens for the negative prompt.

It should run mean(dim=1) over the final layer instead.
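
For reference, a rough sketch contrasting the value the pipeline currently indexes with the mean pooling suggested here; the checkpoint and subfolder names follow the SDXL base repo layout and are only illustrative:

    import torch
    from transformers import CLIPTextModelWithProjection, CLIPTokenizer

    repo = "stabilityai/stable-diffusion-xl-base-1.0"
    tok = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")
    enc = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")
    ids = tok("a photo of a cat", padding="max_length",
              max_length=tok.model_max_length, return_tensors="pt").input_ids
    with torch.no_grad():
        out = enc(ids, output_hidden_states=True)
    prompt_embeds = out.hidden_states[-2]                  # penultimate layer, (1, 77, 1280)
    pooled_via_index = out[0]                              # what [0] resolves to here: projected pooled output, (1, 1280)
    pooled_via_mean = out.hidden_states[-1].mean(dim=1)    # mean over tokens, as suggested above, (1, 1280)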

bghira avatar Mar 21 '24 18:03 bghira

So I was unsure how it was pooled before, and we were going through the code trying to figure it out. It seems like

pooled_prompt_embeds = prompt_embeds[0]

Should really be

    pooled_prompt_embeds = list(filter(lambda x: x is not None, [
        getattr(prompt_embeds, 'pooler_output', None),
        getattr(prompt_embeds, 'text_embeds', None),
    ]))[0]

For clarity: the two CLIP models return completely different output classes and keep their pooled outputs in different properties. My original concern was that, for one of the CLIP models, [0] might be the last hidden state rather than the pooled output, in which case we would actually be selecting the first token instead of the pooled vector. Simply referencing it as [0] is extremely unclear given the nature of the output from the two CLIP models.
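
To make the ambiguity concrete, a small illustration (again using the SDXL base repo layout; the checkpoint name is only an example, and one tokenizer is reused for brevity) of how the two text encoders expose their pooled outputs differently and what a bare [0] resolves to for each:

    import torch
    from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

    repo = "stabilityai/stable-diffusion-xl-base-1.0"
    tok = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
    enc_1 = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")                   # returns BaseModelOutputWithPooling
    enc_2 = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")   # returns CLIPTextModelOutput
    ids = tok("a photo of a cat", return_tensors="pt").input_ids
    with torch.no_grad():
        out_1, out_2 = enc_1(ids), enc_2(ids)
    # Element 0 of the first encoder's output is last_hidden_state, not a pooled vector;
    # its pooled vector lives in `pooler_output`.
    print(out_1[0].shape, out_1.pooler_output.shape)   # (1, seq_len, 768)  (1, 768)
    # Element 0 of the second encoder's output is `text_embeds`, the projected pooled
    # output; this class has no `pooler_output` attribute, hence the getattr fallbacks above.
    print(out_2[0].shape, out_2.text_embeds.shape)     # (1, 1280)          (1, 1280)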

The order in which the tokenizers and text encoders are passed matters, so I think the implementation is correct. If a clarifying comment would help, please file a PR; more than happy to work on that as a priority.
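
A rough sketch of why the ordering matters, assuming (as the pipeline does) that the encoders are iterated in the order they are passed and the pooled output is reassigned on each iteration; the helper below is hypothetical, not the real encode_prompt:

    from typing import List
    import torch

    def pooled_from_last_encoder(text_encoders: List[torch.nn.Module], ids: torch.Tensor) -> torch.Tensor:
        # The pooled output is overwritten on every iteration, so the value that survives
        # comes from the last encoder passed in. For SDXL that is CLIPTextModelWithProjection,
        # whose output[0] is text_embeds, a genuinely pooled (and projected) vector.
        pooled = None
        for text_encoder in text_encoders:
            out = text_encoder(ids, output_hidden_states=True)
            pooled = out[0]
        return pooled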

> Simply referencing it as [0] is extremely unclear given the nature of the output from the two CLIP models.

The reason it's there is that it helps with torch.compile(); otherwise there's a TensorVariableAccess problem.

sayakpaul avatar Mar 22 '24 02:03 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 17 '24 15:04 github-actions[bot]

not stale

bghira avatar Apr 17 '24 16:04 bghira

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 12 '24 15:05 github-actions[bot]

still not stale :D

bghira avatar May 12 '24 16:05 bghira

@sayakpaul I opened the pull request for this, but the code in question only runs when prompt_embeds is None.

Do we want to mix and match provided pooled embeds with generated prompt embeds?
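
To make the question concrete, the "mix and match" case is a caller who supplies pooled_prompt_embeds but not prompt_embeds; a hedged sketch of that call (it assumes a CUDA GPU, downloads the SDXL base checkpoint, and passes a random placeholder tensor purely to exercise the code path):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    # Only the pooled embedding is provided; prompt_embeds is still generated from the prompt,
    # so encode_prompt has to decide whether to honor this tensor or recompute it.
    my_pooled = torch.randn(1, 1280, dtype=torch.float16, device="cuda")
    image = pipe("a photo of a cat", pooled_prompt_embeds=my_pooled, num_inference_steps=2).images[0]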

bghira avatar May 12 '24 16:05 bghira

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 14 '24 15:09 github-actions[bot]