Deep Dive NB: Quick Fix for AttributeError: 'CLIPTextTransformer' object has no attribute '_build_causal_attention_mask'
In the Stable Diffusion Deep Dive notebook, in the code cell immediately following the Transformer diagram, there is the definition of get_output_embeds, which includes a call to text_encoder.text_model._build_causal_attention_mask:
def get_output_embeds(input_embeddings):
    # CLIP's text model uses causal mask, so we prepare it here:
    bsz, seq_len = input_embeddings.shape[:2]
    causal_attention_mask = text_encoder.text_model._build_causal_attention_mask(bsz, seq_len, dtype=input_embeddings.dtype)
    ...
That line currently generates an error for me when I run the notebook on Colab (from a fresh instance) or on my home computer:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-33-dbb74b7ec9b4> in <cell line: 26>()
24 return output
25
---> 26 out_embs_test = get_output_embeds(input_embeddings) # Feed through the model with our new function
27 print(out_embs_test.shape) # Check the output shape
28 out_embs_test # Inspect the output
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
1612 if name in modules:
1613 return modules[name]
-> 1614 raise AttributeError("'{}' object has no attribute '{}'".format(
1615 type(self).__name__, name))
1616
AttributeError: 'CLIPTextTransformer' object has no attribute '_build_causal_attention_mask'
Everything in the notebook prior to that line runs fine.
Perhaps I'm doing something wrong, or perhaps something has changed in the HF libraries being used since the notebook was originally written?
UPDATE:
I see the same issue here: https://github.com/drboog/ProFusion/issues/12. It seems that transformers has changed. Downgrading to version 4.25.1 fixed the problem.
Thus, changing the pip install line at the top of the notebook to

!pip install -q --upgrade transformers==4.25.1 diffusers ftfy

...will restore full functionality.
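(On Colab you may need to restart the runtime after the install for the pin to take effect; a quick sanity check:)

import transformers
print(transformers.__version__)  # should print 4.25.1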
Feel free to close this issue at your convenience. Perhaps a PR is in order. Presumably some way to keep up to date with transformers would be preferable, but for now this is a quick fix.
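One version-agnostic possibility (a minimal sketch, assuming the only thing needed from the removed private method is the causal mask; fallback_causal_attention_mask is a name introduced here for illustration) would be to fall back to a local reimplementation inside get_output_embeds whenever the attribute is missing:

import torch

def fallback_causal_attention_mask(bsz, seq_len, dtype):
    # Additive causal mask: most-negative values strictly above the diagonal,
    # zeros on and below it, shaped (bsz, 1, seq_len, seq_len).
    mask = torch.full((bsz, seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    mask.triu_(1)
    return mask.unsqueeze(1)

# Inside get_output_embeds: use the private method if this transformers
# version still has it, otherwise use the local fallback above.
build_mask = getattr(text_encoder.text_model, "_build_causal_attention_mask",
                     fallback_causal_attention_mask)
causal_attention_mask = build_mask(bsz, seq_len, dtype=input_embeddings.dtype)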
@drscotthawley Another fix, without having to downgrade, could be to use what the function _build_causal_attention_mask used to do (from here):
def build_causal_attention_mask(bsz, seq_len, dtype):
    # lazily create causal attention mask, with full attention between the vision tokens
    # pytorch uses additive attention mask; fill with -inf
    mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype)
    mask.fill_(torch.tensor(torch.finfo(dtype).min))
    mask.triu_(1)  # zero out the lower diagonal
    mask = mask.unsqueeze(1)  # expand mask
    return mask
PS: Thanks for the mps support changes
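Putting the pieces together, here is a complete replacement cell (same logic as the notebook, just with the local mask builder swapped in):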
import torch

def build_causal_attention_mask(bsz, seq_len, dtype):
    mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype)
    mask.fill_(torch.tensor(torch.finfo(dtype).min))  # fill with a large negative number (acts like -inf)
    mask = mask.triu_(1)  # zero out the lower triangle to enforce causality
    return mask.unsqueeze(1)  # add a broadcast dimension for the attention heads

# Update get_output_embeds to use the new mask function
def get_output_embeds(input_embeddings):
    bsz, seq_len = input_embeddings.shape[:2]
    causal_attention_mask = build_causal_attention_mask(bsz, seq_len, dtype=input_embeddings.dtype)
    # Getting the output embeddings involves calling the model with output_hidden_states=True
    # so that it doesn't just return the pooled final predictions:
    encoder_outputs = text_encoder.text_model.encoder(
        inputs_embeds=input_embeddings,
        attention_mask=None,  # We aren't using an attention mask, so that can be None
        causal_attention_mask=causal_attention_mask.to(torch_device),
        output_attentions=None,
        output_hidden_states=True,  # We want the output embs, not the final output
        return_dict=None,
    )
    # We're interested in the output hidden state only
    output = encoder_outputs[0]
    # There is a final layer norm we need to pass these through
    output = text_encoder.text_model.final_layer_norm(output)
    # And now they're ready!
    return output

out_embs_test = get_output_embeds(input_embeddings)  # Feed through the model with our new function
print(out_embs_test.shape)  # Check the output shape
out_embs_test  # Inspect the output
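For reference, with the standard SD v1.x CLIP text encoder and a single prompt padded to 77 tokens (the notebook's default), the printed shape should be torch.Size([1, 77, 768]).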
If you install a newer transformers version, e.g. 4.40.2, you can implement it as below:
from transformers.modeling_attn_mask_utils import _create_4d_causal_attention_mask

def get_output_embeds(input_embeddings):
    bsz, seq_len = input_embeddings.shape[:2]
    causal_attention_mask = _create_4d_causal_attention_mask((bsz, seq_len), dtype=input_embeddings.dtype, device=torch_device)
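The rest of the function is unchanged from the version above; for completeness, a sketch of how the body continues (note the mask is created directly on torch_device here, so no extra .to() call is needed):

    encoder_outputs = text_encoder.text_model.encoder(
        inputs_embeds=input_embeddings,
        attention_mask=None,
        causal_attention_mask=causal_attention_mask,
        output_attentions=None,
        output_hidden_states=True,
        return_dict=None,
    )
    output = encoder_outputs[0]  # the last hidden state
    output = text_encoder.text_model.final_layer_norm(output)
    return output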