Possible bug in `DisentangledSelfAttention`?
Thanks for the great work and for open-sourcing your model's codebase!
I've been working on a project that involves fine-tuning your model and ran into one issue. While fine-tuning with DDP I was getting errors about some parameters being unused, and I found that all of the modules that require gradients but never receive any are the `ss_q_proj` layers in each encoder layer's attention block (i.e. `prosst.encoder.layer[0-11].attention.self.ss_q_proj.weight` and `prosst.encoder.layer[0-11].attention.self.ss_q_proj.bias`).
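For context, here's roughly the check I used to find them after a single training step (just a sketch; `model`, `batch`, and `loss_fn` are placeholders for my actual DDP-wrapped model, inputs, and loss):

```python
# Hypothetical training step: `model`, `batch`, and `loss_fn` stand in
# for my real DDP-wrapped model, data batch, and loss function.
loss = loss_fn(model(**batch))
loss.backward()

# Parameters that require gradients but received none after backward().
unused = [
    name
    for name, param in model.named_parameters()
    if param.requires_grad and param.grad is None
]
print(unused)  # only the ss_q_proj weights and biases show up for me
```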
Looking into the code here, I noticed that in this block:
if "ss2aa" in self.pos_att_type:
assert ss_hidden_states is not None
ss_query_layer = self.ss_q_proj(ss_hidden_states)
ss_query_layer = self.transpose_for_scores(ss_query_layer)
ss_query_layer /= torch.sqrt(
torch.tensor(ss_query_layer.size(-1), dtype=torch.float)
* scale_factor
)
ss2aa_att = torch.matmul(
key_layer, query_layer.transpose(-1, -2).to(dtype=key_layer.dtype)
)
score += ss2aa_att
it looks like `ss_query_layer` is created but never used. Should `key_layer, query_layer.transpose(-1, -2).to(dtype=key_layer.dtype)` in the `torch.matmul` call instead be `key_layer, ss_query_layer.transpose(-1, -2).to(dtype=key_layer.dtype)`? That seems more in line with what the code appears intended to do.
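For what it's worth, this is the change I had in mind (only a sketch of what I think was intended, not a tested patch):

```python
if "ss2aa" in self.pos_att_type:
    assert ss_hidden_states is not None
    ss_query_layer = self.ss_q_proj(ss_hidden_states)
    ss_query_layer = self.transpose_for_scores(ss_query_layer)
    ss_query_layer /= torch.sqrt(
        torch.tensor(ss_query_layer.size(-1), dtype=torch.float)
        * scale_factor
    )
    # Use the ss_q_proj output here instead of query_layer, so the
    # projection actually contributes to the attention score.
    ss2aa_att = torch.matmul(
        key_layer, ss_query_layer.transpose(-1, -2).to(dtype=key_layer.dtype)
    )
    score += ss2aa_att
```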
Apologies if I'm misunderstanding the codebase or method!