dataless-model-merging
How to implement RegMean for GPT-like models?
Is your feature request related to a problem? Please describe. The Hugging Face GPT implementation differs from T5 and RoBERTa: it computes the query, key, and value projections of self-attention in parallel with a single fused layer, as shown below:
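def __init__(self):
    # A single Conv1D produces query, key, and value together (output width n_state * 3).
    self.c_attn = Conv1D(n_state * 3, nx)
    ...
def forward(self, x):
    x = self.c_attn(x)
    # Split the fused projection back into query, key, and value.
    query, key, value = x.split(self.split_size, dim=2)
    query = self.split_heads(query)
    key = self.split_heads(key, k=True)
    value = self.split_heads(value)
    ...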
With this implementation, RegMean fails in the regmean_merge() function at line 163, gram_m_ws.append(torch.matmul(param_grams, param.transpose(0, 1))), because the matrix dimensions do not match: param_grams is [1024, 1024], while param is [1024, 1024*3].
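One possible direction, offered as a sketch rather than as the repository's method: Hugging Face's Conv1D stores its weight as [in_features, out_features], the transpose of the nn.Linear layout, so the transpose at line 163 is what makes the shapes disagree. The helper below is hypothetical (regmean_merge_weight and its weight_is_in_out flag are not part of the repository), uses toy sizes in place of 1024, and omits RegMean's scaling of the off-diagonal Gram entries. It applies the closed form W = (sum_i G_i)^-1 sum_i G_i W_i with the weight kept in [in, out] orientation.

```python
import torch


def regmean_merge_weight(grams, weights, weight_is_in_out):
    """Merge one linear weight across models with the RegMean closed form.

    grams: list of [in, in] Gram matrices (X^T X of each model's layer inputs).
    weights: list of weight tensors, one per model, all in the same layout.
    weight_is_in_out: True for an [in, out] layout (HF Conv1D),
        False for an [out, in] layout (nn.Linear).
    Returns the merged weight in the same layout as the inputs.
    """
    gram_sum = torch.stack(grams).sum(dim=0)
    # Bring every weight to [in, out] before multiplying by its [in, in] Gram matrix.
    gram_w = [
        g @ (w if weight_is_in_out else w.transpose(0, 1))
        for g, w in zip(grams, weights)
    ]
    gram_w_sum = torch.stack(gram_w).sum(dim=0)
    # Solve (sum G) W = sum(G W) instead of forming the inverse explicitly.
    merged = torch.linalg.solve(gram_sum, gram_w_sum)
    return merged if weight_is_in_out else merged.transpose(0, 1)


# Toy sizes standing in for the [1024, 1024] and [1024, 1024*3] shapes above.
d_in, d_out, n_tokens = 4, 4, 32
torch.manual_seed(0)
inputs = [torch.randn(n_tokens, d_in, dtype=torch.float64) for _ in range(2)]
grams = [x.T @ x for x in inputs]  # each [d_in, d_in]
c_attn_weights = [
    torch.randn(d_in, 3 * d_out, dtype=torch.float64) for _ in range(2)
]  # fused query/key/value weights, each [d_in, 3 * d_out]

merged = regmean_merge_weight(grams, c_attn_weights, weight_is_in_out=True)
print(merged.shape)  # torch.Size([4, 12])
```

Because the solve acts on each output column independently, merging the fused c_attn weight this way gives the same result as splitting it into query, key, and value blocks and merging each block separately, as long as all three share the same input Gram matrix.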