qiguming


```python
if crossattn:
    detach = torch.ones_like(key)
    detach[:, :1, :] = detach[:, :1, :] * 0.
    key = detach * key + (1 - detach) * key.detach()
    value = detach * value + (1 - detach) * value.detach()
```

Why stop the gradient of the first key-value...
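To see the effect of this masking in isolation, here is a minimal, self-contained sketch (the tensor shapes and variable names are made up for illustration, not taken from the repository). The forward values of `key`/`value` are unchanged by the trick; only the gradient path for the first token is cut off.

```python
import torch

# Hypothetical shapes, just to illustrate the masking trick above.
batch, seq_len, dim = 2, 4, 8
key = torch.randn(batch, seq_len, dim, requires_grad=True)

# Mask is 1 everywhere except the first token along the sequence dimension.
detach = torch.ones_like(key)
detach[:, :1, :] = 0.

# Forward value is identical to `key`; only the gradient path differs:
# token 0 routes through key.detach() (no grad), the rest route through key.
masked_key = detach * key + (1 - detach) * key.detach()

masked_key.sum().backward()
print(key.grad[:, 0].abs().sum())   # tensor(0.)  -> gradient stopped for token 0
print(key.grad[:, 1:].abs().sum())  # tensor(48.) -> gradient flows for the rest
```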