sd-forge-layerdiffuse
subtraction in attention sharing mechanism
In the implementation of attention sharing, I noticed there is a stacked temporal attention adapter. My question is: why is the input `h` subtracted from `modified_hidden_states`? Could you share the rationale behind this design? Thanks!
https://github.com/layerdiffusion/sd-forge-layerdiffuse/blob/e4d5060e05c7b4337a3258bb03c4e3ad2f8b15bb/lib_layerdiffusion/attention_sharing.py#L131-L137
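To make the question concrete, here is a minimal sketch of the pattern I am asking about. This is not the repository's actual code: `TemporalAdapterSketch`, `proj`, and `scale` are hypothetical names, and the sketch assumes one possible reading, namely that subtracting the input isolates the adapter's contribution as a residual delta that can be scaled or gated before being added back.

```python
import torch
import torch.nn as nn


class TemporalAdapterSketch(nn.Module):
    """Illustrative stacked temporal attention adapter whose output is
    expressed as a delta from its input (all names are hypothetical)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
        # h: (batch, tokens, dim)
        x = self.norm(h)
        attn_out, _ = self.attn(x, x, x)
        modified_hidden_states = h + self.proj(attn_out)
        # The subtraction in question: isolate the adapter's contribution
        # as a pure delta, so the caller can blend it back in controllably.
        delta = modified_hidden_states - h
        return h + scale * delta  # scale = 0 recovers the input exactly
```

Under that reading, the subtraction would let the surrounding code interpolate between the original and the modified hidden states, for example applying the change only where a condition mask is active, instead of always committing to the adapter's full output:

```python
adapter = TemporalAdapterSketch(dim=320)
h = torch.randn(2, 64, 320)
assert torch.allclose(adapter(h, scale=0.0), h)  # zero scale is a no-op
```

Is that the intent here, or is there another reason for the subtraction?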