mup
fix: adopt mup/Transformers API for torch2.3
Adding batch_first as an __init__ argument of MultiHeadAttention is just a quick fix, since the argument is simply ignored. It does the job, though.
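
A minimal sketch of what such a quick fix might look like; the class body and argument names here are illustrative assumptions, not the exact mup module. The idea is only that torch 2.3's Transformer layers pass batch_first to their attention module, so the constructor must accept it even if the implementation does not act on it:

```python
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Custom multi-head attention (illustrative sketch, not the actual mup code)."""

    def __init__(self, embed_dim, num_heads, dropout=0.0, batch_first=False):
        super().__init__()
        # Quick fix: accept ``batch_first`` so callers built against torch 2.3's
        # Transformer API can construct this module. The flag is stored but
        # otherwise ignored; inputs are still treated as (seq_len, batch, embed_dim).
        self.batch_first = batch_first
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)

    def forward(self, query, key, value, **kwargs):
        # No reshaping based on ``self.batch_first`` is done here, which is
        # exactly why ignoring the argument is only a stopgap.
        return self.attn(query, key, value, **kwargs)
```

Because the flag is ignored, callers that actually pass batch-first tensors would still need to transpose them to sequence-first themselves; a proper fix would honor the flag inside forward.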