mup
fix: adopt mup/Transformers API for torch2.3
Adding batch_first as an __init__ argument of MultiHeadAttention is just a quick fix, since the argument is simply ignored. It does the job, though.
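
A minimal sketch of what such a quick fix might look like; the class body and argument names here are illustrative assumptions, not the exact mup module. The idea is only that torch 2.3's Transformer layers pass batch_first to their attention module, so the constructor must accept it even if the implementation does not act on it:

```python
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Custom multi-head attention (illustrative sketch, not the actual mup code)."""

    def __init__(self, embed_dim, num_heads, dropout=0.0, batch_first=False):
        super().__init__()
        # Quick fix: accept ``batch_first`` so callers built against torch 2.3's
        # Transformer API can construct this module. The flag is stored but
        # otherwise ignored; inputs are still treated as (seq_len, batch, embed_dim).
        self.batch_first = batch_first
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)

    def forward(self, query, key, value, **kwargs):
        # No reshaping based on ``self.batch_first`` is done here, which is
        # exactly why ignoring the argument is only a stopgap.
        return self.attn(query, key, value, **kwargs)
```

Because the flag is ignored, callers that actually pass batch-first tensors would still need to transpose them to sequence-first themselves; a proper fix would honor the flag inside forward.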