support for gemma-2
Could you please add fine-tuning support for gemma-2? It has good multilingual capabilities and is a strong candidate for fine-tuning on languages other than English. Its different sizes also make it attractive for fine-tuning on different tasks. I would gladly help but am not knowledgeable enough. Thank you!
Actually, inspired by a current Kaggle competition, it would be a really good idea to add this pretty soon.
Thanks @almugabo for creating the issue. I think this will be a bit of effort, quickly jotting down a couple of things I'm aware of that we'd need to support:
- [ ] Logit softcapping in attention layer
- [ ] Logit softcapping in output layer
- [ ] Sliding window attention
- [ ] Expose sliding window attention in configurable set of layers
- [ ] Post layernorm for both attention and FFN (a hacky way to do this is to just use `attn_scale` and `mlp_scale`; see the sketch after this list)
- [ ] Model builders for 2B, 9B, 27B sizes
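For the items that don't touch the attention kernel itself, here's a rough sketch of what the extra post-norms and the final-logit softcapping could look like. This is illustrative only, not torchtune code: the module/function names are made up, and the 30.0 softcap is my reading of the released Gemma-2 config, so please double-check it.

```python
# Sketch only: Gemma-2-style pre+post RMSNorm around each sublayer, plus
# tanh softcapping of the final output logits. Names are hypothetical.
import torch
import torch.nn as nn

FINAL_LOGIT_SOFTCAP = 30.0  # assumption: value from the public Gemma-2 config


class Gemma2Block(nn.Module):
    """Pre- *and* post-RMSNorm around both the attention and FFN sublayers."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, dim: int):
        super().__init__()
        self.attn = attn
        self.mlp = mlp
        self.pre_attn_norm = nn.RMSNorm(dim)
        self.post_attn_norm = nn.RMSNorm(dim)  # this is the piece the *_scale hooks could carry
        self.pre_mlp_norm = nn.RMSNorm(dim)
        self.post_mlp_norm = nn.RMSNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adds happen *after* the post-norms, unlike a plain pre-norm block.
        x = x + self.post_attn_norm(self.attn(self.pre_attn_norm(x)))
        x = x + self.post_mlp_norm(self.mlp(self.pre_mlp_norm(x)))
        return x


def softcap_output_logits(logits: torch.Tensor, cap: float = FINAL_LOGIT_SOFTCAP) -> torch.Tensor:
    """Squash the LM-head logits into (-cap, cap) with tanh before the loss/sampling."""
    return cap * torch.tanh(logits / cap)
```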
For logit softcapping and sliding window attention, I suspect we can use FlexAttention APIs. See this blog post where they give explicit examples of each.
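For reference, roughly what those two look like with FlexAttention (PyTorch >= 2.5): a tanh `score_mod` for attention-logit softcapping and a `mask_mod` for the sliding window. The 50.0 softcap and 4096 window are assumptions based on my reading of the Gemma-2 config, and the exact window boundary convention should be checked against the reference implementation.

```python
# Sketch of Gemma-2 attention softcapping + sliding window via FlexAttention.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

ATTN_LOGIT_SOFTCAP = 50.0  # assumption: attention softcap from the Gemma-2 config
SLIDING_WINDOW = 4096      # assumption: local attention window size


def softcap_score_mod(score, b, h, q_idx, kv_idx):
    # Squash pre-softmax attention logits into (-softcap, softcap) with tanh.
    return ATTN_LOGIT_SOFTCAP * torch.tanh(score / ATTN_LOGIT_SOFTCAP)


def sliding_window_causal(b, h, q_idx, kv_idx):
    # Causal mask restricted to the most recent SLIDING_WINDOW tokens.
    causal = q_idx >= kv_idx
    in_window = q_idx - kv_idx <= SLIDING_WINDOW
    return causal & in_window


# Example shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 2048, 64, device="cuda")
k = torch.randn(1, 8, 2048, 64, device="cuda")
v = torch.randn(1, 8, 2048, 64, device="cuda")

block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=2048, KV_LEN=2048)
out = flex_attention(q, k, v, score_mod=softcap_score_mod, block_mask=block_mask)
```

In practice you'd wrap `flex_attention` in `torch.compile` to get the fused kernel, and only apply the sliding-window mask on the alternating local-attention layers (which is where the "configurable set of layers" item above comes in).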
Hello, I have started the addition of gemma2 and will now create my PR in WIP mode. I haven't run any tests yet but will do so soon!
Edit: My PR is here: #1835
@ebsmothers it would be great if you could have a quick look to validate the choices I made to implement sliding windows, pre/post layer normalisation, and softcapping. I would be happy to do things differently to keep the changes minimal (I tried as much as possible to keep all changes minimal).
I didn't know about FlexAttention, I will look into it!
ADDED!