
support for gemma-2

Open almugabo opened this issue 1 year ago • 2 comments

Could you please add fine-tuning support for gemma-2? It has good multilingual capabilities and is a strong candidate for fine-tuning on languages other than English. Its range of sizes also makes it attractive for fine-tuning on different tasks. I would gladly help but am not knowledgeable enough. Thank you!

almugabo avatar Oct 11 '24 15:10 almugabo

Actually, inspired by a current Kaggle competition, it would be a really good idea to add this soon.

krammnic avatar Oct 11 '24 15:10 krammnic

Thanks @almugabo for creating the issue. I think this will be a bit of effort, quickly jotting down a couple of things I'm aware of that we'd need to support:

  • [ ] Logit softcapping in attention layer
  • [ ] Logit softcapping in output layer
  • [ ] Sliding window attention
  • [ ] Expose sliding window attention in a configurable set of layers
  • [ ] Post layernorm for both attention and FFN (hacky thing to do is just use attn_scale and mlp_scale)
  • [ ] Model builders for 2B, 9B, 27B sizes

For logit softcapping and sliding window attention, I suspect we can use FlexAttention APIs. See this blog post where they give explicit examples of each.

ebsmothers avatar Oct 11 '24 15:10 ebsmothers

Hello, I have started adding gemma2 and will open my PR in WIP mode now. I haven't run any tests yet but will soon!

Edit: My PR is here: #1835

@ebsmothers it would be great if you could have a quick look to validate the choices I made to implement sliding window attention, pre/post layer normalisation, and softcapping. I would be happy to do things differently; I tried as much as possible to keep all changes minimal.

Optimox avatar Oct 15 '24 09:10 Optimox
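For context on the pre/post layer normalisation mentioned above: Gemma-2 normalises both before and after each sublayer, inside the residual connection. A minimal sketch of that block structure, with the attention, MLP, and norm functions passed in as stand-ins (all names here are illustrative, not the actual torchtune implementation):

```python
def gemma2_style_block(x, attn, mlp,
                       pre_attn_norm, post_attn_norm,
                       pre_mlp_norm, post_mlp_norm):
    """Sketch of a transformer block with both pre- and post-layernorm
    around each sublayer: the residual adds the *post-normalised*
    sublayer output. Hypothetical structure for illustration."""
    x = x + post_attn_norm(attn(pre_attn_norm(x)))
    x = x + post_mlp_norm(mlp(pre_mlp_norm(x)))
    return x

# Toy scalar run with identity norms, attn = double, mlp = add one:
identity = lambda v: v
out = gemma2_style_block(
    1.0,
    attn=lambda v: 2 * v,
    mlp=lambda v: v + 1,
    pre_attn_norm=identity, post_attn_norm=identity,
    pre_mlp_norm=identity, post_mlp_norm=identity,
)
print(out)  # 1 + 2 = 3, then 3 + 4 = 7
```

The "hacky" alternative mentioned earlier (reusing `attn_scale`/`mlp_scale` hooks) would slot the post-norm in as a scaling applied to the sublayer output before the residual add, which is the same position in the dataflow.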

I didn't know about FlexAttention, I will look into it!

Optimox avatar Oct 15 '24 10:10 Optimox

ADDED!

joecummings avatar Dec 10 '24 11:12 joecummings