10 comments by Fizz~

lgtm! For what it's worth, there were some questions when this PR was originally opened about whether schedulefree's optimizers were effective on transformers; I did some testing a while ago,...
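Since that comment is about trying schedule-free optimizers on transformer models, here is a minimal sketch of direct usage, assuming the `schedulefree` package from facebookresearch/schedule_free; the model, data, and hyperparameters below are toy placeholders, not anything from the thread:

```python
import torch
import schedulefree

# toy stand-ins for a real model and dataloader
model = torch.nn.Linear(16, 2)
loader = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(10)]

optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3, warmup_steps=100)

optimizer.train()  # schedule-free optimizers need explicit train/eval mode switches
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

optimizer.eval()  # switch to the averaged weights before evaluating or saving
```

The `train()`/`eval()` calls are the main gotcha: schedule-free keeps an averaged copy of the weights, and evaluating or checkpointing without calling `optimizer.eval()` first uses the wrong ones.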

> upstreaming this @ [huggingface/transformers#30079](https://github.com/huggingface/transformers/pull/30079)

Now that this is merged, is there anything axolotl still needs to implement?
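If the upstreamed feature is the schedule-free support from that transformers PR, the Trainer-level usage would presumably reduce to a config switch. This is a hedged sketch; the exact `optim` string and scheduler pairing should be checked against the merged PR:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    optim="schedule_free_adamw",   # requires `pip install schedulefree`
    lr_scheduler_type="constant",  # schedule-free replaces the LR schedule itself,
                                   # so a constant scheduler is the usual pairing
    learning_rate=2e-5,
)
```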

Is there a way to add one without quantization? All the current ones in there have some form of quantization attached to them.

Looks like it needs `transformers>=4.47.0`; am I good to bump the version in the PR?

![image](https://github.com/user-attachments/assets/39577683-185a-4d44-9f48-d49b7ea85bb7)

Other than a TF version mismatch when installing Aphrodite, it seems to work fine.

...looks like this puppy has some fixing to do; that graph makes zero sense.

![image](https://github.com/user-attachments/assets/fbae377a-7356-40a3-83a4-a34ce9ea11ba)
![image](https://github.com/user-attachments/assets/1fd2c9b7-d876-4894-b316-818aa1c9d8cb)

Either the slight jank I did to get it working on unpinned modern...

Ohhh, it reports gradient-accumulation steps as individual steps 🤦‍♀️ That explains why the graph is funky! And thanks for the advice, I was trying out MLM initially...
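A toy illustration of why that miscount distorts the graph, with made-up numbers: the optimizer only advances once per accumulated batch, so counting micro-batches stretches the x-axis by the accumulation factor.

```python
# hypothetical run: the logger counted every forward/backward pass
micro_batch_steps = 1200
gradient_accumulation_steps = 4

# the optimizer (and the loss curve's x-axis) should only advance once
# per accumulated batch
optimizer_steps = micro_batch_steps // gradient_accumulation_steps
print(optimizer_steps)  # 300 real steps, so the graph looked 4x stretched
```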

Any updates on this? It's likely required to get proper performance out of the Gemma 2 models.

![image](https://github.com/user-attachments/assets/d45b38e5-f578-4120-a00c-eeb30d2cd53c)

FWIW, an MN LoRA trained fine for me on 1x GPU, but I'm still seeing people occasionally complain about this being a bug. Possibly a multi-GPU issue? Seems to persist...