
Mixtral branch: What option should I choose when I want to do some finetuning after the merge?

PhilipMay opened this issue on Jan 16 '24 · 5 comments

The parameter descriptions of the "hidden" and "random" gate_mode options do not exactly explain what to do when I want to fine-tune later.

Is it even useful (or possible) to fine-tune after merging with the "hidden" option? What would you recommend when I want to fine-tune later: "hidden" or "random"?

Maybe you could add that to the documentation?

PhilipMay · Jan 16 '24 21:01

If you're planning to fine-tune the model, I would recommend using the random option. That's what I use for my tiny MoE pre-training experiments.

The hidden option would probably work too. It might be a worse initialization - if you're using a balancing loss (and you probably should), it'll be very upset with you at the start. But if set up well, it might also give a good initial hint at the capabilities of each "expert". Frankly, I don't know - I'd love to run a bunch of experiments on what works best for downstream fine-tuning, but I don't have the resources to fully explore it.
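If you want to see how "upset" the balancing loss would be at the start, a quick sanity check like the sketch below shows how skewed the router is right after the merge. This assumes a Mixtral-style merged model loadable with transformers; the "./merged-moe" path is just a placeholder for wherever the merge output lives.

# Rough sanity check of router balance on a freshly merged Mixtral-style MoE.
# "./merged-moe" is a placeholder for the merge output directory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./merged-moe"
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# out.router_logits is a tuple with one (num_tokens, num_experts) tensor per MoE layer.
for layer_idx, logits in enumerate(out.router_logits):
    counts = torch.bincount(logits.argmax(dim=-1), minlength=logits.shape[-1])
    print(f"layer {layer_idx}: top-1 expert counts {counts.tolist()}")

With a random gate the counts should start out roughly uniform; with hidden-state initialization they may not.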

Hope this helps!

cg123 · Jan 17 '24 04:01

@cg123 Did you combine multiple LLMs into a MoE and then do some further training and experimentation? Are you willing to share the configurations and the ideas behind this?

We have the resources to do training after merging, and I am also willing to share the results and insights.

Many thanks, Philip

PhilipMay · Jan 19 '24 06:01

I've done some very minimal experiments, yes - not sure how useful they'll be to you but I'm happy to share.

I trained a few different versions of an 8x MoE of smol_llama-101M-GQA. All of them were from the same initialization, using this config:

base_model: BEE-spoke-data/smol_llama-101M-GQA
gate_mode: random
experts:
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []
  - source_model: BEE-spoke-data/smol_llama-101M-GQA
    positive_prompts: []

The main experiment I did was with the auxiliary loss function. After a basic hyperparameter sweep (I ended up with an LR of 0.0001 for this setup), I trained one version using the standard Mixtral auxiliary balancing loss, and another using that loss plus the z-loss introduced by ST-MoE. The data used was a very small subset of SlimPajama, around 50M tokens.
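For anyone who wants to replicate the loss setup, the two auxiliary terms look roughly like this. This is a PyTorch sketch of the formulas (Switch/Mixtral-style load balancing and the ST-MoE router z-loss), not the exact training code I used, and the coefficients at the bottom are purely illustrative.

# Sketch of the two auxiliary terms: load-balancing loss and router z-loss.
import torch
import torch.nn.functional as F

def mixtral_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Switch/Mixtral-style load-balancing loss: num_experts * sum_e f_e * P_e."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                # (tokens, experts)
    _, selected = torch.topk(probs, top_k, dim=-1)          # (tokens, top_k)
    # f_e: fraction of tokens that route to each expert in their top-k.
    dispatch = F.one_hot(selected, num_experts).float().max(dim=1).values
    tokens_per_expert = dispatch.mean(dim=0)
    # P_e: mean router probability assigned to each expert.
    prob_per_expert = probs.mean(dim=0)
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """ST-MoE router z-loss: mean squared log-sum-exp of the router logits."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

# Example: combine both terms on random logits (coefficients are illustrative guesses).
logits = torch.randn(512, 8)  # 512 tokens routed over 8 experts
aux = 0.02 * mixtral_balancing_loss(logits) + 0.001 * router_z_loss(logits)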

In evaluation, the version with the vanilla balancing loss got an MMLU score of 24.69 and the version with z-loss scored 25.07; the base 101M model scores 24.24. So both loss configurations beat the base model, and the main takeaway from this experiment is that sparse upcycling with a configuration like this can work - honestly much better than I was expecting. The second takeaway was that, at least with this specific model, z-loss made training more stable and ended up with noticeably better results.
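For reference, the kind of lm-evaluation-harness call that produces numbers like these might look like the sketch below. The model path is a placeholder, and the harness version, task name, and few-shot setting may differ from what was actually run.

# Rough sketch of an MMLU evaluation with lm-evaluation-harness (v0.4-style API).
# "./merged-moe-ft" is a placeholder for the fine-tuned checkpoint being scored.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged-moe-ft,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task and aggregated accuracy numbers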

So this is really only the barest start at exploring what can be done here. It's not much, but I hope it's helpful - and if you do end up with any interesting results, I'd love to see them.
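Since the training side keeps coming up: this isn't the exact script I used, but a minimal Trainer-based fine-tuning setup for the merged model would look something like the following. The dataset, paths, and the aux-loss coefficient are placeholders; the 1e-4 learning rate matches the sweep above. Setting output_router_logits=True makes the model add its built-in balancing loss (scaled by router_aux_loss_coef) to the language-modeling loss.

# Minimal sketch: fine-tune a merged Mixtral-style MoE with Hugging Face Trainer.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_path = "./merged-moe"  # placeholder: output of the merge
tok = AutoTokenizer.from_pretrained(model_path)
tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(model_path)
model.config.output_router_logits = True   # enable the auxiliary balancing loss
model.config.router_aux_loss_coef = 0.02   # placeholder coefficient, tune as needed

ds = load_dataset("text", data_files={"train": "train.txt"})["train"]  # placeholder data
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./moe-ft", per_device_train_batch_size=4,
                           learning_rate=1e-4, num_train_epochs=1, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()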

cg123 · Jan 25 '24 06:01

Thanks for sharing, @cg123! Agreed, that is promising.

Do you mind sharing the scripts you used for fine-tuning? I am very interested in this aspect of LLM research as well. Perhaps we can establish an OSS repo for this :)

CC: @PhilipMay

mstallone · Feb 01 '24 23:02

+1

sd3ntato · Feb 02 '24 22:02