Support for Mixtral MoE
Any plans to support Mixtral MoE?
Yep - it's on our roadmap - but to be honest, our bandwidth as a 2-person startup is extremely limited. We thought about waiting to see what architecture Llama 3 is, then deciding whether we should support MoE-type models.
New research from Llama-Pro and other depth-extension and sparsity papers suggests that the reason MoEs work is the sparse nature of the computation. The theory is that if one trains the models for longer, the sparsity slowly diminishes and the weights become dense - but this is speculation.
So, in the interest of prioritizing things well, I wanted to wait and see what Llama 3 will be - more tokens, smaller models, or will Llama 3 also be a MoE? And if it's a MoE, what architecture?
The theory is that if one trains the models for longer, the sparsity slowly diminishes and the weights become dense - but this is speculation.
Is this based on some experiments? Do you think very large models (>50B params) will also become dense? For small models it makes sense.
I wanted to wait and see what Llama 3 will be - more tokens, smaller models, or will Llama 3 also be a MoE? And if it's a MoE, what architecture?
Yeah, makes sense. I just tried Mistral 7B and the speed-up and memory optimization were really nice. Really interested in trying it out on larger models.
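For anyone else who wants to try this, a minimal sketch of an Unsloth Mistral 7B setup looks roughly like the following (the model name and LoRA hyperparameters here are illustrative, not the exact ones I used):

```python
# Rough sketch of an Unsloth Mistral 7B LoRA setup; the checkpoint name and
# hyperparameters are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters via Unsloth's patched PEFT path.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)
```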
@danielhanchen I'd like to take a stab at it if that's ok. But I'm coming at it completely cold so it might take a while.
@souvik3333 Not sure where I read it, sadly. @tohrnii Oh cool - more than happy for more help! I think a sub-optimized Mixtral is technically doable (if one copies and pastes HF's modeling code and swaps the relevant modules for Unsloth's) - I just have to add a faster kernel for the MoE MLP layer.
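As a quick illustration of why the copy-and-paste route is viable (just a sanity check, nothing Unsloth-specific), the Mixtral config shows that the only new architectural piece is the sparse MoE feed-forward block, while the attention side follows the Mistral setup that's already optimized:

```python
# Sanity check: Mixtral's config only adds MoE routing parameters on top of a
# Mistral-style attention stack, so most of the existing code path carries over.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
print(cfg.model_type)           # "mixtral"
print(cfg.num_local_experts)    # 8 experts per MoE MLP block
print(cfg.num_experts_per_tok)  # 2 experts routed per token
# hidden_size, num_attention_heads and num_key_value_heads mirror the
# Mistral-style attention setup, which is the part that needs no new kernels.
```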
I'd be interested in mixtral-instruct as well.
Ohh noo @souvik3333 no need to close the issue!! It's good to keep this as a tracker for me internally :)
What would it take for the community to help?
@findalexli Oh so @tohrnii has been working on Mixtral :)) If you can clone the branch and try it out, or maybe help them, that'll be much appreciated :)
Hmm, I would have thought it's not too bad to convert Mixtral to Unsloth. Basically, the only modules that should be set to trainable are the attention ones, and those are set up just like Mistral's (every expert uses the same attention matrices). This means there's no need to do any back-prop updates for the feed-forward layers (which are the experts).
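In plain HF + PEFT terms (just a sketch of the idea, not Unsloth's internals), that would look something like targeting only the attention projections and leaving the experts frozen:

```python
# Sketch (plain HF + PEFT): LoRA on the shared attention projections only; the
# per-expert feed-forward weights (w1/w2/w3 inside block_sparse_moe) stay
# frozen, so no back-prop updates hit the experts.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the attention adapters are trainable
```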
@RonanKMcGovern Yes, it can be ported over - in fact everything can be left as-is, except for optimizing the MoE part of the MLP layer - there definitely is a speed-up there, albeit a smaller one.