Support for Mixtral MoE
Any plans to support Mixtral MoE?
Yep - it's on our roadmap - but to be honest, our bandwidth as a 2-person startup is extremely limited. We thought about waiting to see what architecture Llama 3 is, then deciding whether we should support MoE-type models.
New research from Llama-Pro and other depth-extension and sparsity papers suggests that the reason MoEs work is the sparse nature of the computation. The theory is that if one trains the models for longer, the sparsity slowly diminishes and the weights become dense - but this is speculation.
So, in the interest of prioritizing things well, I wanted to wait and see what Llama 3 will be - more tokens, smaller models, or will Llama 3 also be a MoE? And if it's a MoE, what architecture?
The theory is that if one trains the models for longer, the sparsity slowly diminishes and the weights become dense - but this is speculation.
Is this based on some experiments? Do you think very large models (>50B params) will also become dense? For small models it makes sense.
I wanted to wait and see what Llama 3 will be - more tokens, smaller models, or will Llama 3 also be a MoE? And if it's a MoE, what architecture?
Yeah, makes sense. I just tried Mistral 7B and the speed-up and memory optimization were really nice. Really interested in trying it out on larger models.
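For anyone else who wants to try this, a minimal sketch of an Unsloth Mistral 7B setup looks roughly like the following (the model name and LoRA hyperparameters here are illustrative, not the exact ones I used):

```python
# Rough sketch of an Unsloth Mistral 7B LoRA setup; the checkpoint name and
# hyperparameters are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters via Unsloth's patched PEFT path.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)
```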
@danielhanchen I'd like to take a stab at it if that's ok. But I'm coming at it completely cold so it might take a while.
@souvik3333 Not sure where I read it, sadly. @tohrnii Oh cool - more than happy for more help! I think a sub-optimized Mixtral is technically doable (if one copies and pastes HF's modeling code and swaps the relevant modules for Unsloth's) - I just have to add a faster kernel for the MoE MLP layer.
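As a quick illustration of why the copy-and-paste route is viable (just a sanity check, nothing Unsloth-specific), the Mixtral config shows that the only new architectural piece is the sparse MoE feed-forward block, while the attention side follows the Mistral setup that's already optimized:

```python
# Sanity check: Mixtral's config only adds MoE routing parameters on top of a
# Mistral-style attention stack, so most of the existing code path carries over.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
print(cfg.model_type)           # "mixtral"
print(cfg.num_local_experts)    # 8 experts per MoE MLP block
print(cfg.num_experts_per_tok)  # 2 experts routed per token
# hidden_size, num_attention_heads and num_key_value_heads mirror the
# Mistral-style attention setup, which is the part that needs no new kernels.
```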
I'd be interested in mixtral-instruct as well.
Ohh noo @souvik3333 no need to close the issue!! It's good to keep this as a tracker for me internally :)
What would it take for the community to help?
@findalexli Oh so @tohrnii has been working on Mixtral :)) If you can clone the branch and try it out, or maybe help them, that'll be much appreciated :)
Hmm, I would have thought it's not too bad to convert Mixtral to Unsloth. Basically, the only modules that should be set to trainable are the attention ones, and those are set up just like Mistral's (every expert uses the same attention matrices). This means there's no need to do any back-prop updates for the feed-forward layers (which are the experts).
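In plain HF + PEFT terms (just a sketch of the idea, not Unsloth's internals), that would look something like targeting only the attention projections and leaving the experts frozen:

```python
# Sketch (plain HF + PEFT): LoRA on the shared attention projections only; the
# per-expert feed-forward weights (w1/w2/w3 inside block_sparse_moe) stay
# frozen, so no back-prop updates hit the experts.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the attention adapters are trainable
```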
@RonanKMcGovern Yes, it can be ported over - in fact everything can be left as-is, except for optimizing the MoE part of the MLP layer - there definitely is a speed-up there, albeit a smaller one.