Support for Mixtral MoE

Open · souvik3333 opened this issue 1 year ago • 10 comments

Any plans to support Mixtral MoE?

souvik3333 · Jan 25 '24

Yep, it's on our roadmap, but to be honest our bandwidth as a two-person startup is extremely limited. We thought about waiting to see what architecture Llama 3 uses, then deciding whether we should support MoE-type models.

New research from Llama-Pro and other depth-extension and sparsity papers suggests that the reason MoEs work is the sparse nature of their computation. The theory, though, is that if one trains the models for longer, the sparsity slowly diminishes and the weights become dense. This is speculation, however.

So, all things considered, I wanted to wait and see what Llama 3 will be: more tokens, smaller models, or will Llama 3 also be a MoE? And if a MoE, what architecture?

danielhanchen · Jan 25 '24

The theory, though, is that if one trains the models for longer, the sparsity slowly diminishes and the weights become dense. This is speculation, however.

Is this based on experiments? Do you think very large models (>50B params) will also become dense? For small models it makes sense.

I wanted to wait and see what Llama 3 will be: more tokens, smaller models, or will Llama 3 also be a MoE? And if a MoE, what architecture?

Yeah, makes sense. I just tried Mistral 7B, and the speedup and memory optimization were really nice. Really interested to try it out on larger models.

souvik3333 · Jan 25 '24
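
For context, that Mistral 7B run typically looks something like the following with Unsloth's public API. This is a minimal sketch; the checkpoint name and hyperparameters are illustrative, not taken from this thread:

```python
# Sketch: load a 4-bit Mistral 7B with Unsloth and attach LoRA adapters.
# Checkpoint name and hyperparameters are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    dtype=None,          # auto-detect (bfloat16 on recent GPUs)
    load_in_4bit=True,   # QLoRA-style loading to cut memory use
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```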

@danielhanchen I'd like to take a stab at it if that's OK, but I'm coming at it completely cold, so it might take a while.

tohrnii · Jan 25 '24

@souvik3333 Not sure where I read it, sadly. @tohrnii Oh cool, more than happy for more help! I think technically a sub-optimized Mixtral is in fact doable (if one copies HF's code and replaces the relevant parts with Unsloth's); I just have to add a faster kernel for the MLP layer.

danielhanchen · Jan 25 '24
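
For reference, the computation such a faster kernel would target is roughly top-2 routing over per-expert SwiGLU MLPs. Below is a simplified PyTorch sketch in the spirit of HF's Mixtral block; it is not HF's or Unsloth's actual code, and the class name and details (softmax placement, batching) are illustrative:

```python
import torch
import torch.nn.functional as F
from torch import nn

class SparseMoeMLP(nn.Module):
    """Simplified top-2 MoE MLP sketch in the spirit of Mixtral's block;
    the real HF implementation differs in details."""

    def __init__(self, hidden_size, intermediate_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        # Each expert is a SwiGLU MLP: w2(silu(w1(x)) * w3(x)).
        self.w1 = nn.ModuleList(nn.Linear(hidden_size, intermediate_size, bias=False) for _ in range(num_experts))
        self.w3 = nn.ModuleList(nn.Linear(hidden_size, intermediate_size, bias=False) for _ in range(num_experts))
        self.w2 = nn.ModuleList(nn.Linear(intermediate_size, hidden_size, bias=False) for _ in range(num_experts))

    def forward(self, x):
        # x: (num_tokens, hidden_size)
        router_probs = self.gate(x).softmax(dim=-1)                  # (tokens, experts)
        weights, chosen = torch.topk(router_probs, self.top_k, -1)   # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)        # renormalize over the top-k
        out = torch.zeros_like(x)
        for e in range(len(self.w1)):
            rows, slot = (chosen == e).nonzero(as_tuple=True)        # tokens routed to expert e
            if rows.numel() == 0:
                continue  # this expert sees no tokens: the "sparse" part
            h = x[rows]
            h = self.w2[e](F.silu(self.w1[e](h)) * self.w3[e](h))
            out[rows] += weights[rows, slot].unsqueeze(-1) * h
        return out
```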

I'd be interested in Mixtral-Instruct as well.

trenkert · Jan 25 '24

Ohh noo @souvik3333, no need to close the issue!! It's good to keep this as a tracker for me internally :)

danielhanchen · Jan 27 '24

What would it take for the community to help?

findalexli · Feb 09 '24

@findalexli Oh, so @tohrnii has been working on Mixtral :)) If you can clone their branch and try it out, or maybe help them, that would be much appreciated :)

danielhanchen · Feb 09 '24

Hmm, I would have thought it's not too bad to convert Mixtral to Unsloth. Basically, the only modules that should be set to trainable are the attention ones, and those are set up just like Mistral's (every expert uses the same attention matrices). This means there's no need to do any back-prop updates for the feed-forward layers (which are the experts).

RonanKMcGovern · Mar 04 '24
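
Concretely, that setup amounts to a LoRA config whose target_modules cover only the shared attention projections, leaving the per-expert feed-forward weights frozen. A minimal sketch with plain Transformers + PEFT (independent of Unsloth; hyperparameters are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load Mixtral and attach LoRA adapters only to the attention projections,
# which are shared by all experts; the per-expert feed-forward weights
# (the experts themselves) stay frozen. Hyperparameters are illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the attention LoRA weights show up as trainable
```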

@RonanKMcGovern Yes, it can be ported over; in fact, everything can be left as-is except for the optimization of the MoE part of the MLP layer. There definitely is a speedup, albeit a smaller one.

danielhanchen · Mar 04 '24