mergekit
Separate MoE?
Hi! Sorry about asking so many questions, but do you know if it's possible to "unmerge" a MoE model and extract each expert as a separate model? For example, could I get eight 7B models from Mixtral? Thanks!
No worries, happy to answer any questions!
For a "real" MoE this can't really be meaningfully done. The "experts" in Mixtral, for example, aren't actually eight entire Mistral models - there are eight experts per layer, and they're simple MLPs. The order of these experts is completely arbitrary (you can permute the router and the experts to match and get the exact same output) so it doesn't really make sense to group them by number. A single token will activate different experts at each layer and I'd bet money that just about any arbitrary sequence of experts is both meaningful and useful.
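To make the permutation point concrete, here's a toy sketch in plain PyTorch (not mergekit or transformers code, with made-up sizes and a simplified top-2 router) showing that reordering the router's rows and the expert list together leaves the layer's output unchanged:

import torch

hidden, n_experts, top_k = 16, 8, 2
gate = torch.nn.Linear(hidden, n_experts, bias=False)
experts = [
    torch.nn.Sequential(
        torch.nn.Linear(hidden, 32, bias=False),
        torch.nn.SiLU(),
        torch.nn.Linear(32, hidden, bias=False),
    )
    for _ in range(n_experts)
]

def moe_layer(x, gate, experts):
    # Route each token to its top-k experts and mix their outputs by softmax score
    scores = gate(x).softmax(dim=-1)           # [tokens, n_experts]
    weights, idx = scores.topk(top_k, dim=-1)  # [tokens, top_k]
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

x = torch.randn(5, hidden)
perm = torch.randperm(n_experts)
gate_p = torch.nn.Linear(hidden, n_experts, bias=False)
gate_p.weight.data = gate.weight.data[perm]   # reorder the router's rows...
experts_p = [experts[int(i)] for i in perm]   # ...and the experts to match
print(torch.allclose(moe_layer(x, gate, experts), moe_layer(x, gate_p, experts_p), atol=1e-6))
# prints True: the permuted layer computes the same function, so expert numbering carries no meaning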
If you just extract the Nth expert for each layer and put it into a Mistral model with all the shared parameters, you get something exceptionally useless. Think endless meaningless token spam. I did this and uploaded them a while back here if you want to play with them.
For mergekit-produced MoEs, this can sort of be done! The only catch is that the self attention, embedding, LM head, and normalization parameters are all shared, so you'd be extracting a funky merge of the "expert" model with the base.
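Concretely, the shared-versus-expert split looks roughly like this (a sketch assuming the usual HF Mixtral-style tensor names; the router has no place in a dense Mistral model, so it gets dropped):

def is_expert_specific(name: str, expert: int) -> bool:
    # Only the per-layer MLP weights belong to a single expert; attention,
    # embeddings, norms, and the LM head are shared by all of them
    return f".block_sparse_moe.experts.{expert}." in name

for name in [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.block_sparse_moe.gate.weight",
    "model.layers.0.block_sparse_moe.experts.3.w1.weight",
]:
    print("expert 3" if is_expert_specific(name, 3) else "shared  ", name)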
Thanks for the detailed explanation! Do you know what would happen if you merged the experts of Mixtral with the linear method?
Unfortunately also uninterpretable garbage. :( Maybe there's a merge technique that would make something work, but I haven't found one yet.
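In case it helps to see what that experiment actually does: the rough sketch below averages the eight per-layer expert MLPs into a single dense Mistral MLP (an equal-weight linear merge) and drops the router, using mergekit's LazyTensorLoader/TensorWriter helpers. As said, the result is not a usable model.

from mergekit.io import TensorWriter, LazyTensorLoader
from mergekit.common import ModelReference

source_ref = ModelReference.model_validate("mistralai/Mixtral-8x7B-v0.1")
loader = LazyTensorLoader(source_ref.tensor_index())

def write_out_mean_expert(out_path: str, num_experts: int = 8):
    writer = TensorWriter(out_path, safe_serialization=True)
    for key in loader.index.tensor_paths:
        if ".block_sparse_moe" not in key:
            # Shared parameters are copied through unchanged
            writer.save_tensor(key, loader.get_tensor(key))
            continue
        if ".experts.0." not in key:
            # The router is dropped; experts 1-7 are folded in via expert 0's key below
            continue
        mean = sum(
            loader.get_tensor(key.replace(".experts.0.", f".experts.{i}."))
            for i in range(num_experts)
        ) / num_experts
        new_key = (
            key.replace(".block_sparse_moe.experts.0.w1", ".mlp.gate_proj")
            .replace(".block_sparse_moe.experts.0.w2", ".mlp.down_proj")
            .replace(".block_sparse_moe.experts.0.w3", ".mlp.up_proj")
        )
        writer.save_tensor(new_key, mean)
    writer.finalize()

write_out_mean_expert("/workspace/mixtral-mean-expert")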
Hmm, ok. Thanks for looking into this!
@cg123
Do you think you could share the script you used for separating the experts?
I want to try this with the newest 8x22B, released on Mistral's Twitter, then repair it with a LoRA and some continued pre-training, to see if I can get a useful 22B base model out of it.
Here is the magnet link to the new model (magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce) and the Twitter link, in case you got curious about it while reading this.
@NickWithBotronics
Sure! I was able to dig it up - here's the script I used for the Mixtral separation:
import tqdm

from mergekit.io import TensorWriter, LazyTensorLoader
from mergekit.common import ModelReference

source_ref = ModelReference.model_validate("mistralai/Mixtral-8x7B-v0.1")
loader = LazyTensorLoader(source_ref.tensor_index())

def write_out_expert(out_path: str, moe_index: int):
    writer = TensorWriter(out_path, safe_serialization=True)
    for key in loader.index.tensor_paths:
        keyp = key
        if ".block_sparse_moe" in key:
            # Skip the router and every expert except the one being extracted
            if f".experts.{moe_index}" not in key:
                continue
            # Rename the expert's MLP weights to the standard Mistral MLP names
            keyp = key.replace(
                f".block_sparse_moe.experts.{moe_index}.w1", ".mlp.gate_proj"
            )
            keyp = keyp.replace(
                f".block_sparse_moe.experts.{moe_index}.w2", ".mlp.down_proj"
            )
            keyp = keyp.replace(
                f".block_sparse_moe.experts.{moe_index}.w3", ".mlp.up_proj"
            )
        # Shared parameters (attention, embeddings, norms, LM head) are copied as-is
        writer.save_tensor(keyp, loader.get_tensor(key))
    writer.finalize()

for idx in tqdm.tqdm(range(8)):
    write_out_expert(f"/workspace/mixtral-expert-{idx}", idx)
This probably won't work unmodified with the magnet release. I believe Mistral's official releases use a different naming convention for the weights. Hopefully it's a helpful starting point though.
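If you do adapt it, a quick way to work out the key mapping the release actually needs is to print the tensor names from one of its shards. A minimal sketch using the safetensors library; the filename is a placeholder, so point it at whatever file the download actually contains:

from safetensors import safe_open

# List the tensor names in one shard so the rename rules above can be adjusted
with safe_open("consolidated.00.safetensors", framework="pt") as f:
    for name in sorted(f.keys()):
        print(name)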
Thank you so much! I’m sure it wasn’t easy to dig up a single Python file among all the different projects you work on. I appreciate it, and all the other open source work you do, like this entire repo!
Have you seen https://github.com/cognitivecomputations/extract-expert/blob/main/extract.py and https://huggingface.co/mmnga/Mixtral-Extraction-4x7B-Instruct-v0.1/blob/main/notebook/convert_mixtral_8x7b_to_4x7b_extract.ipynb ?