mergekit
Separate MoE?
Hi! Sorry about asking so many questions, but do you know if it's possible to "unmerge" a MoE model and extract each expert as a separate model? For example, could I get eight 7B models from Mixtral? Thanks!
No worries, happy to answer any questions!
For a "real" MoE this can't really be meaningfully done. The "experts" in Mixtral, for example, aren't actually eight entire Mistral models - there are eight experts per layer, and they're simple MLPs. The order of these experts is completely arbitrary (you can permute the router and the experts to match and get the exact same output) so it doesn't really make sense to group them by number. A single token will activate different experts at each layer and I'd bet money that just about any arbitrary sequence of experts is both meaningful and useful.
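To make the permutation point concrete, here's a toy sketch in plain PyTorch (not mergekit or transformers code, with made-up sizes and a simplified top-2 router) showing that reordering the router's rows and the expert list together leaves the layer's output unchanged:

import torch

hidden, n_experts, top_k = 16, 8, 2
gate = torch.nn.Linear(hidden, n_experts, bias=False)
experts = [
    torch.nn.Sequential(
        torch.nn.Linear(hidden, 32, bias=False),
        torch.nn.SiLU(),
        torch.nn.Linear(32, hidden, bias=False),
    )
    for _ in range(n_experts)
]

def moe_layer(x, gate, experts):
    # Route each token to its top-k experts and mix their outputs by softmax score
    scores = gate(x).softmax(dim=-1)           # [tokens, n_experts]
    weights, idx = scores.topk(top_k, dim=-1)  # [tokens, top_k]
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

x = torch.randn(5, hidden)
perm = torch.randperm(n_experts)
gate_p = torch.nn.Linear(hidden, n_experts, bias=False)
gate_p.weight.data = gate.weight.data[perm]   # reorder the router's rows...
experts_p = [experts[int(i)] for i in perm]   # ...and the experts to match
print(torch.allclose(moe_layer(x, gate, experts), moe_layer(x, gate_p, experts_p), atol=1e-6))
# prints True: the permuted layer computes the same function, so expert numbering carries no meaning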
If you just extract the Nth expert for each layer and put it into a Mistral model with all the shared parameters, you get something exceptionally useless. Think endless meaningless token spam. I did this and uploaded them a while back here if you want to play with them.
For mergekit-produced MoEs, this can sort of be done! The only catch is that the self attention, embedding, LM head, and normalization parameters are all shared, so you'd be extracting a funky merge of the "expert" model with the base.
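Concretely, the shared-versus-expert split looks roughly like this (a sketch assuming the usual HF Mixtral-style tensor names; the router has no place in a dense Mistral model, so it gets dropped):

def is_expert_specific(name: str, expert: int) -> bool:
    # Only the per-layer MLP weights belong to a single expert; attention,
    # embeddings, norms, and the LM head are shared by all of them
    return f".block_sparse_moe.experts.{expert}." in name

for name in [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.block_sparse_moe.gate.weight",
    "model.layers.0.block_sparse_moe.experts.3.w1.weight",
]:
    print("expert 3" if is_expert_specific(name, 3) else "shared  ", name)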
Thanks for the detailed explanation! Do you know what would happen if you merged the experts of Mixtral with the linear method?
Unfortunately also uninterpretable garbage. :( Maybe there's a merge technique that would make something work, but I haven't found one yet.
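In case it helps to see what that experiment actually does: the rough sketch below averages the eight per-layer expert MLPs into a single dense Mistral MLP (an equal-weight linear merge) and drops the router, using mergekit's LazyTensorLoader/TensorWriter helpers. As said, the result is not a usable model.

from mergekit.io import TensorWriter, LazyTensorLoader
from mergekit.common import ModelReference

source_ref = ModelReference.model_validate("mistralai/Mixtral-8x7B-v0.1")
loader = LazyTensorLoader(source_ref.tensor_index())

def write_out_mean_expert(out_path: str, num_experts: int = 8):
    writer = TensorWriter(out_path, safe_serialization=True)
    for key in loader.index.tensor_paths:
        if ".block_sparse_moe" not in key:
            # Shared parameters are copied through unchanged
            writer.save_tensor(key, loader.get_tensor(key))
            continue
        if ".experts.0." not in key:
            # The router is dropped; experts 1-7 are folded in via expert 0's key below
            continue
        mean = sum(
            loader.get_tensor(key.replace(".experts.0.", f".experts.{i}."))
            for i in range(num_experts)
        ) / num_experts
        new_key = (
            key.replace(".block_sparse_moe.experts.0.w1", ".mlp.gate_proj")
            .replace(".block_sparse_moe.experts.0.w2", ".mlp.down_proj")
            .replace(".block_sparse_moe.experts.0.w3", ".mlp.up_proj")
        )
        writer.save_tensor(new_key, mean)
    writer.finalize()

write_out_mean_expert("/workspace/mixtral-mean-expert")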
Hmm, ok. Thanks for looking into this!
@cg123
Do you think you could share the script you used for separating the experts?
I want to try this with the newest 8x22B, released on Mistral's Twitter, then repair it with a LoRA and some continued pre-training, to see if I can get a useful 22B base model out of it.
Here is the magnet link to the new model (magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce) and the Twitter link, in case you got curious about it while reading this.
@NickWithBotronics
Sure! I was able to dig it up - here's the script I used for the Mixtral separation:
import tqdm

from mergekit.io import TensorWriter, LazyTensorLoader
from mergekit.common import ModelReference

source_ref = ModelReference.model_validate("mistralai/Mixtral-8x7B-v0.1")
loader = LazyTensorLoader(source_ref.tensor_index())

def write_out_expert(out_path: str, moe_index: int):
    writer = TensorWriter(out_path, safe_serialization=True)
    for key in loader.index.tensor_paths:
        keyp = key
        if ".block_sparse_moe" in key:
            # Skip the router and every expert except the one being extracted
            if f".experts.{moe_index}" not in key:
                continue
            # Rename the expert's MLP weights to the standard Mistral MLP names
            keyp = key.replace(
                f".block_sparse_moe.experts.{moe_index}.w1", ".mlp.gate_proj"
            )
            keyp = keyp.replace(
                f".block_sparse_moe.experts.{moe_index}.w2", ".mlp.down_proj"
            )
            keyp = keyp.replace(
                f".block_sparse_moe.experts.{moe_index}.w3", ".mlp.up_proj"
            )
        # Shared parameters (attention, embeddings, norms, LM head) are copied as-is
        writer.save_tensor(keyp, loader.get_tensor(key))
    writer.finalize()

for idx in tqdm.tqdm(range(8)):
    write_out_expert(f"/workspace/mixtral-expert-{idx}", idx)
This probably won't work unmodified with the magnet release. I believe Mistral's official releases use a different naming convention for the weights. Hopefully it's a helpful starting point though.
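If you do adapt it, a quick way to work out the key mapping the release actually needs is to print the tensor names from one of its shards. A minimal sketch using the safetensors library; the filename is a placeholder, so point it at whatever file the download actually contains:

from safetensors import safe_open

# List the tensor names in one shard so the rename rules above can be adjusted
with safe_open("consolidated.00.safetensors", framework="pt") as f:
    for name in sorted(f.keys()):
        print(name)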
Thank you so much! I’m sure it wasn’t easy to dig up a single Python file among all the different projects you work on. I appreciate it, and all the other open source work you do, like this entire repo!
Have you seen https://github.com/cognitivecomputations/extract-expert/blob/main/extract.py and https://huggingface.co/mmnga/Mixtral-Extraction-4x7B-Instruct-v0.1/blob/main/notebook/convert_mixtral_8x7b_to_4x7b_extract.ipynb ?