mlp_only_layers is more flexible than decoder_sparse_step
Before this PR, the config field decoder_sparse_step decided which layers don't have experts, i.e. whether a layer gets a Qwen2MoeSparseMoeBlock or a Qwen2MoeMLP. However, this selection policy is not flexible enough: for example, with decoder_sparse_step = 2, layers 2, 4, 6, 8, ... use Qwen2MoeSparseMoeBlock and layers 1, 3, 5, 7, ... use Qwen2MoeMLP.
In this PR, the layer index list mlp_only_layers lets you choose for each layer whether it uses Qwen2MoeSparseMoeBlock or Qwen2MoeMLP; for example, only layer 12 can use Qwen2MoeMLP while all other layers use Qwen2MoeSparseMoeBlock.
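Concretely, the per-layer selection can be sketched as below (a rough sketch, not the exact diff: the helper name `build_layer_mlp` is mine, and the `(layer_idx + 1) % decoder_sparse_step` convention follows the existing check in modeling_qwen2_moe.py):

```python
from transformers.models.qwen2_moe.modeling_qwen2_moe import (
    Qwen2MoeMLP,
    Qwen2MoeSparseMoeBlock,
)

def build_layer_mlp(config, layer_idx):
    """Pick the feed-forward block for one decoder layer.

    mlp_only_layers (assumed to default to []) takes priority: any listed
    layer gets a dense Qwen2MoeMLP, every other layer keeps the original
    decoder_sparse_step rule for choosing the sparse MoE block.
    """
    use_moe = (
        layer_idx not in getattr(config, "mlp_only_layers", [])
        and config.num_experts > 0
        and (layer_idx + 1) % config.decoder_sparse_step == 0
    )
    if use_moe:
        return Qwen2MoeSparseMoeBlock(config)
    return Qwen2MoeMLP(config, intermediate_size=config.intermediate_size)
```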
Support for mlp_only_layers matters a lot on small-memory GPU devices such as the V100-16G. In my tests, Qwen1.5-MoE needed only a little more HBM than available before hitting a CUDA OOM; after setting only layer 12 to use Qwen2MoeMLP instead of Qwen2MoeSparseMoeBlock, the model loaded successfully (2 V100 cards, vLLM inference).
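For reference, here is a minimal sketch of setting the field when loading with transformers (the dtype/device choices are just placeholders for a two-V100 setup, not part of this PR):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"

config = AutoConfig.from_pretrained(model_id)
# Replace layer 12's Qwen2MoeSparseMoeBlock with a dense Qwen2MoeMLP.
config.mlp_only_layers = [12]

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.float16,
    device_map="auto",  # spread the weights across the two V100s
)
# Note: the checkpoint has no dense-MLP weights for the replaced layer, so
# from_pretrained will warn that layer 12's MLP is newly initialized; that is
# the accuracy trade-off (and the finetuning suggestions) discussed below.
```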
It is true that setting some layers to Qwen2MoeMLP drops some weights and causes an accuracy decline, but there are ways to mitigate it:
- Run inference on the finetuning data and find the layer whose Qwen2MoeSparseMoeBlock expert outputs are closest to zero, then set that layer to Qwen2MoeMLP (see the sketch after this list).
- Set mlp_only_layers first, then finetune.
- In my tests, switching a middle layer such as layer 12 to Qwen2MoeMLP barely affects inference accuracy.
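For the first suggestion, a rough probing sketch with forward hooks could look like the following (entirely illustrative: the prompts are placeholders, and the mean output norm of each Qwen2MoeSparseMoeBlock is only a crude proxy for "experts' outputs closest to zero"):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

norms = {}  # layer index -> list of mean L2 norms of the MoE block output

def make_hook(idx):
    def hook(module, inputs, output):
        # Qwen2MoeSparseMoeBlock.forward returns (hidden_states, router_logits)
        hidden = output[0] if isinstance(output, tuple) else output
        norms.setdefault(idx, []).append(hidden.float().norm(dim=-1).mean().item())
    return hook

handles = [
    layer.mlp.register_forward_hook(make_hook(idx))
    for idx, layer in enumerate(model.model.layers)
    if isinstance(layer.mlp, Qwen2MoeSparseMoeBlock)
]

# Run a few prompts from the finetuning domain through the model.
for prompt in ["placeholder finetuning-domain query 1", "placeholder query 2"]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        model(**inputs)

for h in handles:
    h.remove()

# Layers whose MoE output stays closest to zero are candidates for mlp_only_layers.
ranked = sorted(norms, key=lambda i: sum(norms[i]) / len(norms[i]))
print("candidate layers:", ranked[:4])
```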
This PR does not change the model design; it just makes the existing field decoder_sparse_step more flexible. Setting only 1, 2, or 3 layers to Qwen2MoeMLP is enough to fit the model into limited HBM, with minimal damage to the model.
cc @ArthurZucker
@amyeroberts Hi, 1. Do I need to continuously merge in the latest changes? 2. Do I need to fix all the CI errors?
Hi @eigen2017,
- You shouldn't need to continuously update from main. You will want your branch to be based on a recent commit of main, and obviously update if there are conflicts between files on this branch and main. Provided the PR is short-lived, the occasional rebase should be enough.
- Yes, if they're related to this PR. Looking at the errors at the moment, it looks like these are just connection errors from the Hub, in which case there's nothing for you to address. I'll re-run the tests now.
From the commit history, it looks like you might have rebased and not force pushed the branch. When rebasing, it's necessary to force push to properly update the remote, as rebasing is effectively re-writing history.
Hi @amyeroberts, thanks for your guidance. I force pushed and fixed some workflow CI errors. See my "Files changed"; it's a minimal modification.
@amyeroberts @ArthurZucker All CI now passes; please review my code and merge. Thanks again to @amyeroberts for the instructions. To reiterate: this PR does not affect any existing logic when mlp_only_layers defaults to an empty list, and mlp_only_layers is a useful config field for fitting Qwen MoE models onto small GPUs.
@amyeroberts Hi, many thanks for all the reviews! I will modify my code to match the suggestions. I still hold that mlp_only_layers is not a new feature; it only makes decoder_sparse_step more flexible. Consider why the Qwen MoE model exposes decoder_sparse_step to users in the first place: it means the original architecture already supports cutting specified layers' experts during finetuning and inference. There are other creative scenarios too: task 1 could finetune and infer using only some layers' experts while task 2 uses other layers, to improve multi-task results; or different layers' experts could be finetuned on different data to improve generalization. In my scenario, I need to trade some accuracy for HBM. Yes, I could use decoder_sparse_step to cut experts, but the smallest possible cut is 12 of 24 layers, after which the model becomes much weaker and hard to finetune or run inference with. Whatever the scenario, mlp_only_layers can theoretically perform at least as well as decoder_sparse_step, and it supports more scenarios.
I saw many related feature requests and issues while researching this; someone even asked vLLM to support CPU-offload inference like DeepSpeed. By the way, DeepSpeed recently merged my code. I have 10 years of experience in AI research, and recently I got the Hugging Face framework working on Ascend NPUs (Huawei GPUs).
Please give this PR and the conversation some deep thought before deciding whether to merge. I sincerely want to contribute to the great Hugging Face project. Thanks again. ^_^
@amyeroberts Hi, the code has been modified according to your suggestions, thanks! The CI errors are cleared again. Please help contact @ArthurZucker, thanks. Qwen MoE was also first committed by @bozheng-hit, so @bozheng-hit, if you see this, please reply and share your opinion.
As far as I know, this model was built by Alibaba, so confirmation from Alibaba members is welcome too.
@huybery Hi! As far as I can tell you are a Qwen member; please help check this PR and share your opinion, thanks.
@ArthurZucker Hi, please give this a review, thanks.
I actually think this is a lot simpler than the steps we used. Could be added for all MoE models, but this one is the only one that needs it for now!
Many thanks for this confirmation! I'll commit changes to address your review right away.
@ArthurZucker @amyeroberts All your review comments have been accepted and addressed, and CI has passed; please help merge this PR. Thank you again, and thanks to the great Hugging Face team! ^_^
@ArthurZucker Hi, the code has been updated per your review suggestions.
@ArthurZucker @amyeroberts Hi, I finished my scenario test and below is the report; I think it's helpful to record it here.
| | no model | glm finetune | moe finetune | moe finetune, then cut exp | moe cut exp, then finetune |
|---|---|---|---|---|---|
| acc | 48.15% | 53.69% | 54.66% | 50.38% | 51.91% |
| gen | NA | 50 tok/s | 85 tok/s | 105 tok/s | 105 tok/s |
| weight | NA | 12GB | 27GB | 22GB | 22GB |
| gpu | NA | 1 | 4 | 2 | 2 |
overview:
| concept | explanation |
|---|---|
| task | use a glm or moe model to summarize and rewrite the long user query, then pass it to a faq system, hoping to get better faq accuracy |
| data set | non-public, for a specific scenario |
| infer tech | vllm, which requires extra HBM for kv blocks; that's why the original 27GB moe cannot load on 2 gpus |
| no model | pass the long user query directly to the faq system |
| glm | chatglm3-6b |
| glm finetune | glm lora finetune, infer, then pass the generated query to the faq system |
| moe | Qwen1.5-MoE-A2.7B-Chat |
| moe finetune | moe lora finetune, infer, then pass the generated query to the faq system |
| moe finetune, then cut exp | moe lora finetune, cut exp (set mlp_only_layers to [12,14,16,18]), infer, then pass the generated query to the faq system |
| moe cut exp, then finetune | moe cut exp (set mlp_only_layers to [12,14,16,18]), then lora finetune, infer, then pass the generated query to the faq system |
concepts:
| concept | explanation |
|---|---|
| acc | the final faq accuracy |
| gen | token generation speed when generating the shortened form of a single user query |
| weight | minimal HBM requirement for one instance of the model |
| gpu | minimal number of nvidia v100-16G cards required |
conclusions:
- moe has better acc and gen speed than glm, but needs more HBM: it trades HBM space for speed. Yes, 2 gpus can load 2 instances of glm, but that only doubles the total throughput; for a single request, moe is twice as fast as glm.
- cutting experts causes a decline in acc, but lowers the gpu requirement and raises generation speed.
- cutting experts before finetuning gives higher acc than cutting experts after finetuning.
table above "cut exp" means cut 4 layers: [12,14,16,18], i also tried --enforce-eager true for vllm, it's means disable cuda graph feature, to trade generate speed for fewer HBM requirement. then i can cut only one layer's experts(only the 12th layer). generation speed declined to 60 tok/s and acc rised to 53.43% on 2 gpus.
This is only my own scenario for using mlp_only_layers; I think this flexible config field can enable many other usages.
Thanks for iterating!
It's my honor and pleasure!
🤗