
mlp_only_layers is more flexible than decoder_sparse_step

Open eigen2017 opened this issue 1 year ago • 14 comments

Before this PR, the config field decoder_sparse_step decided which layers don't have experts, i.e. whether each layer uses Qwen2MoeSparseMoeBlock or Qwen2MoeMLP. However, that selection policy is not flexible enough. For example, when decoder_sparse_step = 2, layers 2, 4, 6, 8... use Qwen2MoeSparseMoeBlock and layers 1, 3, 5, 7... use Qwen2MoeMLP.

With this PR, the layer index list "mlp_only_layers" can choose, for any individual layer, whether it uses Qwen2MoeSparseMoeBlock or Qwen2MoeMLP. For example, only layer 12 can use Qwen2MoeMLP while all other layers keep Qwen2MoeSparseMoeBlock.
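
For clarity, here is a minimal sketch of the selection rule with a 0-based layer_idx (the import path and signatures follow the current qwen2_moe modeling module; the final merged code may differ slightly):

```python
from transformers.models.qwen2_moe.modeling_qwen2_moe import (
    Qwen2MoeMLP,
    Qwen2MoeSparseMoeBlock,
)


def build_layer_mlp(config, layer_idx):
    """Sketch of the per-layer choice; layer_idx is 0-based."""
    # A layer gets the sparse MoE block only if it is NOT listed in
    # mlp_only_layers and it falls on the decoder_sparse_step stride;
    # otherwise it falls back to the dense Qwen2MoeMLP.
    use_sparse = (
        layer_idx not in config.mlp_only_layers
        and config.num_experts > 0
        and (layer_idx + 1) % config.decoder_sparse_step == 0
    )
    if use_sparse:
        return Qwen2MoeSparseMoeBlock(config)
    return Qwen2MoeMLP(config, intermediate_size=config.intermediate_size)
```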

Support for "mlp_only_layers" matters a lot on low-memory GPU devices like the V100-16G. In my tests, Qwen1.5-MoE needed only a little more HBM than was available before hitting a CUDA OOM; after setting only layer 12 to use Qwen2MoeMLP instead of Qwen2MoeSparseMoeBlock, the model loaded successfully (2 V100 cards, vLLM inference).

It's true that setting some layers to Qwen2MoeMLP drops some weights and causes a decline in accuracy, but there are mitigations:

  1. Run inference on the fine-tuning data and find which layer's Qwen2MoeSparseMoeBlock expert outputs are closest to zero, then set that layer to Qwen2MoeMLP (see the sketch after this list).
  2. Set mlp_only_layers first, then fine-tune.
  3. In my tests, switching a middle layer such as layer 12 to Qwen2MoeMLP barely affects inference accuracy.
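
As a hedged illustration of mitigation 1 (not part of this PR): register forward hooks on each layer's mlp module, run the fine-tuning data through the model, and look for the layers whose block output is closest to zero. Here `model` is assumed to be a loaded Qwen2MoeForCausalLM and `dataloader` an existing dataloader of tokenized batches on the model's device.

```python
import torch

# Record the average output norm of each layer's mlp module (sparse MoE
# block or dense MLP); layers with the smallest output are candidates
# for mlp_only_layers.
norms = {}


def make_hook(idx):
    def hook(module, inputs, output):
        # Qwen2MoeSparseMoeBlock returns (hidden_states, router_logits);
        # the dense MLP returns a plain tensor.
        hidden = output[0] if isinstance(output, tuple) else output
        norms.setdefault(idx, []).append(hidden.norm(dim=-1).mean().item())
    return hook


handles = [
    layer.mlp.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

with torch.no_grad():
    for batch in dataloader:
        model(**batch)

for h in handles:
    h.remove()

avg = {i: sum(v) / len(v) for i, v in norms.items()}
print(sorted(avg.items(), key=lambda kv: kv[1])[:4])  # layers closest to zero output
```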

This PR does not change the model design; it just makes the existing field decoder_sparse_step more flexible. Setting only 1, 2, or 3 layers to Qwen2MoeMLP can smoothly fit the model into limited HBM with minimal damage to the model.
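
As a usage illustration, a minimal sketch assuming this PR is merged. Note that loading a pretrained checkpoint this way drops the affected layer's expert weights and randomly initializes its dense MLP, which is exactly the accuracy trade-off discussed above.

```python
from transformers import Qwen2MoeConfig, Qwen2MoeForCausalLM

# Make only layer 12 use the dense Qwen2MoeMLP; all other layers keep
# their Qwen2MoeSparseMoeBlock.
config = Qwen2MoeConfig.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")
config.mlp_only_layers = [12]

model = Qwen2MoeForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat",
    config=config,
)
```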

eigen2017 avatar Apr 29 '24 17:04 eigen2017

cc @ArthurZucker

amyeroberts avatar Apr 29 '24 17:04 amyeroberts

@amyeroberts hi, 1. do I need to continuously merge in the latest changes from main? 2. do I need to fix all the CI errors?

eigen2017 avatar Apr 30 '24 13:04 eigen2017

Hi @eigen2017,

  1. You shouldn't need to continuously update from main. You will want your branch to be from a recent commit on main, and obviously update if there are conflicts between files on this branch and main. Provided the PR is short-lived, the occasional rebase should be enough.
  2. Yes, if they're related to this PR. Looking at the errors at the moment, it looks like these are just connection errors from the hub, in which case there's nothing for you to address. I'll re-run the tests now.

From the commit history, it looks like you might have rebased and not force pushed the branch. When rebasing, it's necessary to force push to properly update the remote, as rebasing is effectively re-writing history.

amyeroberts avatar Apr 30 '24 14:04 amyeroberts

Hi @amyeroberts, thanks for your guidance. I force pushed the branch and fixed some workflow CI errors. See "Files changed"; it's the minimal modification.

eigen2017 avatar Apr 30 '24 20:04 eigen2017

@amyeroberts @ArthurZucker now all CI passed, please review my code and merge; thanks again to @amyeroberts for the instructions. To reiterate: this PR does not affect any existing logic when mlp_only_layers defaults to an empty list, and mlp_only_layers is a useful config field that lets Qwen MoE models fit on low-memory GPUs.

eigen2017 avatar Apr 30 '24 20:04 eigen2017

@amyeroberts Hi, many thanks for all the reviews! I will modify my code to match the suggestions. I still hold that mlp_only_layers is not a new feature; it only makes decoder_sparse_step more flexible. Consider why the Qwen MoE model exposes decoder_sparse_step to users at all: it means the original architecture already supports cutting specified layers' experts during fine-tuning and inference. There are other creative scenarios too: task 1 could fine-tune and infer with only some layers' experts while task 2 uses other layers, to improve multi-task results; or different data could fine-tune different layers' experts to give the model better generalization. In my scenario, I need to trade some accuracy for HBM. Yes, I can use decoder_sparse_step to cut experts, but the minimum cut is 12 of the 24 layers, after which the model becomes dumb and hard to fine-tune or run inference with. In any scenario where decoder_sparse_step participates, mlp_only_layers can theoretically perform at least as well, and it supports more scenarios.

I saw many related feature and issue requests during my research on this; someone even asked vLLM to support CPU-offload inference like DeepSpeed. By the way, DeepSpeed recently merged my code. I have 10 years of experience in AI research, and I recently made the Hugging Face framework work on Ascend NPUs (Huawei hardware).

Please give this PR and the conversation some deep thought before deciding whether to merge. I sincerely want to contribute to the great Hugging Face, thanks again. ^_^

eigen2017 avatar May 01 '24 18:05 eigen2017

@amyeroberts Hi~ the code is modified according to your suggestions, thanks!! CI errors are cleared again. Please help contact @ArthurZucker, thanks~ Qwen MoE was also first committed by @bozheng-hit, so @bozheng-hit, if you see this, please reply and share your opinion~

eigen2017 avatar May 04 '24 13:05 eigen2017

As far as I know, this model was built by Alibaba, so confirmation from Alibaba members is welcome too.

eigen2017 avatar May 04 '24 13:05 eigen2017

@huybery Hi! As far as I can tell, you are a Qwen member; please help check this PR and share your opinion, thanks.

eigen2017 avatar May 04 '24 13:05 eigen2017

@ArthurZucker Hi, please give this a review, thanks.

eigen2017 avatar May 07 '24 01:05 eigen2017

I actually think this is a lot simpler than the steps we used. It could be added for all MoE models, but this one is the only one that needs it for now!

Many thanks for this confirmation!! I'll commit right away to address your review.

eigen2017 avatar May 08 '24 02:05 eigen2017

@ArthurZucker @amyeroberts all your review comments have been accepted and addressed, and CI passed. Please help merge this PR. Thank you again, and thanks to the great Hugging Face~~~ ^_^

eigen2017 avatar May 08 '24 03:05 eigen2017

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker Hi, the code is updated per your review suggestions.

eigen2017 avatar May 09 '24 15:05 eigen2017

@ArthurZucker @amyeroberts Hi~ I finished my scenario test and below is the report; I think it's helpful to record it here.

|  | no model | glm finetune | moe finetune | moe finetune, then cut exp | moe cut exp, then finetune |
| --- | --- | --- | --- | --- | --- |
| acc | 48.15% | 53.69% | 54.66% | 50.38% | 51.91% |
| gen | NA | 50 tok/s | 85 tok/s | 105 tok/s | 105 tok/s |
| weight | NA | 12GB | 27GB | 22GB | 22GB |
| gpu | NA | 1 | 4 | 2 | 2 |

overview:

| concept | explanation |
| --- | --- |
| task | use a glm or moe model to summarize and rewrite long user queries, then pass them to a FAQ system, hoping for better FAQ accuracy |
| data set | non-public, for specific scenarios |
| infer tech | vLLM, which requires extra HBM for KV blocks; that's why the original 27GB moe cannot load on 2 GPUs |
| no model | pass the long user query directly to the FAQ system |
| glm | chatglm3-6b |
| glm finetune | glm LoRA fine-tune, infer, then pass the generated query to the FAQ system |
| moe | Qwen1.5-MoE-A2.7B-Chat |
| moe finetune | moe LoRA fine-tune, infer, then pass the generated query to the FAQ system |
| moe finetune, then cut exp | moe LoRA fine-tune, cut exp (set mlp_only_layers to [12,14,16,18]), infer, then pass the generated query to the FAQ system |
| moe cut exp, then finetune | moe cut exp (set mlp_only_layers to [12,14,16,18]), then LoRA fine-tune, infer, then pass the generated query to the FAQ system |

concepts:

| concept | explanation |
| --- | --- |
| acc | the final FAQ accuracy |
| gen | token generation speed when generating the shortened form of a single user query |
| weight | minimum HBM required by one instance of the model |
| gpu | minimum number of NVIDIA V100-16G cards required |

conclusions:

  1. moe has better acc and gen speed than glm, but needs more HBM; it trades HBM space for speed. Yes, 2 GPUs can load 2 instances of glm, but that only doubles the TPS; for a single request, moe is about twice as fast as glm.
  2. Cutting experts causes a decline in acc, but needs fewer GPUs and gives higher generation speed.
  3. Cutting experts before fine-tuning yields higher acc than cutting experts after fine-tuning.

In the table above, "cut exp" means cutting 4 layers: [12, 14, 16, 18]. I also tried vLLM's --enforce-eager option, which disables the CUDA graph feature to trade generation speed for a lower HBM requirement. With it, I could cut only one layer's experts (only layer 12); generation speed declined to 60 tok/s and acc rose to 53.43% on 2 GPUs.
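
For reference, a minimal sketch of how the field can be applied for serving, assuming the serving stack (vLLM here) builds the model from the checkpoint's Hugging Face config.json and honors mlp_only_layers; the path is hypothetical:

```python
import json

# Add mlp_only_layers to a local copy of the checkpoint's config.json so
# that the listed layers fall back to the dense MLP when the model is built.
cfg_path = "/path/to/Qwen1.5-MoE-A2.7B-Chat/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["mlp_only_layers"] = [12, 14, 16, 18]  # layers whose experts are cut

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```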

This is only my scenario for mlp_only_layers; I think this flexible config field can enable many other usages.

eigen2017 avatar May 10 '24 10:05 eigen2017

Thanks for iterating!

It's my honor and pleasure!

eigen2017 avatar May 10 '24 12:05 eigen2017

🤗

ArthurZucker avatar May 10 '24 12:05 ArthurZucker