Automatic model-parallel inference with DeepSpeed
Description
The current multi-GPU setup uses the simple pipeline parallelism (PP) provided by Hugging Face Transformers, which is inefficient because only one GPU can work at a time. DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce latency for inference. https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/inference-tutorial.md
DeepSpeed-Inference provides two methods of supporting MP:
- For a limited set of supported models, it will automatically partition the model as necessary, inject compatible high-performance kernels into your model and manage the inter-GPU communication. The list includes GPT-NeoX, GPT-J and OPT.
- For other models, pass an injection policy that specifies the two linear layers on a Transformer encoder/decoder layer whose outputs must be reduced across GPUs: 1) the attention output GeMM and 2) the layer output GeMM. (A sketch of both calls is shown below.)
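Roughly, the two calls look like this (a minimal sketch based on the linked tutorial; PATH_TO_OPT and PATH_TO_T5 are placeholders, and the exact injection-policy module names differ for every model class):

import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM
from transformers.models.t5.modeling_t5 import T5Block

world_size = int(os.getenv('WORLD_SIZE', '1'))

# Method 1: automatic partitioning and kernel injection (supported architectures only,
# e.g. an OPT checkpoint)
opt = AutoModelForCausalLM.from_pretrained("PATH_TO_OPT", torch_dtype=torch.half)
opt = deepspeed.init_inference(opt, mp_size=world_size, dtype=torch.half,
                               replace_with_kernel_inject=True)

# Method 2: generic tensor parallelism via an injection policy naming the
# attention-output and layer-output linear layers of each block; the T5 policy
# below follows the linked tutorial's example
t5 = AutoModelForSeq2SeqLM.from_pretrained("PATH_TO_T5", torch_dtype=torch.half)
t5 = deepspeed.init_inference(t5, mp_size=world_size, dtype=torch.half,
                              injection_policy={T5Block: ('SelfAttention.o',
                                                          'EncDecAttention.o',
                                                          'DenseReluDense.wo')})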
I am particularly interested in applying method 2 to LLaMA, because it is one of the largest models that textui supports and would benefit from performance improvements.
The LLaMA model has a structure like this:
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 5120, padding_idx=0)
(layers): ModuleList(
(0-39): 40 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=5120, out_features=5120, bias=False)
(k_proj): Linear(in_features=5120, out_features=5120, bias=False)
(v_proj): Linear(in_features=5120, out_features=5120, bias=False)
(o_proj): Linear(in_features=5120, out_features=5120, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
(down_proj): Linear(in_features=13824, out_features=5120, bias=False)
(up_proj): Linear(in_features=5120, out_features=13824, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=5120, out_features=32000, bias=False)
)
So I think the linear layers required by method 2 are self_attn.o_proj and mlp.act_fn. I tried to set up the model like this:
import torch
import deepspeed
from pathlib import Path
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from modules import shared  # text-generation-webui's shared settings

model = AutoModelForCausalLM.from_pretrained(Path(f"models/{shared.model_name}"))
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.up_proj')}
)
However upon running I encountered the following error:
Traceback (most recent call last):
File "/home/sgsdxzy/Programs/text-generation-webui/server.py", line 234, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/sgsdxzy/Programs/text-generation-webui/modules/models.py", line 54, in load_model
model = deepspeed.init_inference(
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 132, in __init__
self._apply_injection_policy(config, client_module)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 363, in _apply_injection_policy
replace_transformer_layer(client_module,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 535, in replace_transformer_layer
replaced_module = replace_module(model=model,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 800, in replace_module
replaced_module, _ = _replace_module(model, policy)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 827, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 827, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 827, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 817, in _replace_module
replaced_module = policies[child.__class__][0](child,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 531, in replace_fn
new_module = replace_wo_policy(child, _policy)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 514, in replace_wo_policy
return _replace_module(module)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 511, in _replace_module
_replace_module(child, name)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 506, in _replace_module
linear_policies[child.__class__](child,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 434, in _replace
weight_shape[0] // mp_size if conv_linear_layer else weight_shape[1],
IndexError: tuple index out of range
I found out that the error arises when trying to replace the linear module self_attn.q_proj (not the output layer, so it goes through replace_wo_policy), and the weight_shape of that layer is torch.Size([0]). So I am totally lost and don't know where to look.
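For reference, the weight shapes can be checked with a plain PyTorch loop before calling init_inference (generic module inspection, not a DeepSpeed API):

import torch
# print every Linear module and its weight shape; for this model a fully
# materialized q_proj should show (5120, 5120)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, tuple(module.weight.shape))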
I am new to machine learning; I started installing and learning PyTorch around when LLaMA was published a few weeks ago. Can anybody figure out how to make MP work for LLaMA, or whether it is feasible in the first place on consumer GPUs without high-speed NVLink? (My 2080 Tis do have a gaming NVLink bridge, but I haven't tested it yet.)
@sgsdxzy what has been your experience with the current deepspeed implementation? Does it not work at all for multi-gpu setups?
@oobabooga Update: it seems I have to load the whole model in every process and let DeepSpeed partition it (before, I split-loaded the model across multiple GPUs, so each process could not find some of the parts). I can get the model to load now; however, I encountered another problem when trying to do inference: https://github.com/microsoft/DeepSpeed/issues/3099
Please keep me up to date. deepspeed_parameters.py and models.py are probably not configured to use deepspeed in the most general way possible and I would be interested in incorporating improvements.
@oobabooga Here is some mixed, but still very interesting, news:
First, I managed to get GPT-Neo and OPT to work. In fact, the kernel support list includes most of the model types textui supports (GPT-Neo/NeoX, GPT-J, GPT-2, BLOOM and OPT), so I think it should at least be easy to support all of these. I still cannot get LLaMA to work with custom injection, though.
Second, one important caveat: deepspeed-inference does not shard the model; the model has to be replicated on every device. So if a large model cannot fit on one GPU, you won't be able to use it at all, which severely limits its usefulness. And I cannot get 8-bit to work (yet), so you have to fit an fp16 copy of the model on every GPU.
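As a rough back-of-the-envelope check (approximate parameter counts, weights only, ignoring activations and the KV cache):

# fp16 stores 2 bytes per parameter
for name, params in [("OPT-6.7B", 6.7e9), ("LLaMA-13B", 13e9), ("LLaMA-30B", 32.5e9)]:
    print(name, f"~{params * 2 / 1024**3:.0f} GiB")
# -> ~12 GiB, ~24 GiB, ~61 GiB: a 22 GB 2080 Ti can hold an fp16 OPT-6.7B copy,
#    but not LLaMA-13B, if every rank needs the full model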
On the other hand, the performance is promising. I tested OPT-6.7B generating 500 tokens, and the timings were:
| Devices | Time (seconds) |
|---|---|
| 2080Ti 22G x 1 | 46.02 ± 0.17 |
| 2080Ti 22G x 2 w/o nvlink | 34.34 ± 0.06 |
| 2080Ti 22G x 2 + nvlink | 27.83 ± 0.01 |
So that is roughly a 34% speed gain without NVLink and 65% with it! I never expected tensor parallelism to work so well on consumer cards.
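In tokens per second that works out to (simple arithmetic on the table above):

# 500 generated tokens divided by the measured wall time
for setup, seconds in [("1x 2080Ti", 46.02), ("2x 2080Ti w/o nvlink", 34.34), ("2x 2080Ti + nvlink", 27.83)]:
    print(setup, f"{500 / seconds:.1f} tokens/s")
# -> about 10.9, 14.6 and 18.0 tokens/s respectively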
Third, deepspeed can be a bit troublesome for the current textui. It has a multi-process structure, and every process needs to call model.generate() with the same parameters simultaneously. You need to keep all of these processes alive and use some RPC to feed them the tokenized batches in order to build an interactive frontend.
My current working script as an example:
import torch
import time
import os
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# set by the deepspeed launcher for every spawned process
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model = 'PATH_TO_MODEL'
tokenizer = AutoTokenizer.from_pretrained(model)
# every rank loads the full fp16 model onto its own GPU
model = AutoModelForCausalLM.from_pretrained(model, device_map={"": local_rank}, torch_dtype=torch.half)
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    replace_with_kernel_inject=True
)

batch = tokenizer(
    "The primary use of LLaMA is research on large language models, including",
    return_tensors="pt",
    add_special_tokens=False
)
batch = {k: v.cuda() for k, v in batch.items()}

# all ranks must call generate() together; only rank 0 times and prints
if local_rank == 0:
    t0 = time.time()
generated = model.generate(batch["input_ids"], max_length=200)
if local_rank == 0:
    t1 = time.time()
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(f"Output generated in {(t1-t0):.2f} seconds", tokenizer.decode(generated[0]))
yes, deepspeed does not support, maybe your server.py specify the port
when use deepspeed multi gpu, only one gpu support
@huangjiaheng You can write in Chinese, I can read it.
It seems your translation software is cutting off sentences. If you struggle with English you can use Chinese.
pipe.model = deepspeed.init_inference(
    pipe.model,
    dtype=data_type,
    mp_size=world_size,
    replace_with_kernel_inject=args.use_kernel,
    replace_method=args.replace_method,
    max_tokens=args.max_tokens,
    save_mp_checkpoint_path=args.save_mp_checkpoint_path,
    **ds_kwargs
)
I looked at some other people's code, for example: https://github.com/huggingface/transformers-bloom-inference/blob/e970be1027afc43c147d06153635f4285c517081/bloom-inference-scripts/bloom-ds-inference.py or https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py
They can all spread the model's GPU memory across multiple GPUs.
Haha, have you solved it?
Update: I can get split loading to work following the example at https://github.com/huggingface/transformers-bloom-inference/blob/e970be1027afc43c147d06153635f4285c517081/bloom-inference-scripts/bloom-ds-inference.py, but int8 and LLaMA are still not working yet.
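The split-loading pattern in that script looks roughly like this (a sketch only; the checkpoint json contents and paths are placeholders, see the linked script for the real details):

import os
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

world_size = int(os.getenv('WORLD_SIZE', '1'))

config = AutoConfig.from_pretrained("PATH_TO_MODEL")
# build the model on the meta device so no single rank has to materialize all weights
with deepspeed.OnDevice(dtype=torch.half, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.half)
model = model.eval()

# a small json that tells DeepSpeed where the checkpoint shards live,
# e.g. {"type": "BLOOM", "checkpoints": ["...bin", ...], "version": 1.0}
checkpoints_json = "checkpoints.json"

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    checkpoint=checkpoints_json,
    replace_with_kernel_inject=True
)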
With help from https://github.com/microsoft/DeepSpeed/issues/3099, I managed to get tensor-parallel inference working for LLaMA! However, I noticed that without a custom optimized kernel the performance does not scale: 2080Ti 22G x 2 gives the same tokens/s as 2080Ti 22G x 1, so we gain nothing from TP over the current naive model parallelism. Still investigating, but I am wondering whether it's worth it for LLaMA. OPT/GPT-J/etc. can surely gain performance from this.
Any updates on getting deepspeed --num_gpus 2 server.py to work?
try to use 'self_attn.o_proj', 'mlp.down_proj'?
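In other words, presumably something like this for the injection policy (a sketch following the earlier setup; PATH_TO_MODEL is a placeholder):

import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", torch_dtype=torch.half)
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    # o_proj is the attention output GeMM, down_proj is the MLP/layer output GeMM
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')}
)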