Automatic model-parallel inference with DeepSpeed
Description
The current multi-GPU setup uses the simple pipeline parallelism (PP) provided by Hugging Face Transformers, which is inefficient because only one GPU can work at a time. DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce latency for inference. https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/inference-tutorial.md
DeepSpeed-Inference provides two methods of supporting MP:
- For a limited set of supported models, it will automatically partition the model as necessary, inject compatible high-performance kernels into your model and manage the inter-GPU communication. The list includes GPT-NeoX, GPT-J and OPT.
- For other models, pass an injection policy that specifies the two linear layers on a Transformer encoder/decoder layer whose outputs must be reduced across GPUs: 1) the attention output GeMM and 2) the layer output GeMM. (A sketch of both calls is shown below.)
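Roughly, the two calls look like this (a minimal sketch based on the linked tutorial; PATH_TO_OPT and PATH_TO_T5 are placeholders, and the exact injection-policy module names differ for every model class):

import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM
from transformers.models.t5.modeling_t5 import T5Block

world_size = int(os.getenv('WORLD_SIZE', '1'))

# Method 1: automatic partitioning and kernel injection (supported architectures only,
# e.g. an OPT checkpoint)
opt = AutoModelForCausalLM.from_pretrained("PATH_TO_OPT", torch_dtype=torch.half)
opt = deepspeed.init_inference(opt, mp_size=world_size, dtype=torch.half,
                               replace_with_kernel_inject=True)

# Method 2: generic tensor parallelism via an injection policy naming the
# attention-output and layer-output linear layers of each block; the T5 policy
# below follows the linked tutorial's example
t5 = AutoModelForSeq2SeqLM.from_pretrained("PATH_TO_T5", torch_dtype=torch.half)
t5 = deepspeed.init_inference(t5, mp_size=world_size, dtype=torch.half,
                              injection_policy={T5Block: ('SelfAttention.o',
                                                          'EncDecAttention.o',
                                                          'DenseReluDense.wo')})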
I am particularly interested in applying method 2 to LLaMA, because it is one of the largest models that textui supports and would benefit from performance improvements.
The LLaMA model has a structure like this:
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 5120, padding_idx=0)
(layers): ModuleList(
(0-39): 40 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=5120, out_features=5120, bias=False)
(k_proj): Linear(in_features=5120, out_features=5120, bias=False)
(v_proj): Linear(in_features=5120, out_features=5120, bias=False)
(o_proj): Linear(in_features=5120, out_features=5120, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
(down_proj): Linear(in_features=13824, out_features=5120, bias=False)
(up_proj): Linear(in_features=5120, out_features=13824, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=5120, out_features=32000, bias=False)
)
So I think the linear layers required by method 2 are self_attn.o_proj and mlp.act_fn. I tried to set up the model like this:
import torch
import deepspeed
from pathlib import Path
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from modules import shared  # text-generation-webui's shared settings

model = AutoModelForCausalLM.from_pretrained(Path(f"models/{shared.model_name}"))
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.up_proj')}
)
However upon running I encountered the following error:
Traceback (most recent call last):
File "/home/sgsdxzy/Programs/text-generation-webui/server.py", line 234, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/sgsdxzy/Programs/text-generation-webui/modules/models.py", line 54, in load_model
model = deepspeed.init_inference(
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 132, in __init__
self._apply_injection_policy(config, client_module)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 363, in _apply_injection_policy
replace_transformer_layer(client_module,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 535, in replace_transformer_layer
replaced_module = replace_module(model=model,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 800, in replace_module
replaced_module, _ = _replace_module(model, policy)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 827, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 827, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 827, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 817, in _replace_module
replaced_module = policies[child.__class__][0](child,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 531, in replace_fn
new_module = replace_wo_policy(child, _policy)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 514, in replace_wo_policy
return _replace_module(module)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 511, in _replace_module
_replace_module(child, name)
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 506, in _replace_module
linear_policies[child.__class__](child,
File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 434, in _replace
weight_shape[0] // mp_size if conv_linear_layer else weight_shape[1],
IndexError: tuple index out of range
I found out that the error arises when trying to replace the linear module self_attn.q_proj (not the output layer, so it goes through replace_wo_policy), and the weight_shape of that layer is torch.Size([0]). So I am totally lost and don't know where to look.
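For reference, the weight shapes can be checked with a plain PyTorch loop before calling init_inference (generic module inspection, not a DeepSpeed API):

import torch
# print every Linear module and its weight shape; for this model a fully
# materialized q_proj should show (5120, 5120)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, tuple(module.weight.shape))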
I am new to machine learning; I started installing and learning PyTorch around when LLaMA was published a few weeks ago. Can anybody figure out how to make MP work for LLaMA, or whether it is feasible in the first place on consumer GPUs without high-speed NVLink? (My 2080 Tis do have a gaming NVLink bridge, but I haven't tested it yet.)
@sgsdxzy what has been your experience with the current deepspeed implementation? Does it not work at all for multi-gpu setups?
@oobabooga Update: it seems I have to load the whole model in every process and let DeepSpeed partition it (before, I split-loaded the model across multiple GPUs, so each process could not find some of the parts). I can get the model to load now; however, I encountered another problem when trying to do inference: https://github.com/microsoft/DeepSpeed/issues/3099
Please keep me up to date. deepspeed_parameters.py and models.py are probably not configured to use deepspeed in the most general way possible and I would be interested in incorporating improvements.
@oobabooga Here is some mixed, but still very interesting, news:
First, I managed to get GPT-Neo and OPT to work. In fact, the kernel support list includes most of the model types textui supports (GPT-Neo/NeoX, GPT-J, GPT-2, BLOOM and OPT), so I think it should at least be easy to support all of these. I still cannot get LLaMA to work with custom injection, though.
Second, one important caveat: deepspeed-inference does not shard the model; the model has to be replicated on every device. So if a large model cannot fit on one GPU, you won't be able to use it at all, which severely limits its usefulness. And I cannot get 8-bit to work (yet), so you have to fit an fp16 copy of the model on every GPU.
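As a rough back-of-the-envelope check (approximate parameter counts, weights only, ignoring activations and the KV cache):

# fp16 stores 2 bytes per parameter
for name, params in [("OPT-6.7B", 6.7e9), ("LLaMA-13B", 13e9), ("LLaMA-30B", 32.5e9)]:
    print(name, f"~{params * 2 / 1024**3:.0f} GiB")
# -> ~12 GiB, ~24 GiB, ~61 GiB: a 22 GB 2080 Ti can hold an fp16 OPT-6.7B copy,
#    but not LLaMA-13B, if every rank needs the full model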
On the other hand, the performance is promising. I tested OPT-6.7B generating 500 tokens, and the timings were:
| Devices | Time (seconds) |
|---|---|
| 2080Ti 22G x 1 | 46.02 ± 0.17 |
| 2080Ti 22G x 2 w/o nvlink | 34.34 ± 0.06 |
| 2080Ti 22G x 2 + nvlink | 27.83 ± 0.01 |
So that is roughly a 34% speed gain without NVLink and 65% with it! I never expected tensor parallelism to work so well on consumer cards.
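In tokens per second that works out to (simple arithmetic on the table above):

# 500 generated tokens divided by the measured wall time
for setup, seconds in [("1x 2080Ti", 46.02), ("2x 2080Ti w/o nvlink", 34.34), ("2x 2080Ti + nvlink", 27.83)]:
    print(setup, f"{500 / seconds:.1f} tokens/s")
# -> about 10.9, 14.6 and 18.0 tokens/s respectively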
Third, deepspeed can be a bit troublesome for the current textui. It has a multi-process structure, and every process needs to call model.generate() with the same parameters simultaneously. You need to keep all of these processes alive and use some RPC to feed them the tokenized batches in order to build an interactive frontend.
My current working script as an example:
import torch
import time
import os
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# set by the deepspeed launcher for every spawned process
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model = 'PATH_TO_MODEL'
tokenizer = AutoTokenizer.from_pretrained(model)
# every rank loads the full fp16 model onto its own GPU
model = AutoModelForCausalLM.from_pretrained(model, device_map={"": local_rank}, torch_dtype=torch.half)
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    replace_with_kernel_inject=True
)

batch = tokenizer(
    "The primary use of LLaMA is research on large language models, including",
    return_tensors="pt",
    add_special_tokens=False
)
batch = {k: v.cuda() for k, v in batch.items()}

# all ranks must call generate() together; only rank 0 times and prints
if local_rank == 0:
    t0 = time.time()
generated = model.generate(batch["input_ids"], max_length=200)
if local_rank == 0:
    t1 = time.time()
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(f"Output generated in {(t1-t0):.2f} seconds", tokenizer.decode(generated[0]))
yes, deepspeed does not support, maybe your server.py specify the port
when use deepspeed multi gpu, only one gpu support
@huangjiaheng You can write in Chinese, I can read it.
It seems your translation software is cutting off sentences. If you struggle with English you can use Chinese.
pipe.model = deepspeed.init_inference(
    pipe.model,
    dtype=data_type,
    mp_size=world_size,
    replace_with_kernel_inject=args.use_kernel,
    replace_method=args.replace_method,
    max_tokens=args.max_tokens,
    save_mp_checkpoint_path=args.save_mp_checkpoint_path,
    **ds_kwargs
)
I looked at some other people's code, for example: https://github.com/huggingface/transformers-bloom-inference/blob/e970be1027afc43c147d06153635f4285c517081/bloom-inference-scripts/bloom-ds-inference.py or https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py
They can all spread the model's GPU memory across multiple GPUs.
Haha, have you solved it?
Update: I can get split loading to work following the example at https://github.com/huggingface/transformers-bloom-inference/blob/e970be1027afc43c147d06153635f4285c517081/bloom-inference-scripts/bloom-ds-inference.py, but int8 and LLaMA are still not working yet.
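The split-loading pattern in that script looks roughly like this (a sketch only; the checkpoint json contents and paths are placeholders, see the linked script for the real details):

import os
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

world_size = int(os.getenv('WORLD_SIZE', '1'))

config = AutoConfig.from_pretrained("PATH_TO_MODEL")
# build the model on the meta device so no single rank has to materialize all weights
with deepspeed.OnDevice(dtype=torch.half, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.half)
model = model.eval()

# a small json that tells DeepSpeed where the checkpoint shards live,
# e.g. {"type": "BLOOM", "checkpoints": ["...bin", ...], "version": 1.0}
checkpoints_json = "checkpoints.json"

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    checkpoint=checkpoints_json,
    replace_with_kernel_inject=True
)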
With help from https://github.com/microsoft/DeepSpeed/issues/3099, I managed to get tensor-parallel inference working for LLaMA! However, I noticed that without a custom optimized kernel the performance does not scale: 2080Ti 22G x 2 gives the same tokens/s as 2080Ti 22G x 1, so we gain nothing from TP over the current naive model parallelism. Still investigating, but I am wondering whether it's worth it for LLaMA. OPT/GPT-J/etc. can surely gain performance from this.
Any updates on getting deepspeed --num_gpus 2 server.py to work?
try to use 'self_attn.o_proj', 'mlp.down_proj'?
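In other words, presumably something like this for the injection policy (a sketch following the earlier setup; PATH_TO_MODEL is a placeholder):

import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", torch_dtype=torch.half)
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    # o_proj is the attention output GeMM, down_proj is the MLP/layer output GeMM
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')}
)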