Add FALCON-40B Inference-Kernel Support

Open RezaYazdaniAminabadi opened this issue 1 year ago • 12 comments

This PR adds the Policy, Containers and some kernels for running the FALCON-40B model with tensor-model parallelism.

FALCON-40B Architecture Overview

The FALCON model is an interesting model with an inference-friendly structure: 1) it shares the K and V heads across the query heads, broadcasting them in groups of 16, which reduces the KV-cache by 16x and allows inference to run very efficiently with much higher throughput; 2) similar to the GPT-J and GPT-NeoX architectures, it uses parallel MLP and Attention blocks, which on one hand helps overlap computation when there is not enough work to saturate the GPU cores, and on the other hand reduces communication when using tensor-model parallelism, since only one all-reduce is required at the end of each layer.
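
To make the first point concrete, here is a minimal sketch (not the DeepSpeed kernel; the dimensions are illustrative) of how a small set of shared K/V heads can be broadcast across the query heads before the attention call:

import torch
import torch.nn.functional as F

# Illustrative sizes: 128 query heads sharing 8 K/V heads, i.e. groups of 16.
batch, seq_len, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 128, 8
group = n_q_heads // n_kv_heads          # 16 query heads per shared K/V head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # only these are cached,
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # so the KV-cache is 16x smaller

# Broadcast each K/V head across its group of query heads.
k_exp = k.repeat_interleave(group, dim=1)
v_exp = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)   # torch.Size([1, 128, 16, 64])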

Testing the model using Multi-GPU Inference

To run this model, I used the following code snippet with the same query used on the Hugging Face website to test this model, on 4x A100-40GB. One side note: you cannot run this model as-is on older NVIDIA architectures, such as V100, since it uses an operation (F.scaled_dot_product_attention) that only runs on GPUs with compute capability 8.0 or higher. With DeepSpeed-Inference kernel support, you can also run it on 4x V100-32GB without any changes to the original model code.

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)


model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True).bfloat16()
model = deepspeed.init_inference(model, mp_size=4, replace_with_kernel_inject=True)

input_prompt = [
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
   ]
input_tokens = tokenizer.batch_encode_plus(input_prompt, return_tensors="pt",)
token_num = input_tokens['input_ids'].size(-1)
for t in input_tokens:
    if torch.is_tensor(input_tokens[t]):
        input_tokens[t] = input_tokens[t].to(get_accelerator().current_device_name())
input_tokens.pop('token_type_ids')
sequences = model.generate(**input_tokens, min_length=200, max_length=300, do_sample=True)

if torch.distributed.get_rank() == 0:
    print(f"Result: {tokenizer.batch_decode(sequences, skip_special_tokens=True)[0]}")

Generation Result:

Result: Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.                            
Daniel: Hello, Girafatron!                                                                                         
Girafatron: Yes! I am one with the universe! I have become the Girafron!                                                                                                                                                               
Daniel: No, that's not quite it...
Girafatron: Wait... it's "Girafatron". Yes... I am one with the Universe! You'll never understand my glory! My Magnificance!
Daniel: Maybe not, but my dog has been to school. I have a degree from UCLA.
Girafatron: You have what? I knew it... you're one of THEM!
Daniel: Yes... I'm one of.. uhm.. I'm one of the human race.
Girafatron: No! Not HumanRace: The evil empire bent on the destruction of all that is pure and holy... like the girafe.
Daniel: Well... actually I-I'm not with HumanRace...
Girafatron: You are! You must be! I am all knowing! You are with them! I shall find a way to bring you to your knees and destroy your evil empire!
Daniel: Girafatron, please! I swear

Performance Evaluation

To measure performance, I ran the same query 10 times and took the average per-token latency. I used PyTorch 2.0.1+cu118 as the baseline. Compared to PyTorch, DeepSpeed-Inference obtains a 2.5x speedup, reducing per-token latency from 93 ms to 36 ms.

TODO:

  • [ ] Verify that model accuracy is acceptable when running with the inference kernels.

RezaYazdaniAminabadi avatar Jun 01 '23 08:06 RezaYazdaniAminabadi

If the model is loaded from a local path, AutoModelForCausalLM.from_pretrained("path/on/disk"), no injection is made. I assume it is due to that "hack" in auto_tp?

Yard1 avatar Jun 05 '23 19:06 Yard1

Got this error with this PR on 4xA30:

Traceback (most recent call last):
  File "/secondary/thies/falcon_services/test.py", line 33, in <module>
    sequences = model.generate(**input_tokens, min_length=200, max_length=300, do_sample=True)
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 635, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/transformers/generation/utils.py", line 1565, in generate
    return self.sample(
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/transformers/generation/utils.py", line 2612, in sample
    outputs = self(
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/secondary/thies/.cache/huggingface/modules/transformers_modules/falcon-40b/modelling_RW.py", line 759, in forward
    transformer_outputs = self.transformer(
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/secondary/thies/.cache/huggingface/modules/transformers_modules/falcon-40b/modelling_RW.py", line 654, in forward
    outputs = block(
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 157, in forward
    self.attention(input,
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 162, in forward
    context_layer, key_layer, value_layer = self.compute_attention(qkv_out=qkv_out,
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 103, in compute_attention
    attn_key_value = self.score_context_func(
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/secondary/thies/.virtualenvs/falcon/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/op_binding/softmax_context.py", line 42, in forward
    output = self.softmax_context_func(query_key_value, attn_mask, self.config.rotary_dim, self.config.rotate_half,
TypeError: softmax_context_fp16(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: int, arg3: bool, arg4: bool, arg5: int, arg6: float, arg7: bool, arg8: bool, arg9: int, arg10: bool, arg11: int, arg12: int, arg13: torch.Tensor) -> List[torch.Tensor]

Invoked with: tensor([[[-3.1523e+00,  2.6133e+00, -1.9990e+00,  ...,  2.6901e-02,
          -1.9211e-02,  4.4479e-03],
         [-6.0547e-01,  9.7461e-01, -1.2490e+00,  ..., -5.4016e-02,
           9.8877e-02, -6.2378e-02],
         [-1.0098e+00,  9.8242e-01, -1.5137e+00,  ..., -4.5654e-02,
           5.0720e-02,  3.3386e-02],
         ...,
         [-4.7656e-01,  9.1553e-01, -1.2051e+00,  ..., -8.0933e-02,
           1.0901e-01,  1.5282e-02],
         [-9.2090e-01,  1.1406e+00, -1.4092e+00,  ..., -8.6243e-02,
           1.1487e-01,  1.3809e-03],
         [-8.8574e-01,  8.0859e-01, -2.3984e+00,  ..., -3.4912e-02,
          -3.3875e-03,  4.2053e-02]]], device='cuda:1', dtype=torch.float16), tensor([[-0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
         -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
         -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.]],
       device='cuda:1', dtype=torch.float16), 64, True, False, 32, 0.35355339059327373, True, False, 1, False, 0, 60, tensor([4.4374e-15]), True, 2

DeepSpeed version: 0.9.3+0df4059d torch version: 2.0.1+cu118

thies1006 avatar Jun 06 '23 14:06 thies1006

@RezaYazdaniAminabadi I am unable to replicate the latency (getting >100ms). Can you share more information about your environment?

Yard1 avatar Jun 06 '23 21:06 Yard1

@RezaYazdaniAminabadi I am unable to replicate the latency (getting >100ms). Can you share more information about your environment?

Hi @Yard1, I use torch 2.0.1+cuda11.8 on 4x A100-40GB. Can you please tell me how you are testing this on your side? Thanks, Reza

RezaYazdaniAminabadi avatar Jun 07 '23 03:06 RezaYazdaniAminabadi

@RezaYazdaniAminabadi The error above was my fault (incorrect installation), apologies. With 4xA30 and torch2.0.1+cuda11.8 I had to set max_out_tokens=300 to avoid GPU OOM.

The times I get (I generate twice and take the second run because the first one is always a bit slower): 117 ms/token without kernel injection, 76 ms/token with kernel injection.

Generations, without kernel inject:

Result: Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.
Daniel: Hello, Girafatron!
Girafatron: Ah! Another being to join my holy movement!
Daniel: I'm not gonna join your movement!
Girafatron: No. You must join my movement!
Daniel: I'm just shopping for my dinner.
Girafatron: But the great giraffe must be praised and worshiped!
Daniel: I will not join your dumb movement.
Girafatron: Then I will force you to worship my glorious leader.
Daniel: What?
(Girafatron throws a giraffe doll at Daniel, causing him to lose his balance. The camera zooms in on the doll as it moves Daniel to the shelf containing toilet paper)
Girafatron: GIRAAAFFFFFFFFF!!!!
(Daniel screams. Girafatron is about to finish him off with a headbutt, but pauses and looks around the store, noticing that it looks as if someone is robbing it or something like that. Girafatron runs to the cash registers, and sees a bunch of people with their cash registers open, apparently being robbed)
People: AH!
Man 1: A

With kernel inject:

Result: Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.
Daniel: Hello, Girafatron!
Girafatron: (looks around) Where are you?
Daniel: Right here!
Girafatron: (stands) Oh! My goodness!
Girafatron: (stands and begins to walk around the table to the left) Where are you Daniel?! You've said you'd bring me my lunch.
Girafatron: (stands as if he were being inspected)
Girafatron: Are you the one taking me to my lunch table? I have missed you. Today is fish. Today we have fish.
Daniel: Ohh, you're going to love today's fish.
Girafatron: If you are my new lunch partner I would like to have our first lunch together today.
Daniel: Um..
Girafatron: Today we are having fish. Did you bring the fish?
Daniel: Yes, it's in the backpack.
Girafatron: When shall I eat? May I eat soon? I am very hungry. (Girafatron walks to the back of the table.) Where is this fish? Will

thies1006 avatar Jun 07 '23 11:06 thies1006

@RezaYazdaniAminabadi I have access only to A100-80GB (p4de.24xlarge). I have run the following script:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)


model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True).bfloat16()
model = deepspeed.init_inference(model, mp_size=2, replace_with_kernel_inject=True)

input_prompt = [
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
   ]
input_tokens = tokenizer.batch_encode_plus(input_prompt, return_tensors="pt",)
token_num = input_tokens['input_ids'].size(-1)
for t in input_tokens:
    if torch.is_tensor(input_tokens[t]):
        input_tokens[t] = input_tokens[t].to(get_accelerator().current_device_name())
input_tokens.pop('token_type_ids')
# Warmup
sequences = model.generate(**input_tokens, min_new_tokens=512, max_new_tokens=512, do_sample=True)
st = time.monotonic()
for i in range(2):
    sequences = model.generate(**input_tokens, min_new_tokens=512, max_new_tokens=512, do_sample=True)
tt = time.monotonic() - st
print(f"Time taken {tt/2} time per new token {tt/512/2}")
if torch.distributed.get_rank() == 0:
    print(f"Result: {tokenizer.batch_decode(sequences, skip_special_tokens=True)[0]}")

I just ran this as deepspeed --num_gpus N script.py.

Results:

Time taken 45.75584076349992 time per new token 0.08936687649121078

ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ray/anaconda3/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/home/ray/anaconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.3+0df4059d, 0df4059d, ds-inference/add-falcon-support
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

I have tried with 2 A100s and 4 A100s and got similar results for both.

I have noticed that latency increases linearly when increasing the batch size, and that each subsequent token takes longer to generate (which can be pretty dramatic with a large number of input/output tokens). Given that the main change I made compared to your script was to increase the number of tokens from 300 to 512, I wager that's the problem. I have seen similar behavior with the 7B model without DeepSpeed, so I assume it's due to the architecture. Still, that is very suboptimal. Are there any optimization tweaks that can be done on DeepSpeed's side to fix this, or should this be taken up with the model's authors?

Yard1 avatar Jun 07 '23 18:06 Yard1

Hi @Yard1,

Thanks for sharing your script. Two things that you may want to consider changing in the script to get a more accurate performance evaluation:

  1. When dividing the time by the number of generated tokens, subtract the prompt length from the 512-token budget so that you count only the newly generated tokens; this gives the accurate generation latency.
  2. Please add a torch.cuda.synchronize() before taking the time, both before and after calling generate in the loop (see the sketch below).
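
For clarity, a minimal sketch of both suggestions applied together (illustrative only; it assumes the model and input_tokens from the script above and the same 512-token budget):

import time
import torch

prompt_len = input_tokens["input_ids"].size(-1)   # tokens already in the prompt
gen_len = 512                                     # max_length used in the benchmark

torch.cuda.synchronize()                          # (2) finish any pending GPU work
st = time.monotonic()
for _ in range(2):
    sequences = model.generate(**input_tokens, min_length=gen_len,
                               max_length=gen_len, do_sample=True)
torch.cuda.synchronize()                          # (2) wait for generation to complete
tt = time.monotonic() - st

new_tokens = gen_len - prompt_len                 # (1) count only newly generated tokens
print(f"Time per new token: {tt / (2 * new_tokens):.4f} s")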

Please let me know if that changes the latency. Thanks, Reza

RezaYazdaniAminabadi avatar Jun 08 '23 08:06 RezaYazdaniAminabadi

Thanks @thies1006 for verifying that this works on your side. I think your perf improvement is smaller (about 50%); however, since you are using model parallelism, the inference performance depends heavily on the communication bandwidth you can achieve across these GPUs.

RezaYazdaniAminabadi avatar Jun 08 '23 08:06 RezaYazdaniAminabadi

@RezaYazdaniAminabadi Also on my side, when I run with batch_size=2 the latency gets much worse (see the comment from @Yard1). However, memory consumption doesn't go up, so I guess the batch is not run in parallel? It's maybe not in the scope of this PR, but it would be great if you could comment on this.

thies1006 avatar Jun 08 '23 10:06 thies1006

Hi @RezaYazdaniAminabadi, here's my updated script:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)


model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True).bfloat16()
model = deepspeed.init_inference(model, mp_size=2, replace_with_kernel_inject=True)

input_prompt = [
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
   ]
input_tokens = tokenizer.batch_encode_plus(input_prompt, return_tensors="pt",)
token_num = input_tokens['input_ids'].size(-1)
for t in input_tokens:
    if torch.is_tensor(input_tokens[t]):
        input_tokens[t] = input_tokens[t].to(get_accelerator().current_device_name())
input_tokens.pop('token_type_ids')
# Warmup
sequences = model.generate(**input_tokens, min_length=512, max_length=512, do_sample=True)
torch.cuda.synchronize()
st = time.monotonic()
for i in range(2):
    torch.cuda.synchronize()
    sequences = model.generate(**input_tokens, min_length=512, max_length=512, do_sample=True)
    torch.cuda.synchronize()
tt = time.monotonic() - st
print(f"Time taken {tt/2} time per new token {tt/512/2}")
if torch.distributed.get_rank() == 0:
    print(f"Result: {tokenizer.batch_decode(sequences, skip_special_tokens=True)[0]}")

With those changes (adding torch.cuda.synchronize and changing min/max_new_tokens to min/max_length) I get the following results:

Time taken 36.22169333400001 time per new token 0.07074549479296877

This is still slower than what you were seeing, @RezaYazdaniAminabadi. Could you check if you get a similar result when using 512 instead of 300 tokens?

EDIT: The results with 300 tokens match what you have gotten more closely:

Time taken 13.90468427299993 time per new token 0.04634894757666643

It would appear that the Falcon model has an issue with past_key_values not being used, which would explain why each subsequent token takes longer to predict. Still investigating the batch size problem.
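
As a side note, one quick way to check whether the KV cache is actually helping (a sketch only; it assumes the model and input_tokens from the script above):

import time
import torch

def time_generate(use_cache, max_length=512):
    # With a working KV cache each new token only attends over cached keys/values;
    # without it, every step re-processes the whole growing sequence.
    torch.cuda.synchronize()
    st = time.monotonic()
    model.generate(**input_tokens, min_length=max_length, max_length=max_length,
                   do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    return time.monotonic() - st

# If the two timings are close, past_key_values is effectively not being used.
print("use_cache=True :", time_generate(True))
print("use_cache=False:", time_generate(False))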

Yard1 avatar Jun 08 '23 17:06 Yard1

@RezaYazdaniAminabadi this solution will not work on Falcon-7B since the modelling file is different. I think this is a bug HuggingFace needs to solve, but just FYI. Maybe a workaround is possible, such as changing the number of attention layers to inject.

lanking520 avatar Jun 13 '23 18:06 lanking520

@RezaYazdaniAminabadi this solution will not work on Falcon-7B since the modelling file is different. I think this is a bug HuggingFace needs to solve, but just FYI. Maybe a workaround is possible, such as changing the number of attention layers to inject.

Yes, I know it does not work there. I will look into it and see how it can be supported. Thanks for letting me know.

Best, Reza

RezaYazdaniAminabadi avatar Jun 16 '23 02:06 RezaYazdaniAminabadi

Can you share your command and env? @RezaYazdaniAminabadi

I always get this error:

[2023-06-21 06:59:34,573] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/envs/bin/python', '-u', 'test_ds.py', '--local_rank=7'] exits with return code = -9

My env:

deepspeed 0.9.3, torch 2.0, 8x V100 32GB

Below is the script:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True).bfloat16()
model = deepspeed.init_inference(model, mp_size=2, replace_with_kernel_inject=True)

input_prompt = [
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
   ]
input_tokens = tokenizer.batch_encode_plus(input_prompt, return_tensors="pt",)
token_num = input_tokens['input_ids'].size(-1)
for t in input_tokens:
    if torch.is_tensor(input_tokens[t]):
        input_tokens[t] = input_tokens[t].to(get_accelerator().current_device_name())
input_tokens.pop('token_type_ids')

# Warmup
sequences = model.generate(**input_tokens, min_new_tokens=512, max_new_tokens=512, do_sample=True)
st = time.monotonic()
for i in range(2):
    sequences = model.generate(**input_tokens, min_new_tokens=512, max_new_tokens=512, do_sample=True)
tt = time.monotonic() - st
print(f"Time taken {tt/2} time per new token {tt/512/2}")
if torch.distributed.get_rank() == 0:
    print(f"Result: {tokenizer.batch_decode(sequences, skip_special_tokens=True)[0]}")

alexwong2024 avatar Jun 21 '23 08:06 alexwong2024

This is really nice work! Look forward to Falcon 7b!

ldong87 avatar Jun 26 '23 04:06 ldong87

Hi guys, sorry I was so slow on this thread. I will start working more on this toward the weekend and bring in FALCON-7B support too. I am actually amazed by how much interest there is in this work. In terms of inference of this model at large scale, there is still an issue with the long initialization time, which comes from two things:

  • Loading the model on CPU and then populating it from the checkpoints is currently handled entirely through the HuggingFace+accelerate stack, and this increases the initialization time of large models significantly. Fortunately, we have a feature in DeepSpeed that creates the model with meta tensors, plus checkpoint-loading logic that reads the data, splits it, and passes it to each corresponding GPU (a tiny sketch of the meta-tensor idea follows below).
  • Each checkpoint includes all the data required on all GPUs, so with tensor parallelism every parallel process has to load all of the checkpoints, which increases checkpoint-loading time. The alternative is tp-sharded checkpoints, so that the model loads faster with each process reading only one or a few checkpoint files.

I will work on adding some of this support for this model. I would also appreciate it if anyone would like to help improve, or reuse, some of the techniques that have already been developed.
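
As a rough illustration of the meta-tensor idea (a sketch only; the actual DeepSpeed loading path is shown in the script further down this thread):

import torch

# Modules built on the "meta" device record shapes and dtypes but allocate no storage,
# so a very large model can be instantiated instantly and its weights filled in later
# directly from checkpoint shards on the target GPUs (requires PyTorch 2.x for the
# device context manager).
with torch.device("meta"):
    layer = torch.nn.Linear(8192, 8192)

print(layer.weight.device)   # meta
print(layer.weight.shape)    # torch.Size([8192, 8192]) -- metadata only, no memory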

@alexwong2024, I have not seen this error before, but could it be that you are running out of CPU memory? The current script in this PR requires about 640 GB (4 * 160 GB) of memory. I have this other script that reduces the memory requirement to half of that:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

pipeline.model = deepspeed.init_inference(pipeline.model, mp_size=4, replace_with_kernel_inject=True)
pipeline.device = torch.device(f'cuda:{torch.cuda.current_device()}')

sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

can you please try it and see if the issue is resolved? Thanks, Reza

RezaYazdaniAminabadi avatar Jun 29 '23 23:06 RezaYazdaniAminabadi

I wanted to help, but writing CUDA/C++ code is not really my strength. I'm happy to do some testing once it reaches that stage.

I would like to try Falcon-7B with DeepSpeed inference because I believe the MQA can bring the latency down a lot and make Falcon-7B very production friendly. This can be especially important for cases where streaming output is not available.

On another note, I tried the Falcon-7B HuggingFace implementation. Somehow the latency is really high, around twice as much as models with a similar architecture that don't have MQA, like stabilitylm 7B. Using or not using Accelerate doesn't make a difference. I wonder if the MQA is not implemented correctly.

ldong87 avatar Jun 30 '23 02:06 ldong87

Hi everyone, I have added some changes here that can boost the loading time of this model significantly (from 10 min to less than 15 sec). To test this, please use the following script:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator
import json
import io
import os
from pathlib import Path
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("--save_mp_sharded_ckpt", required=False, action='store_true')
args = parser.parse_args()

repo_root = '~/.cache/huggingface/hub/models--tiiuae--falcon-40b/snapshots/c47b371b31a68349c233104050ac76680b8485db/'
model = "tiiuae/falcon-40b"

if args.save_mp_sharded_ckpt:
    checkpoints_json = "checkpoints.json"
    with io.open(checkpoints_json, "w", encoding="utf-8") as f:
        file_list = [str(entry).split('/')[-1] for entry in Path(repo_root).rglob("*.[bp][it][n]") if entry.is_file()]
        data = {"type": "ds_model", "checkpoints": file_list, "version": 1.0}
        json.dump(data, f)
else:
    checkpoints_json = "/tmp/falcon-40b/ds_inference_config.json"

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
config = AutoConfig.from_pretrained(model, trust_remote_code=True)

with deepspeed.OnDevice(dtype=torch.bfloat16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

model = deepspeed.init_inference(model, 
                                 mp_size=int(os.getenv("WORLD_SIZE", "1")), 
                                 replace_with_kernel_inject=True, 
                                 base_dir=repo_root, 
                                 checkpoint=checkpoints_json, 
                                 save_mp_checkpoint_path='/tmp/falcon-40b' if args.save_mp_sharded_ckpt else None
                                 )

input_prompt = [
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
   ]
input_tokens = tokenizer.batch_encode_plus(input_prompt, return_tensors="pt",)
token_num = input_tokens['input_ids'].size(-1)
for t in input_tokens:
    if torch.is_tensor(input_tokens[t]):
        input_tokens[t] = input_tokens[t].to(get_accelerator().current_device_name())
input_tokens.pop('token_type_ids')
sequences = model.generate(**input_tokens, min_length=200, max_length=300, do_sample=True)

if torch.distributed.get_rank() == 0:
    print(f"Result: {tokenizer.batch_decode(sequences, skip_special_tokens=True)[0]}")

You need to create the mp-sharded checkpoints to get the fastest loading time. To do this, pass the --save_mp_sharded_ckpt flag to generate the new checkpoint files, and then rerun the script without the flag. Thanks, Reza

RezaYazdaniAminabadi avatar Jul 06 '23 17:07 RezaYazdaniAminabadi

I actually have a question for you guys: has anyone tested inference of this model on the text_generation_inference system from HuggingFace?

RezaYazdaniAminabadi avatar Jul 06 '23 19:07 RezaYazdaniAminabadi

I actually have a question for you guys: has anyone tested inference of this model on the text_generation_inference system from HuggingFace?

Yes. What information do you need?

lanking520 avatar Jul 06 '23 23:07 lanking520

I actually have a question for you guys: has anyone tested inference of this model on the text_generation_inference system from HuggingFace?

I tried FLAN-T5-XXL on TGI and compared the performance with DeepSpeed (DS) and FasterTransformer (FT) on Deep Java Library (DJL). I used g5.12xlarge on AWS and fixed tensor_parallel_degree=4. For generation_len=256 and batch_size=1, FT takes ~3 s, while DS and TGI double the latency. TGI is known for its continuous batching technique and DJL also has dynamic batching; I didn't test that.

I think FT rewrites everything in CUDA for T5, while DS and TGI probably only rewrite some modules/layers? I guess that causes the latency difference.

ldong87 avatar Jul 07 '23 18:07 ldong87

@RezaYazdaniAminabadi So, for the Falcon kernel you created (06/20): it is faster than TGI's Flash implementation at sequence lengths < 256. The kernel crashes on longer sequence lengths.

We cannot do TGI's continuous batching since DeepSpeed dropped the KV-cache part. I think the next big thing to do is enabling a way to massage the KV cache for LLM inference. This will catch up.

LLAMA is still beating TGI FYI, great job!

lanking520 avatar Jul 07 '23 19:07 lanking520

Thanks for the feedback, it's great to see some of the downsides and benefits of our pipeline; it helps us improve the stack. I just wanted to know whether the slow model-loading problem is solved in their pipeline, so that I can use it!

RezaYazdaniAminabadi avatar Jul 07 '23 19:07 RezaYazdaniAminabadi

@RezaYazdaniAminabadi Hi, why is this PR closed? Is it due to the lack of KV-cache support for Falcon?

Apart from that, I'm interested in supporting meta-tensor loading for Falcon-40B and other models like LLAMA2-70B and GPT-3 in the future, but I don't know how to do that. I think DeepSpeed OnDevice only changes the default device for tensor construction, while the actual loading algorithm still relies on huggingface and accelerate, is that right?

Currently, it fails with deepspeed.OnDevice(device="meta"), and some parameters are still meta tensors after the replace_module process.

dc3671 avatar Aug 14 '23 03:08 dc3671

Hi @dc3671,

I have most of the fixes, however, I wanted to better understand the contributions I am bringing here. I will reopen this soon. Thanks, Reza

RezaYazdaniAminabadi avatar Aug 17 '23 01:08 RezaYazdaniAminabadi

I worked a bit on this PR and added meta-tensor loading support. Also, Falcon-7B is runnable now. I have added a script, test_falcon.py, that you can use to test different models. Here is how I am testing Falcon-7B:

 deepspeed --num_gpus 1 test_falcon.py  --save_mp_sharded_ckpt --model-name falcon-7b --ckpt-root path_to_checkpoints

Next, I am gonna try to test the newest Falcon model (180B). Thanks, Reza

RezaYazdaniAminabadi avatar Sep 07 '23 01:09 RezaYazdaniAminabadi

I added a PR to fix this small problem https://github.com/microsoft/DeepSpeed/pull/4654


Hi @RezaYazdaniAminabadi, did you try last month's latest change to Falcon-40B? They now use an in-repo modelling file, which seems not compatible with DeepSpeed's autoTP algorithm.

dc3671 avatar Nov 08 '23 05:11 dc3671


Hi @RezaYazdaniAminabadi, thanks for your contribution. I used this script and hit the following issue. My environment is deepspeed=0.12.3, transformers=4.34.0, torch=2.0.1, and the instance is a p4de. Could you help me figure out the reason?

[2023-12-08 11:58:57,763] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-12-08 11:58:57,763] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.12.3, git-hash=unknown, git-branch=unknown
[2023-12-08 11:58:57,764] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-12-08 11:58:57,764] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-12-08 11:58:57,768] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-08 11:58:57,769] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-08 11:58:57,769] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/fsx/fsx-dev/junchguo/LLM_Evaluation/JGLUE/test.py", line 166, in <module>
    model = deepspeed.init_inference(model,
  File "/opt/conda/envs/pycopy/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/opt/conda/envs/pycopy/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 160, in __init__
    self._apply_injection_policy(config)
  File "/opt/conda/envs/pycopy/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 411, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/opt/conda/envs/pycopy/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 348, in replace_transformer_layer
    assert container_g.ckpt_load_enabled, \
AttributeError: 'NoneType' object has no attribute 'ckpt_load_enabled'
[2023-12-08 11:58:59,946] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 60450
[2023-12-08 11:58:59,968] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 60451
[2023-12-08 11:58:59,968] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/envs/pycopy/bin/python3.10', '-u', 'test.py', '--local_rank=1'] exits with return code = 1

mynewstart avatar Dec 08 '23 12:12 mynewstart