
How to set batch size for using deepspeed inference API?

Open gtvforever opened this issue 2 years ago • 2 comments

Hi folks,

I followed the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference and wrote the code below to run gpt2-xl inference.

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
#generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
generator = pipeline('text-generation', model='gpt2-xl',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float16,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is the", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

It works well on my GPU, and I'm wondering how I can set the batch size for such inference.

Does the DeepSpeed inference API support setting a batch size parameter?

Thank you in advance.

gtvforever avatar Jul 08 '22 10:07 gtvforever

@gtvforever you can pass multiple items to the pipeline like so: outputs = generator(["DeepSpeed is the", "It's a DeepSpeed kind of summer"], do_sample=True, min_length=50)

If you are looking to use the transformers.pipeline(batch_size=...) parameter, it should work with DeepSpeed. But note that batching can sometimes negatively impact performance (more information here: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching).
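
For reference, a minimal sketch of both options, reusing the DeepSpeed-initialized generator from the snippet above (the prompts here are just examples):

# Pass a list of prompts; by default the pipeline processes them one at a time.
prompts = ["DeepSpeed is the", "It's a DeepSpeed kind of summer"]
outputs = generator(prompts, do_sample=True, min_length=50)

# To actually batch the forward passes, add the transformers batch_size argument.
# gpt2-xl has no pad token, so set one on the pipeline's tokenizer first.
generator.tokenizer.pad_token = generator.tokenizer.eos_token
outputs = generator(prompts, do_sample=True, min_length=50, batch_size=len(prompts))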

mrwyattii avatar Jul 11 '22 17:07 mrwyattii

Hi @mrwyattii, thanks for your suggestion. Below are my findings based on your advice.

  1. Using the batch_size parameter of transformers.pipeline(batch_size=...) doesn't work. When I tried this parameter with different values (1/8/16), I found the number of executed CUDA kernels stayed constant and the average decoder-block performance was roughly the same across batch sizes (maybe not strictly identical, but the data shows no significant gap between the different batch sizes).
  2. I also tried code like outputs = generator(["DeepSpeed is the", "It's a DeepSpeed kind of summer"], do_sample=True, min_length=50) and used Nsight Systems to dump the kernel execution status (roughly the setup sketched below). However, it's strange that the batch size doesn't seem to behave as expected. Specifically, I assumed the number of decoder-block kernels would stay constant, but it increased in proportion to the input batch size. For batch size 1, gpt2-xl inference launches about 6000 CUDA kernels; once we set the batch size to 16, we observed about 96000 kernels executed (16x the batch-size-1 count). I'm wondering whether this is the expected behavior.
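
For reference, roughly the setup I profiled (the nsys command, prompt list, and BATCH variable are placeholders; generator is the DeepSpeed-initialized pipeline from my first post):

# Run under Nsight Systems, e.g.: nsys profile -o bs16 python infer.py
# BATCH is varied between runs (1, 8, 16) while everything else stays fixed.
BATCH = 16
prompts = ["DeepSpeed is the"] * BATCH
outputs = generator(prompts, do_sample=True, min_length=50)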

gtvforever avatar Jul 12 '22 10:07 gtvforever

Hi @gtvforever, this is a bit non-obvious on the HF side. Here is how we've accomplished this (with or without DeepSpeed):

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "gpt2-xl"

# Construct input prompts, in this case batch size will be 2
input_prompts = ['DeepSpeed is', "Seattle is in Washington"]

# Construct the tokenizer to encode w. padding if token counts differ across prompts
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)

##
# you can add deepspeed.init_inference here to accelerate the model
##

response = pipe(input_prompts, do_sample=True, batch_size=len(input_prompts))

for r in response:
    print(r)
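
For completeness, a sketch of the DeepSpeed step mentioned in the comment above, reusing the init_inference arguments from the original post (placement right after the pipeline is constructed is assumed):

import os
import deepspeed
import torch

# Wrap the pipeline's underlying model with the DeepSpeed inference engine;
# the batched pipe(...) call stays exactly the same afterwards.
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=int(os.getenv('WORLD_SIZE', '1')),
                                      dtype=torch.float16,
                                      replace_method='auto',
                                      replace_with_kernel_inject=True)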

jeffra avatar Dec 02 '22 22:12 jeffra

@gtvforever please re-open if this isn't resolved.

jeffra avatar Dec 12 '22 18:12 jeffra