How to set the batch size when using the DeepSpeed inference API?
Hi folks,
I followed the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference and wrote the code below to run gpt2-xl inference.
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)
generator = pipeline('text-generation', model='gpt2-xl',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float16,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is the", do_sample=True, min_length=50)

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
It works well on my GPU, and I'm wondering how I can set the batch size for such inference.
Does the DeepSpeed inference API support setting a batch size parameter?
Thank you in advance.
@gtvforever you can pass multiple items to the pipeline like so:
outputs = generator(["DeepSpeed is the", "It's a DeepSpeed kind of summer"], do_sample=True, min_length=50)
If you are looking to use the transformers.pipeline(batch_size=...)
parameter, it should work with DeepSpeed. But note that it can sometimes negatively impact performance (further information here: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching)
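A minimal sketch of that second option, assuming a single GPU and the gpt2-xl model from the question; the pad-token line is an assumption needed so prompts of different lengths can be batched:

import torch
import deepspeed
from transformers import pipeline

# Sketch (assumptions: single GPU, gpt2-xl as in the question) of using the
# pipeline-level batch_size argument together with DeepSpeed inference.
generator = pipeline('text-generation', model='gpt2-xl', device=0)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=1,
                                           dtype=torch.float16,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

# GPT-2 has no pad token by default; batching needs one so shorter prompts can be padded
generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id

prompts = ["DeepSpeed is the", "It's a DeepSpeed kind of summer"]
outputs = generator(prompts, do_sample=True, min_length=50, batch_size=len(prompts))

The fuller example later in the thread also sets left padding on the tokenizer, which matters for correct causal-LM generation when prompts are padded.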
Hi @mrwyattii, thanks for your suggestion. Below are my findings based on your advice.
- Using the batch_size parameter of the API
transformers.pipeline(batch_size=...)
doesn't work. When I tried this parameter with different values (1/8/16), the number of executed CUDA kernels stayed constant, and the average decoder-block performance was the same across batch sizes (perhaps not strictly identical, but the data shows no significant gap between them).
- I also tried code like
outputs = generator(["DeepSpeed is the", "It's a DeepSpeed kind of summer"], do_sample=True, min_length=50)
and used Nsight Systems to dump the kernel execution status. However, it's odd that the batch size doesn't behave as I expected. Specifically, I assumed the number of decoder-block kernels would stay constant, but it increases in proportion to the input batch size. For batch size 1, gpt2-xl launches about 6,000 CUDA kernels for inference; once we set the batch size to 16, we observe about 96,000 kernels executed (16x the batch-size-1 count). Is this expected behavior?
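A minimal sketch of an alternative way to sanity-check these kernel counts, using torch.profiler instead of the Nsight Systems trace; it assumes `generator` is the DeepSpeed-injected pipeline from the original post, and the summed count is only approximate:

import torch
from torch.profiler import profile, ProfilerActivity

prompts = ["DeepSpeed is the"] * 16  # batch of 16 prompts

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    _ = generator(prompts, do_sample=True, min_length=50, batch_size=len(prompts))

# Per-kernel summary, sorted by total GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Approximate total number of CUDA kernel launches
total_kernels = sum(
    evt.count for evt in prof.key_averages()
    if evt.device_type == torch.autograd.DeviceType.CUDA
)
print(f"CUDA kernels launched (approx.): {total_kernels}")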
Hi @gtvforever, this is a bit non-obvious on the HF side. Here is how we've accomplished this (with or without DeepSpeed):
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
model_name = "gpt2-xl"
# Construct input prompts, in this case batch size will be 2
input_prompts = ['DeepSpeed is', "Seattle is in Washington"]
# Construct the tokenizer to encode w. padding if token counts differ across prompts
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)
##
# you can add deepspeed.init_inference here to accelerate the model
##
response = pipe(input_prompts, do_sample=True, batch_size=len(input_prompts))
for r in response:
    print(r)
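For completeness, a sketch of what the DeepSpeed injection at the marked spot could look like, reusing the arguments from the original post (mp_size=1 assumes a single GPU):

import torch
import deepspeed

# Wrap the pipeline's model built above with DeepSpeed inference;
# arguments mirror the original post, mp_size=1 assumes a single GPU.
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=1,
                                      dtype=torch.float16,
                                      replace_method='auto',
                                      replace_with_kernel_inject=True)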
@gtvforever please re-open if this isn't resolved.