llama3

How do models do batch inference when using the transformers method?

code-isnot-cold opened this issue 1 year ago • 14 comments

I am a noob. Here is my code; how can I modify it to do batch inference?


import torch
import transformers

def load_model():
    model_id = 'llama3/Meta-Llama-3-70B-Instruct'
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    # return tokenizer, pipeline
    return pipeline

def get_response(pipeline, system_prompt, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=4096,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    return outputs

code-isnot-cold avatar Apr 22 '24 11:04 code-isnot-cold

@ArthurZucker would passing a list of messages make this work?

HamidShojanazeri avatar Apr 24 '24 17:04 HamidShojanazeri

It doesn't seem to work.
Reasons:

  1. Inference time is the same as a single inference.
  2. Console warnings appear one by one, from which it can be inferred that the prompts are processed one at a time.

Here is the code for batch inference:

import torch
import transformers

def load_model():
    model_id = '/home/pengwj/programs/llama3/Meta-Llama-3-70B-Instruct'
    # tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    return pipeline

# batch_system_prompt = [[],[],[],[]]; sections = [[],[],[],[]]
messages = [[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}] for system_prompt, user_prompt in zip(batch_system_prompt, sections)]

# prompt =  [[],[],[],[]]
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

code-isnot-cold avatar Apr 25 '24 01:04 code-isnot-cold

cc @Rocketknight1

ArthurZucker avatar May 23 '24 09:05 ArthurZucker

Hi @code-isnot-cold, great question! The short answer is that the text generation pipeline will only generate one sample at a time, so you won't gain any benefit from batching samples together. If you want to generate in a batch, you'll need to use the lower-level method model.generate() instead, and it's slightly more complex. However, you can definitely get performance benefits from it.

You'll need to tokenize with padding_side="left" and padding="longest", and you'll need to set a pad_token_id. The reason is that the sequences in a batch have different lengths, and because decoder-only models generate from the end of the sequence, the padding has to go on the left. Try this code snippet:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

input1 = [{"role": "user", "content": "Hi, how are you?"}]
input2 = [{"role": "user", "content": "How are you feeling today?"}]
texts = tokenizer.apply_chat_template([input1, input2], add_generation_prompt=True, tokenize=False)

tokenizer.pad_token_id = tokenizer.eos_token_id  # Set a padding token
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.to(model.device) for key, val in inputs.items()}

model.generate(**inputs, max_new_tokens=512)
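
To turn the generated IDs back into text, one option (a minimal sketch, assuming the inputs, model, and tokenizer defined above) is to keep only the newly generated tokens and batch-decode them:

outputs = model.generate(**inputs, max_new_tokens=512)
# Every row starts with the (left-padded) prompt, which is input_ids.shape[1] tokens long,
# so slice it off and decode only the continuation.
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))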

Rocketknight1 avatar May 23 '24 13:05 Rocketknight1

Thank you for your detailed explanation @Rocketknight1. I have started using the vLLM method, which enables efficient inference, but I'll also try the model.generate() method for batch generation. Thanks again for your help @ArthurZucker

code-isnot-cold avatar May 24 '24 07:05 code-isnot-cold

my pleasure! 🤗

ArthurZucker avatar May 24 '24 12:05 ArthurZucker

I wrote my code based on @Rocketknight1's. I am a transformers beginner, so I hope there isn't any bug in my code. Code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model_id = "/path/to/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side = "left")
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

myinput=[
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]
texts = tokenizer.apply_chat_template(myinput, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.cuda() for key, val in inputs.items()}
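# Decode the (left-padded) prompts so their text can be stripped from the generated output later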
temp_texts=tokenizer.batch_decode(inputs["input_ids"], skip_special_tokens=True)

start_time = time.time()
gen_tokens = model.generate(
    **inputs, 
    max_new_tokens=512, 
    pad_token_id=tokenizer.eos_token_id, 
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
print(f"Time: {time.time()-start_time}")

gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
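# Remove the prompt text from each decoded sequence, keeping only the newly generated part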
gen_text = [i[len(temp_texts[idx]):] for idx, i in enumerate(gen_text)]
print(gen_text)

Output:

Time: 2.219297409057617
['2', 'C++ is a powerful, compiled, object-oriented programming language.', 'George Washington, first president of the United States.', 'The capital of France is Paris.', 'Scattered sunlight by tiny molecules in atmosphere.', 'To find purpose, happiness, and fulfillment through experiences.', 'Immerse yourself in the language through listening and speaking.', "In your area's dormant season, typically late winter or early spring.", 'Poach it in simmering water for a perfect yolk.', 'There is no single "best" language, it depends on context.']

mirrorboat avatar May 26 '24 07:05 mirrorboat

Thank you for your detailed explanation @Rocketknight1. I have started using the vLLM method, which enables efficient inference, but I'll also try the model.generate() method for batch generation. Thanks again for your help @ArthurZucker

Would you please share your llama3 vLLM inference code? I've searched https://github.com/meta-llama/llama-recipes but failed to find a suitable script.

mirrorboat avatar May 29 '24 12:05 mirrorboat

Sure, here is a reference: https://docs.vllm.ai/en/stable/getting_started/quickstart.html. I find that vLLM seems to perform worse than the transformers method for batch inference. Maybe there is something wrong with my code; please share what you find after trying it.

from vllm import SamplingParams, LLM
import time

model_id = "/path/to/Meta-Llama-3-70B-Instruct"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512,)
llm = LLM(model=model_id, tensor_parallel_size=4)

tokenizer = llm.get_tokenizer()
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eos_token_id
llm.set_tokenizer(tokenizer)

prompts = [
    "1 + 1 = ",
    "Introduce C++ in one short sentence less than 10 words.",
    "Who was the first president of the United States? Answer in less than 10 words.",
    "What is the capital of France ? Answer in less than 10 words.",
    "Why is the sky blue ? Answer in less than 10 words.",
    "What is the meaning of life? Answer in less than 10 words.",
    "What is the best way to learn a new language? Answer in less than 10 words.",
    "When is the best time to plant a tree? Answer in less than 10 words.",
    "What is the best way to cook an egg? Answer in less than 10 words.",
    "Which is the best programming language? Answer in less than 10 words."
]

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print(f"Time: {time.time() - start_time}")

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

code-isnot-cold avatar May 29 '24 13:05 code-isnot-cold

@code-isnot-cold see https://github.com/vllm-project/vllm/issues/4180#issuecomment-2066004748 and https://github.com/vllm-project/vllm/issues/4180#issuecomment-2074017550. Here is the code:

from vllm import SamplingParams, LLM

model_path = "/path/to/Meta-Llama-3-8B-Instruct"

model = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=1,
)
tokenizer = model.get_tokenizer()

myinput=[
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]

conversations = tokenizer.apply_chat_template(
    myinput,
    tokenize=False,
)

outputs = model.generate(
    conversations,
    SamplingParams(
        temperature=0.6,
        top_p=0.9,
        max_tokens=512,
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],  # KEYPOINT HERE
    )
)

for output in outputs:
    # prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{generated_text!r}")

mirrorboat avatar May 29 '24 15:05 mirrorboat

I read the issue and tried your code, which worked perfectly. Thank you for your contribution

code-isnot-cold avatar May 30 '24 15:05 code-isnot-cold