llama3
How do models do batch inference when using the transformers method?
I am a beginner. Here is my code; how can I modify it to do batch inference?
def load_model():
    model_id = 'llama3/Meta-Llama-3-70B-Instruct'
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    # return tokenizer, pipeline
    return pipeline

def get_response(pipeline, system_prompt, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    outputs = pipeline(
        prompt,
        max_new_tokens=4096,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
@ArthurZucker, would passing a list of messages work?
It doesn't seem to work.
Reasons:
- Inference time is the same as running the prompts one at a time,
- console warnings appear one by one, which suggests the prompts are processed sequentially rather than as a batch.
Here is the code for batch inference:
def load_model():
    model_id = '/home/pengwj/programs/llama3/Meta-Llama-3-70B-Instruct'
    # tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    return pipeline
# batch_system_prompt = [[],[],[],[]] ; sections = [[],[],[],[]]
messages = [[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}] for system_prompt, user_prompt in zip(batch_system_prompt, sections)]
# prompt =  [[],[],[],[]]
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
    prompt,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
cc @Rocketknight1
Hi @code-isnot-cold, great question! The short answer is that the text generation pipeline will only generate one sample at a time, so you won't gain any benefit from batching samples together. If you want to generate in a batch, you'll need to use the lower-level method model.generate() instead, and it's slightly more complex. However, you can definitely get performance benefits from it.
You'll need to tokenize with padding_side="left" and padding="longest", and you'll need to set a pad_token_id. The reason for this is that the sequences will have different lengths when you batch them together. Try this code snippet:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")  # left-pad so generated tokens line up at the end of each row
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
input1 = [{"role": "user", "content": "Hi, how are you?"}]
input2 = [{"role": "user", "content": "How are you feeling today?"}]
texts = tokenizer.apply_chat_template([input1, input2], add_generation_prompt=True, tokenize=False)
tokenizer.pad_token_id = tokenizer.eos_token_id  # Set a padding token
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.to(model.device) for key, val in inputs.items()}
gen_ids = model.generate(**inputs, max_new_tokens=512)
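To actually read the generations, here is a small follow-up of my own (not part of the original reply; prompt_len is a name I introduced): with padding="longest" and left padding, every row's prompt occupies the first input_ids.shape[1] positions of the output, so you can slice those off before decoding.

prompt_len = inputs["input_ids"].shape[1]  # shared (padded) prompt length for every row
print(tokenizer.batch_decode(gen_ids[:, prompt_len:], skip_special_tokens=True))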
Thank you for your detailed explanation @Rocketknight1. I have started using vLLM, which enables efficient inference, but I'll also try the model.generate() method for batch generation. Thanks again for your help @ArthurZucker.
my pleasure! 🤗
I wrote my code based on @Rocketknight1's. I'm a transformers beginner, so I hope there aren't any bugs in it. Code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
model_id = "/path/to/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side = "left")
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
myinput=[
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]
texts = tokenizer.apply_chat_template(myinput, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.cuda() for key, val in inputs.items()}
temp_texts=tokenizer.batch_decode(inputs["input_ids"], skip_special_tokens=True)
start_time = time.time()
gen_tokens = model.generate(
    **inputs, 
    max_new_tokens=512, 
    pad_token_id=tokenizer.eos_token_id, 
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
print(f"Time: {time.time()-start_time}")
gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
gen_text = [i[len(temp_texts[idx]):] for idx, i in enumerate(gen_text)]
print(gen_text)
Output:
Time: 2.219297409057617
['2', 'C++ is a powerful, compiled, object-oriented programming language.', 'George Washington, first president of the United States.', 'The capital of France is Paris.', 'Scattered sunlight by tiny molecules in atmosphere.', 'To find purpose, happiness, and fulfillment through experiences.', 'Immerse yourself in the language through listening and speaking.', "In your area's dormant season, typically late winter or early spring.", 'Poach it in simmering water for a perfect yolk.', 'There is no single "best" language, it depends on context.']
Would you please share your llama3 vLLM inference code? I've searched https://github.com/meta-llama/llama-recipes but failed to find a suitable script.
Sure. Here is a reference: https://docs.vllm.ai/en/stable/getting_started/quickstart.html. I find that vLLM seems to perform worse than the transformers method for batch inference. Maybe there is something wrong with my code; please share your results after trying it.
from vllm import SamplingParams, LLM
import time
model_id = "/path/to/Meta-Llama-3-70B-Instruct"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512,)
llm = LLM(model=model_id, tensor_parallel_size=4)
tokenizer = llm.get_tokenizer()
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eos_token_id
llm.set_tokenizer(tokenizer)
prompts = [
    "1 + 1 = ",
    "Introduce C++ in one short sentence less than 10 words.",
    "Who was the first president of the United States? Answer in less than 10 words.",
    "What is the capital of France ? Answer in less than 10 words.",
    "Why is the sky blue ? Answer in less than 10 words.",
    "What is the meaning of life? Answer in less than 10 words.",
    "What is the best way to learn a new language? Answer in less than 10 words.",
    "When is the best time to plant a tree? Answer in less than 10 words.",
    "What is the best way to cook an egg? Answer in less than 10 words.",
    "Which is the best programming language? Answer in less than 10 words."
]
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print(f"Time: {time.time() - start_time}")
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Here @code-isnot-cold: https://github.com/vllm-project/vllm/issues/4180#issuecomment-2066004748 and https://github.com/vllm-project/vllm/issues/4180#issuecomment-2074017550
from vllm import SamplingParams, LLM
model_path = "/path/to/Meta-Llama-3-8B-Instruct"
model = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=1,
)
tokenizer = model.get_tokenizer()
myinput=[
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]
conversations = tokenizer.apply_chat_template(
    myinput,
    tokenize=False,
)
outputs = model.generate(
    conversations,
    SamplingParams(
        temperature=0.6,
        top_p=0.9,
        max_tokens=512,
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],  # KEYPOINT HERE
    )
)
for output in outputs:
    # prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{generated_text!r}")
I read the issue and tried your code, which worked perfectly. Thank you for your contribution!