How to increase speed of inference
Hi, awesome project!
I am experimenting with "unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit" for inference, running on one A100 GPU with a 16-core CPU. However, inference for a single sentence takes 20+ minutes.
Is there any way to speed it up? Also, is there any way to process multiple text inputs together as a batch to speed things up? Something like:
```python
from airllm import AutoModel

model = AutoModel.from_pretrained("unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit")

# Llama tokenizers ship without a pad token; batching needs one
model.tokenizer.pad_token = model.tokenizer.eos_token

def get_output(input_texts):
    # Tokenize the whole list in one call; padding lets the batch
    # be stacked into a single tensor
    input_tokens = model.tokenizer(input_texts,
                                   return_tensors="pt",
                                   truncation=True,
                                   max_length=128,
                                   padding=True)
    generation_output = model.generate(
        input_tokens['input_ids'].cuda(),
        max_new_tokens=5,
        return_dict_in_generate=True)
    # Decode every sequence in the batch, not just the first one
    for seq in generation_output.sequences:
        print(model.tokenizer.decode(seq, skip_special_tokens=True))

get_output([
    '1+1 =',
    # '20/20+19+4 =?'
    # '50%100='
    # 'derivative of x^2'
])
```
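Separately, I saw the `compression` option mentioned in the AirLLM README and wonder whether it would help with speed here. A minimal sketch of what I have in mind; I'm not sure whether `compression='4bit'` is compatible with a checkpoint that is already bnb-4bit quantized, so treat that as an assumption:

```python
from airllm import AutoModel

# Block-wise quantization of layer weights; per the README this can
# speed up inference at some cost to accuracy. Whether it makes sense
# on top of an already 4-bit checkpoint is an open question.
model = AutoModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit",
    compression='4bit')
```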