DeepSpeed
[BUG] After using the code that supports LLaMA inference, the inference results differ from the original model's
When I asked "Who is founder of goolge.com?", LLaMA-13B answered as shown below:
“tro tro tro tro tro tro tro tro [... the token “tro” repeated for most of the output ...] accur accur accur accur [... the token “accur” repeated until the output was cut off ...] accu....”
It was run on two V100s, and the configuration was as follows:
model = deepspeed.init_inference(
    model=model,
    mp_size=2,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
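For reference, a script configured this way is typically started with the DeepSpeed launcher so that both GPUs are used; a minimal sketch (the script name is a placeholder):
deepspeed --num_gpus 2 inference.py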
Hi, I can reproduce.
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.llama import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = deepspeed.init_inference(
model,
mp_size=2,
dtype=torch.half,
replace_with_kernel_inject=True
)
batch = tokenizer(
"The primary use of LLaMA is research on large language models, including",
return_tensors="pt",
add_special_tokens=False
)
batch = {k: v.cuda() for k, v in batch.items()}
generated = model.generate(batch["input_ids"], max_length=100)
print(tokenizer.decode(generated[0]))
Output:
The primary use of LLaMA is research on large language models, including Exchange Exchange Exchange Exchangedependenciesdependenciesdependencies [... the token “dependencies” repeated for the rest of the 100-token output ...]
This was using deepspeed @ 4e886f0568832d292183926bcc1a9105def25f2c
Hello, I can only run inference on one string at a time. How can I use something like "while True: prompts = input() ..." to keep prompting in a loop? Please help me, I'm really confused.
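A minimal sketch of an interactive loop, reusing the tokenizer/model set up in the reproduction above (the prompt text and max_length are just illustrative; with mp_size > 1 every rank must call generate(), so a prompt read on rank 0 would also need to be broadcast to the other ranks):
while True:
    prompt = input("Prompt (empty line to quit): ")
    if not prompt:
        break
    batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    batch = {k: v.cuda() for k, v in batch.items()}
    # Every rank must reach generate(); decoding/printing only needs to happen once.
    generated = model.generate(batch["input_ids"], max_length=100)
    print(tokenizer.decode(generated[0]))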