fastertransformer_backend
Getting empty response from GPT-J Model
Description
- branch: main
- docker_version: 22.12
- gpu: A5000
Reproduced Steps
I created the Docker image and installed the GPT-J model. The model loads fine and the server is running at port 8000.
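Before sending requests I check that the server and model are actually ready. A minimal sketch, assuming the default HTTP endpoint on localhost:8000 and the `fastertransformer` model name used in the script below:

```python
import tritonclient.http as httpclient

# Assumes Triton's default HTTP endpoint and the "fastertransformer"
# model name from the client script below.
client = httpclient.InferenceServerClient("localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("fastertransformer"))
```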
Then I run the following code to run inference against the model:
```python
import time
import numpy as np
import requests
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy

DEFAULT_CONFIG = {
    'protocol': 'http',
    'url': 'localhost:8000',
    'model_name': 'fastertransformer',
    'verbose': False,
}

dtype = "uint32"

GENERATION_CONFIG = {
    "request": [
        {
            "name": "input_ids",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "input_lengths",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "request_output_len",
            "data": [[64]],
            "dtype": dtype
        },
        {
            "name": "beam_search_diversity_rate",
            "data": [[0]],
            "dtype": "float32"
        },
        {
            "name": "temperature",
            "data": [[0.72]],
            "dtype": "float32"
        },
        {
            "name": "repetition_penalty",
            "data": [[1.13]],
            "dtype": "float32"
        },
        {
            "name": "beam_width",
            "data": [[1]],
            "dtype": dtype
        },
        {
            "name": "random_seed",
            "data": [[0]],
            "dtype": "uint64"
        },
        {
            "name": "runtime_top_k",
            "data": [[0]],
            "dtype": dtype
        },
        {
            "name": "runtime_top_p",
            "data": [[0.725]],
            "dtype": "float32"
        },
        {
            "name": "stop_words_list",
            "data": [[[198], [1]]],
            "dtype": "int32"
        },
        {
            "name": "bad_words_list",
            "data": [[[77, 15249, 77], [2, 5, 7]]],
            "dtype": "int32"
        }
    ]
}

padding_side = "left"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)
tokenizer.pad_token_id = 50256
assert tokenizer.pad_token_id == 50256, 'incorrect padding token'
tokenizer.padding_side = padding_side
tokenizer.truncation_side = padding_side


def to_word_list_format(words):
    # Build the [flat_ids, offsets] layout used for stop/bad word lists:
    # shape (batch, 2, max_len), ids padded with 0, offsets padded with -1.
    flat_ids = []
    offsets = []
    item_flat_ids = []
    item_offsets = []
    for word in words:
        ids = tokenizer.encode(word)
        if len(ids) == 0:
            continue
        item_flat_ids += ids
        item_offsets.append(len(ids))
    flat_ids.append(np.array(item_flat_ids))
    offsets.append(np.cumsum(np.array(item_offsets)))
    pad_to = max(1, max(len(ids) for ids in flat_ids))
    for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
        flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0)
        offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1)
    return np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2))


def load_bad_word_ids():
    forbidden = [
        'samplebadword'
    ]
    return to_word_list_format(forbidden)


GENERATION_CONFIG["request"][-1]["data"] = load_bad_word_ids()


def generate_parameters_from_texts(texts, random_seed=None):
    # Tokenize the batch and fill in the per-request tensors from GENERATION_CONFIG.
    params = deepcopy(GENERATION_CONFIG["request"])
    inputs = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=True)
    input_ids = inputs.input_ids
    for index, value in enumerate(params):
        if value['name'] == 'input_ids':
            data = np.array([np.array(data) for data in input_ids], dtype=value['dtype'])
        elif value['name'] == 'input_lengths':
            value_data = [[len(sample_input_ids)] for sample_input_ids in input_ids]
            data = np.array([data for data in value_data], dtype=value['dtype'])
        elif value['name'] == 'random_seed':
            if random_seed is None:
                random_seed = random.randint(0, 10000)
            data = np.array([[random_seed] for _ in range(len(input_ids))], dtype=value['dtype'])
        else:
            data = np.array([data for data in value['data']] * len(input_ids), dtype=value['dtype'])
        params[index] = {
            'name': value['name'],
            'data': data,
        }
    return params


def prepare_tensor(client, name, input):
    t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


def triton_inference(inference_client, texts, random_seed=None):
    request = generate_parameters_from_texts(texts, random_seed)
    payload = [prepare_tensor(httpclient, field['name'], field['data'])
               for field in request]
    result = inference_client.infer(DEFAULT_CONFIG['model_name'], payload)
    output_texts = []
    output_texts_cropped = []
    for i, output in enumerate(result.get_response()['outputs']):
        if output['name'] == "output_ids":
            for output_ids in result.as_numpy(output['name']):
                output_ids = [int(output_id) for output_id in list(output_ids[0])]
                output_texts.append(tokenizer.decode(output_ids, skip_special_tokens=True).strip())
                # Crop off the echoed prompt tokens so only the generated text remains.
                output_texts_cropped.append(
                    tokenizer.decode(
                        output_ids[len(request[0]["data"][i]):], skip_special_tokens=True
                    ).strip()
                )
    return output_texts_cropped


def main():
    client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)
    # INPUT_EXAMPLES = dataset["train"]["text"][:2]
    # example1 = INPUT_EXAMPLES[0]
    # example2 = INPUT_EXAMPLES[1]
    print(
        triton_inference(client, ["I am going"], random_seed=0)
    )
    print(
        triton_inference(client, ["I have a dog name"], random_seed=0)
    )


if __name__ == "__main__":
    main()
```
The output it gives is: `[""]`
Is there any issue with it, or am I running the inference the wrong way?
@byshiue any help?
facing same issue but getting the output as [b'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!']
Do you encounter this error when using the example from gptj_guide.md?
yes
Sorry to barge in here, but I saw these comments and I am able to replicate this behaviour on an NVIDIA GeForce GTX 1080 Ti using the GPT-J guide.
- Branch: origin/dev/t5_gptj_blog
- Version: TRITON_VERSION=22.03
- GPU: NVIDIA GeForce GTX 1080 Ti
- CUDA Version: 11.6.1
Server Snippet:
```
I0209 15:56:12.313059 108 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001
I0209 15:56:12.344709 108 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000
I0209 15:56:12.387174 108 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
I0209 16:00:14.897347 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:14.897382 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:14.897539 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:14.897562 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:17.333362 108 libfastertransformer.cc:836] Stop to forward
I0209 16:00:17.333413 108 libfastertransformer.cc:836] Stop to forward
I0209 16:00:17.333472 108 libfastertransformer.cc:836] Stop to forward
I0209 16:00:17.333564 108 libfastertransformer.cc:836] Stop to forward
```
Client Snippet:
Write any input prompt for the model and press ENTER: my name is [b'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!']
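(Side note: a wall of `!` is what an all-zero `output_ids` tensor decodes to, since token id 0 in the GPT-2 vocabulary, which GPT-J also uses, is `!`. A quick check, assuming the same `gpt2` tokenizer as in the script above:)

```python
from transformers import AutoTokenizer

# Token id 0 in the GPT-2 BPE vocabulary (also used by GPT-J) is "!",
# so an all-zero output tensor decodes to a run of exclamation marks.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.decode([0] * 10))  # !!!!!!!!!!
```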
BTW, Thanks for all the great work with FT backend. Cheers!
Please provide the scripts you used to reproduce the error. (Don't refer to the guide directly, because there are often mismatches between the scripts you actually ran and the guide.)
Thanks @byshiue for your time and for looking into this. I did a point-by-point replication of the guide's IPython notebook to make sure I had a working setup before bringing in my own scripts. Please let me know if I am missing something here.
Edit:
Updating to a newer version of FasterTransformer and CUDA, and building FT for the lower SM 6.1, solved this issue for me. Thanks @byshiue
Hi @vax-dev,
I've been able to reproduce your setup, and I get correct results:
["to make a few comments about the book, but I will begin with an observation. The first is that I have read many books on quantum mechanics and relativity theory over the years. In fact, my dissertation was on Einstein's attempt to unify gravity and electromagnetism. My experience is that these theories are often"]
["Bob. I love him very much and he is the best thing that ever happened to me. But sometimes I get jealous of other dogs. For example, when my friend has another dog named Charlie. It makes me sad because it's not fair for me to be with someone else's dog!"]
Could you try setting the environment variable `export FT_DEBUG_LEVEL=DEBUG` before starting tritonserver? That way we can get better information about what's going on on your machine.
In any case, your bad words list looks wrong, i.e.:
```json
{
    "name": "bad_words_list",
    "data": [[[77, 15249, 77], [2, 5, 7]]],
    "dtype": "int32"
}
```
How did you generate it?
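For reference, the layout expected here is the one the `to_word_list_format` helper in the script above produces: for each batch entry, row 0 holds the flattened token ids of all the words and row 1 holds their cumulative end offsets, padded with -1. A minimal standalone sketch of that conversion (the `words_to_ft_format` name is just for illustration), assuming the same `gpt2` tokenizer as the script above:

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def words_to_ft_format(words):
    # Mirrors to_word_list_format() above for a batch of size 1:
    # row 0 = flattened token ids of every word,
    # row 1 = cumulative end offset of each word, padded with -1.
    # Assumes every word encodes to at least one token.
    flat_ids, offsets = [], []
    for word in words:
        ids = tokenizer.encode(word)
        flat_ids += ids
        offsets.append(len(flat_ids))
    offsets += [-1] * (len(flat_ids) - len(offsets))
    return np.array([[flat_ids, offsets]], dtype="int32")  # shape (1, 2, num_ids)

print(words_to_ft_format(["samplebadword"]))
```

As far as I know, `stop_words_list` uses the same layout.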