fastertransformer_backend
Getting empty response from GPT-J Model
Description
- branch: main
- docker_version: 22.12
- gpu: A5000
Reproduced Steps
I created the Docker image and installed the GPT-J model. The model loads fine and the server is running at port 8000.
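Before sending requests I check that the server and model are actually ready. A minimal sketch, assuming the default HTTP endpoint on localhost:8000 and the `fastertransformer` model name used in the script below:

```python
import tritonclient.http as httpclient

# Assumes Triton's default HTTP endpoint and the "fastertransformer"
# model name from the client script below.
client = httpclient.InferenceServerClient("localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("fastertransformer"))
```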
Then I run the following code to run inference against the model:
```python
import time
import numpy as np
import requests
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy

DEFAULT_CONFIG = {
    'protocol': 'http',
    'url': 'localhost:8000',
    'model_name': 'fastertransformer',
    'verbose': False,
}

dtype = "uint32"

GENERATION_CONFIG = {
    "request": [
        {
            "name": "input_ids",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "input_lengths",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "request_output_len",
            "data": [[64]],
            "dtype": dtype
        },
        {
            "name": "beam_search_diversity_rate",
            "data": [[0]],
            "dtype": "float32"
        },
        {
            "name": "temperature",
            "data": [[0.72]],
            "dtype": "float32"
        },
        {
            "name": "repetition_penalty",
            "data": [[1.13]],
            "dtype": "float32"
        },
        {
            "name": "beam_width",
            "data": [[1]],
            "dtype": dtype
        },
        {
            "name": "random_seed",
            "data": [[0]],
            "dtype": "uint64"
        },
        {
            "name": "runtime_top_k",
            "data": [[0]],
            "dtype": dtype
        },
        {
            "name": "runtime_top_p",
            "data": [[0.725]],
            "dtype": "float32"
        },
        {
            "name": "stop_words_list",
            "data": [[[198], [1]]],
            "dtype": "int32"
        },
        {
            "name": "bad_words_list",
            "data": [[[77, 15249, 77], [2, 5, 7]]],
            "dtype": "int32"
        }
    ]
}

padding_side = "left"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)
tokenizer.pad_token_id = 50256
assert tokenizer.pad_token_id == 50256, 'incorrect padding token'
tokenizer.padding_side = padding_side
tokenizer.truncation_side = padding_side


def to_word_list_format(words):
    # Build the [flat_ids, offsets] layout used for stop/bad word lists:
    # shape (batch, 2, max_len), ids padded with 0, offsets padded with -1.
    flat_ids = []
    offsets = []
    item_flat_ids = []
    item_offsets = []
    for word in words:
        ids = tokenizer.encode(word)
        if len(ids) == 0:
            continue
        item_flat_ids += ids
        item_offsets.append(len(ids))
    flat_ids.append(np.array(item_flat_ids))
    offsets.append(np.cumsum(np.array(item_offsets)))
    pad_to = max(1, max(len(ids) for ids in flat_ids))
    for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
        flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0)
        offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1)
    return np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2))


def load_bad_word_ids():
    forbidden = [
        'samplebadword'
    ]
    return to_word_list_format(forbidden)


GENERATION_CONFIG["request"][-1]["data"] = load_bad_word_ids()


def generate_parameters_from_texts(texts, random_seed=None):
    # Tokenize the batch and fill in the per-request tensors from GENERATION_CONFIG.
    params = deepcopy(GENERATION_CONFIG["request"])
    inputs = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=True)
    input_ids = inputs.input_ids
    for index, value in enumerate(params):
        if value['name'] == 'input_ids':
            data = np.array([np.array(data) for data in input_ids], dtype=value['dtype'])
        elif value['name'] == 'input_lengths':
            value_data = [[len(sample_input_ids)] for sample_input_ids in input_ids]
            data = np.array([data for data in value_data], dtype=value['dtype'])
        elif value['name'] == 'random_seed':
            if random_seed is None:
                random_seed = random.randint(0, 10000)
            data = np.array([[random_seed] for _ in range(len(input_ids))], dtype=value['dtype'])
        else:
            data = np.array([data for data in value['data']] * len(input_ids), dtype=value['dtype'])
        params[index] = {
            'name': value['name'],
            'data': data,
        }
    return params


def prepare_tensor(client, name, input):
    t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


def triton_inference(inference_client, texts, random_seed=None):
    request = generate_parameters_from_texts(texts, random_seed)
    payload = [prepare_tensor(httpclient, field['name'], field['data'])
               for field in request]
    result = inference_client.infer(DEFAULT_CONFIG['model_name'], payload)
    output_texts = []
    output_texts_cropped = []
    for i, output in enumerate(result.get_response()['outputs']):
        if output['name'] == "output_ids":
            for output_ids in result.as_numpy(output['name']):
                output_ids = [int(output_id) for output_id in list(output_ids[0])]
                output_texts.append(tokenizer.decode(output_ids, skip_special_tokens=True).strip())
                # Crop off the echoed prompt tokens so only the generated text remains.
                output_texts_cropped.append(
                    tokenizer.decode(
                        output_ids[len(request[0]["data"][i]):], skip_special_tokens=True
                    ).strip()
                )
    return output_texts_cropped


def main():
    client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)
    # INPUT_EXAMPLES = dataset["train"]["text"][:2]
    # example1 = INPUT_EXAMPLES[0]
    # example2 = INPUT_EXAMPLES[1]
    print(
        triton_inference(client, ["I am going"], random_seed=0)
    )
    print(
        triton_inference(client, ["I have a dog name"], random_seed=0)
    )


if __name__ == "__main__":
    main()
```
The output it gives is: `[""]`
Is there any issue with it, or am I running the inference the wrong way?
@byshiue any help?
facing same issue but getting the output as [b'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!']
Do you encounter this error when using the example from gptj_guide.md?
yes
Sorry to barge in here, but I saw these comments and I am able to replicate this behaviour on an NVIDIA GeForce GTX 1080 Ti using the GPT-J guide.
- Branch: origin/dev/t5_gptj_blog
- Version: TRITON_VERSION=22.03
- GPU: NVIDIA GeForce GTX 1080 Ti
- CUDA Version: 11.6.1
Server Snippet:
```
I0209 15:56:12.313059 108 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001
I0209 15:56:12.344709 108 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000
I0209 15:56:12.387174 108 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
I0209 16:00:14.897347 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:14.897382 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:14.897539 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:14.897562 108 libfastertransformer.cc:834] Start to forward
I0209 16:00:17.333362 108 libfastertransformer.cc:836] Stop to forward
I0209 16:00:17.333413 108 libfastertransformer.cc:836] Stop to forward
I0209 16:00:17.333472 108 libfastertransformer.cc:836] Stop to forward
I0209 16:00:17.333564 108 libfastertransformer.cc:836] Stop to forward
```
Client Snippet:
Write any input prompt for the model and press ENTER: my name is [b'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!']
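(Side note: a wall of `!` is what an all-zero `output_ids` tensor decodes to, since token id 0 in the GPT-2 vocabulary, which GPT-J also uses, is `!`. A quick check, assuming the same `gpt2` tokenizer as in the script above:)

```python
from transformers import AutoTokenizer

# Token id 0 in the GPT-2 BPE vocabulary (also used by GPT-J) is "!",
# so an all-zero output tensor decodes to a run of exclamation marks.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.decode([0] * 10))  # !!!!!!!!!!
```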
BTW, Thanks for all the great work with FT backend. Cheers!
Please provide the scripts you used to reproduce the error. (Don't refer to the guide directly, because there are often mismatches between the scripts you actually ran and the guide.)
Thanks @byshiue for your time and for looking into this. I did a point-by-point replication of the guide's IPython notebook to make sure I had a working setup before bringing in my own scripts. Please let me know if I am missing something here.
Edit:
Updating to a newer version of FasterTransformer and CUDA, and building FT for the lower SM 6.1, solved this issue for me. Thanks @byshiue
Hi @vax-dev,
I've been able to reproduce your setup, and I get correct results:
["to make a few comments about the book, but I will begin with an observation. The first is that I have read many books on quantum mechanics and relativity theory over the years. In fact, my dissertation was on Einstein's attempt to unify gravity and electromagnetism. My experience is that these theories are often"]
["Bob. I love him very much and he is the best thing that ever happened to me. But sometimes I get jealous of other dogs. For example, when my friend has another dog named Charlie. It makes me sad because it's not fair for me to be with someone else's dog!"]
Could you try setting the environment variable `export FT_DEBUG_LEVEL=DEBUG` before starting tritonserver? That way we can get better information about what's going on on your machine.
In any case, your bad words list looks wrong, i.e.:
```json
{
    "name": "bad_words_list",
    "data": [[[77, 15249, 77], [2, 5, 7]]],
    "dtype": "int32"
}
```
How did you generate it?
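For reference, the layout expected here is the one the `to_word_list_format` helper in the script above produces: for each batch entry, row 0 holds the flattened token ids of all the words and row 1 holds their cumulative end offsets, padded with -1. A minimal standalone sketch of that conversion (the `words_to_ft_format` name is just for illustration), assuming the same `gpt2` tokenizer as the script above:

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def words_to_ft_format(words):
    # Mirrors to_word_list_format() above for a batch of size 1:
    # row 0 = flattened token ids of every word,
    # row 1 = cumulative end offset of each word, padded with -1.
    # Assumes every word encodes to at least one token.
    flat_ids, offsets = [], []
    for word in words:
        ids = tokenizer.encode(word)
        flat_ids += ids
        offsets.append(len(flat_ids))
    offsets += [-1] * (len(flat_ids) - len(offsets))
    return np.array([[flat_ids, offsets]], dtype="int32")  # shape (1, 2, num_ids)

print(words_to_ft_format(["samplebadword"]))
```

As far as I know, `stop_words_list` uses the same layout.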