73 fastertransformer_backend issues

We are trying to run Triton with the FasterTransformer backend on a GKE cluster with A100 GPUs to serve models such as T5 and UL2, which are hosted on Google Cloud Storage...
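
For triage, it can help to separate "server didn't start" from "model failed to load from GCS". A minimal readiness probe, assuming the HTTP endpoint is at the default localhost:8000 and the model is named `fastertransformer` (both assumptions, not taken from the report):

```python
import tritonclient.http as httpclient

# Assumed endpoint and model name; adjust to the actual deployment.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Distinguishes "server not up" from "server up but the model
# failed to load from the gs:// model repository".
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("fastertransformer"))
```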

### Description
It appears that Triton Server with the FasterTransformer backend doesn't work as expected when loading the model repository from S3 (containing both configuration and model weights). Release:...

bug

### Description
The latest FasterTransformer v5.1.1, which is used by the latest fastertransformer_backend release, prescribes that the T5 decoder outputs (output_ids and sequence_length) should be int32...

bug
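
If the dtype mismatch surfaces client-side, an explicit cast is one possible stopgap. A minimal sketch, assuming `result` is a tritonclient `InferResult` from the T5 model and that the output tensor names match the FasterTransformer T5 examples:

```python
import numpy as np

def outputs_as_int32(result):
    """Cast FT T5 outputs to int32.

    `result` is a tritonclient InferResult; the tensor names follow
    the FasterTransformer T5 examples and are an assumption here.
    """
    output_ids = result.as_numpy("output_ids").astype(np.int32, copy=False)
    sequence_length = result.as_numpy("sequence_length").astype(np.int32, copy=False)
    return output_ids, sequence_length
```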

### Description
As defined in the [FasterTransformer T5 guide](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/t5_guide.md), there is an output value for `cross_attentions`. I cannot find any way of returning `cross_attentions` from the FasterTransformer Triton backend for T5...

bug
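
For reference, extra outputs are normally requested through tritonclient as below. Whether the backend's `config.pbtxt` actually exposes a `cross_attentions` tensor is exactly what this issue asks, so that name (and the minimal input set) is an assumption for illustration only:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Token ids and lengths are placeholders; real T5 requests also need
# generation parameters such as max_output_len.
ids = np.array([[37, 423, 1]], dtype=np.uint32)
lens = np.array([[3]], dtype=np.uint32)

inputs = [
    httpclient.InferInput("input_ids", list(ids.shape), "UINT32"),
    httpclient.InferInput("sequence_length", list(lens.shape), "UINT32"),
]
inputs[0].set_data_from_numpy(ids)
inputs[1].set_data_from_numpy(lens)

# This only succeeds if the model's config.pbtxt declares such an output.
outputs = [httpclient.InferRequestedOutput("cross_attentions")]

result = client.infer("fastertransformer", inputs, outputs=outputs)
print(result.as_numpy("cross_attentions"))
```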

### Description
The problem: with "dynamic_batching" enabled, Triton Inference Server sometimes doesn't respond properly, logs "response is nullptr" several times, and sometimes crashes. The model is a pretty standard...

bug
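
For context, dynamic batching is enabled per model in `config.pbtxt`. A typical stanza looks like the following (values are placeholders, not the reporter's actual configuration):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```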

Hey, thanks for providing such a great tool! I noticed that gpt_guide.md mentions a parameter, `output_log_probs`, which records the log probability of logits at each step of sampling. `output_log_probs`...
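
A sketch of how that tensor would be requested, assuming a running server, GPT-style input names as in the FT examples, and that the model's `config.pbtxt` declares `output_log_probs` as an output (all assumptions):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# GPT-style inputs per the FT examples; token ids are placeholders.
ids = np.array([[818, 262, 3726]], dtype=np.uint32)
lens = np.array([[3]], dtype=np.uint32)
out_len = np.array([[16]], dtype=np.uint32)

inputs = []
for name, arr in [("input_ids", ids), ("input_lengths", lens),
                  ("request_output_len", out_len)]:
    t = httpclient.InferInput(name, list(arr.shape), "UINT32")
    t.set_data_from_numpy(arr)
    inputs.append(t)

# Ask for per-step log probabilities alongside the generated ids; this
# requires "output_log_probs" to be declared in the model's config.pbtxt.
outputs = [
    httpclient.InferRequestedOutput("output_ids"),
    httpclient.InferRequestedOutput("output_log_probs"),
]
result = client.infer("fastertransformer", inputs, outputs=outputs)
print(result.as_numpy("output_log_probs"))
```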

### Description
Expected behavior:
```shell
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
>>> tokenizer.encode('')
[50256]
```
### Reproduced Steps
Actual behavior:
```shell
$ cd all_models/gptj/preprocessing/1
$ python
>>>...
```

bug

Hi, thanks for supporting the BLOOM model in the latest release of the fastertransformer backend. I tried the latest code on my 8x A6000 GPU server with 48 GB of RAM per GPU...

### Description
I am trying to optimize T5-small inference using FasterTransformer. I am running on a single V100; I followed all the steps in `t5_guide.md` exactly and got a...

bug

Hi, I am using this backend for inference with the GPT-J model (a [Codegen](https://github.com/salesforce/CodeGen) checkpoint converted to GPT-J format, to be precise). I'm trying to load more than one model instance to...
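
In stock Triton, the number of model instances is raised with an `instance_group` block in `config.pbtxt`; a generic example follows (values are placeholders). Note that the FasterTransformer backend's shipped configs handle GPU assignment internally, so whether simply raising `count` carries over is essentially what this issue is asking:

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```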